Thomas Vliagkoftis authored
Unified Data Structure db scripts
Contains:
- script to import file data into the db
- db queries, as Python/R & JSON files or embedded in Python or R scripts
- sample code showing how to use them
Install
- Read and apply first-time git setup instructions.
- Clone git repo:
  git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git
- Create a new secrets.py file and paste in the db credentials provided to you by your db admin.
- Run this bash command to append the scripts path to $PATH and $PYTHONPATH:
  source ./install.sh
- Create and activate a new virtual environment using the environment.yml file:
  conda env create -f environment.yml
  conda activate uds-scripts
  For more info, see the Conda documentation on managing environments and the jeodpp PyCharm/Conda documentation.
- Set up the Conda virtual environment as your PyCharm interpreter. Follow the steps: File -> Settings -> Project -> Python Interpreter -> press the gear symbol and select "Add" -> Conda Environment -> click "Existing environment" -> select uds-scripts
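The install steps can be sanity-checked with a few lines of Python. This is only a sketch under assumptions: `check_setup` is a hypothetical helper, not part of uds-scripts, and it assumes that secrets.py sits in the repo root and that install.sh adds the repo directory (or a subdirectory of it) to $PYTHONPATH. `CONDA_DEFAULT_ENV` is the variable conda sets when an environment is activated.

```python
import os
from pathlib import Path


def check_setup(repo_dir, env=None):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper: the location of secrets.py and the exact paths added
    by install.sh are assumptions, adjust them to the real repo layout.
    """
    env = os.environ if env is None else env
    repo = Path(repo_dir).resolve()
    problems = []

    # Step "create secrets.py": assumed to live in the repo root.
    if not (repo / "secrets.py").is_file():
        problems.append("secrets.py not found in the repo directory")

    # Step "source ./install.sh": some PYTHONPATH entry should point into the repo.
    entries = [Path(p).resolve() for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    if not any(p == repo or repo in p.parents for p in entries):
        problems.append("no PYTHONPATH entry points into the repo; run `source ./install.sh`")

    # Step "conda activate uds-scripts": conda exports the active env name.
    if env.get("CONDA_DEFAULT_ENV") != "uds-scripts":
        problems.append("conda environment 'uds-scripts' is not active")

    return problems


# Example: run from the repo root after installing.
# for problem in check_setup(Path.cwd()):
#     print("WARNING:", problem)
```

An empty result does not prove that the credentials themselves are valid; it only shows that the three local steps were carried out.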
Usage
- cd to a directory inside your home-dir where you are going to work, e.g.:
  /home/{user}/{foo}
  (this may be a new one)
- --(process db-data)--
  - run scripts & queries to get the data from the db (and possibly create files)
  - BASH-HISTORY is your friend
  - modify script-files
  - update db (only for users with db admin credentials)
  - run scripts & queries to get the data from the db (and possibly create files)
- copy/move specific files for the NextCloud agent to pick them up, from:
  /home/{user}/Documents/{foo} --> /home/{user}/{foo}/Documents --> /eos/jeodpp/home/users/{user}/{bar} --> /eos/jeodpp/data/projects/LEGENT/transfer/{baz}
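The last step, copying result files into a folder that the NextCloud agent watches, could be scripted roughly as follows. `publish` is a hypothetical helper name, not something shipped with uds-scripts, and the destination in the usage comment is only an example built from the path patterns above.

```python
import shutil
from pathlib import Path


def publish(files, dest_dir):
    """Copy each file into dest_dir (created if missing) and return the new paths.

    Hypothetical helper for illustration; it copies rather than moves, so the
    working copies in your work directory stay untouched.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in files:
        target = dest / Path(f).name
        shutil.copy2(f, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied


# Example (paths are placeholders):
# publish(Path.cwd().glob("*.parquet"), Path.home() / "Documents" / "foo")
```

Files with the same name in the destination are overwritten, so use distinct file names when several runs publish to the same folder.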
Sample code
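from db import eea_2020_flattened
from utils import save_to_parquet, save_to_excel  # assumes save_to_excel also lives in utils
from config import Config
import pandas as pd

# load everything returned by find() into a data frame
df = pd.DataFrame.from_records(eea_2020_flattened.find())
... # process your data
# write the result to the shared save path
save_to_parquet(df, Config.PUBLIC_SAVE_PATH, 'test.parquet')
save_to_excel(df, Config.PUBLIC_SAVE_PATH, 'test.xlsx')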
TIP: Don't store your pandas data frames in CSVs: they are big, slow and lose precision.
Use Excel files when sharing data: they are also big, but they keep precision and are not very slow. Or preferably, store them in parquet.
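The precision remark can be illustrated without pandas. CSV is plain text, so a float survives only as well as it was formatted when written, and every value has to be re-parsed on read. The sketch below uses only the standard library; `csv_round_trip` is a hypothetical helper for demonstration, not part of uds-scripts.

```python
import csv
import io


def csv_round_trip(values, float_format=None):
    """Write floats to an in-memory CSV and read them back.

    With float_format=None the csv module writes the full repr of each float;
    with e.g. ".6f" the values are rounded before they reach the file.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    for v in values:
        writer.writerow([v if float_format is None else format(v, float_format)])
    buf.seek(0)
    # CSV stores no types: each cell comes back as a string and must be parsed.
    return [float(row[0]) for row in csv.reader(buf)]


values = [0.1 + 0.2, 1 / 3]
exact = csv_round_trip(values)          # full repr: values survive
rounded = csv_round_trip(values, ".6f")  # fixed decimals: digits are lost
print(exact == values, rounded == values)
```

So whether a CSV loses precision depends on how it was written, which is easy to get wrong when files pass through several tools. Binary formats such as parquet store the values and their types directly.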
Maybe it's not the best idea to use our BDAP home-dirs for experimenting with big data: https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage
NextCloud mapping of folders
BDAP                                            PC-folder of NextCloud
=========================================       =========================
/home/{user}/Documents                     <--> ~/BDAPCloud/data
/eos/jeodpp/home/users/{user}              <--> ~/BDAPCloud/eos_user_home
/eos/jeodpp/data/projects/LEGENT/transfer  <--> ~/BDAPCloud/eos_LEGENT
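The mapping above can be expressed as a small lookup, for example to tell a colleague where a file saved on BDAP will show up on their PC. The function names are hypothetical and only the three folder pairs listed above are covered; `~` is left unexpanded because it refers to the home directory on the PC, not on BDAP.

```python
from pathlib import PurePosixPath


def nextcloud_mapping(user):
    """Return the BDAP-root -> PC-root pairs from the table, for one user."""
    return {
        PurePosixPath(f"/home/{user}/Documents"): PurePosixPath("~/BDAPCloud/data"),
        PurePosixPath(f"/eos/jeodpp/home/users/{user}"): PurePosixPath("~/BDAPCloud/eos_user_home"),
        PurePosixPath("/eos/jeodpp/data/projects/LEGENT/transfer"): PurePosixPath("~/BDAPCloud/eos_LEGENT"),
    }


def to_pc_path(bdap_path, user):
    """Translate a BDAP path into its NextCloud folder on the PC.

    Returns None when the path lies outside every synced folder.
    """
    path = PurePosixPath(bdap_path)
    for bdap_root, pc_root in nextcloud_mapping(user).items():
        try:
            return pc_root / path.relative_to(bdap_root)
        except ValueError:
            continue  # not under this root, try the next one
    return None


# Example with a placeholder user name:
print(to_pc_path("/home/alice/Documents/foo/test.parquet", "alice"))
```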