# Unified Data Structure db scripts

Contains:

* script to import file data to db
* db queries, as Python/R & JSON files or embedded in Python or R scripts
* sample code showing how to use them

## Install

* Read and apply [first-time git setup instructions](./GIT_SETUP.md).
* Clone git repo:

  ```bash
  git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git
  ```

* Create a new `secrets.py` file and paste the db credentials that will be provided to you by your db admin.
* Run this bash command to append the scripts path to `$PATH` and `$PYTHONPATH`:

  ```bash
  source ./install.sh
  ```

* Create and activate a new virtual environment using the `environment.yml` file:

  ```bash
  conda env create -f environment.yml
  conda activate uds-scripts
  ```

  For more info on managing Conda environments click [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
  For the jeodpp pycharm/conda documentation click [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).

* Set up the Conda virtual environment as your PyCharm interpreter.
  Follow the steps: File -> Settings -> Project -> Python Interpreter -> Press gear symbol and select "Add" -> Conda Environment -> Click "Existing environment" -> Select uds-scripts

## Usage

* `cd` to a directory where you are going to work, inside your home dir, e.g. `/home/{user}/{foo}` (this may be a new one)
* --(process db-data)--
* run scripts & queries to get the data from the db (and possibly create files)
* BASH-HISTORY is your friend
* modify script files
* update db (only for users with db admin credentials)
* copy/move specific files for the NextCloud agent to pick them up, from:

  ```
  /home/{user}/Documents/{foo}
  --> /home/{user}/{foo}/Documents
  --> /eos/jeodpp/home/users/{user}/{bar}
  --> /eos/jeodpp/data/projects/LEGENT/transfer/{baz}
  ```

### Sample code

```python
from db import eea_2020_flattened
# save_to_excel is assumed to live in utils alongside save_to_parquet
from utils import save_to_parquet, save_to_excel
from config import Config
import pandas as pd

# Load every record of the collection into a DataFrame
df = pd.DataFrame.from_records(eea_2020_flattened.find())
...  # process your data

# Write the result to the shared save path in both formats
save_to_parquet(df, Config.PUBLIC_SAVE_PATH, 'test.parquet')
save_to_excel(df, Config.PUBLIC_SAVE_PATH, 'test.xlsx')
```

> **TIP:** Don't store your pandas DataFrames as CSVs: they are big, slow, and lose precision.
>
> Use Excel files when sharing data; they are also big, but they keep precision and are not very slow. Or, preferably, store them in parquet.
>
> Maybe it's not the best idea to use our BDAP home dirs for experimenting with big data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage

### NextCloud mapping of folders

```
BDAP                                           PC-folder of NextCloud
=========================================      =========================
/home/{user}/Documents                    <--> ~/BDAPCloud/data
/eos/jeodpp/home/users/{user}             <--> ~/BDAPCloud/eos_user_home
/eos/jeodpp/data/projects/LEGENT/transfer <--> ~/BDAPCloud/eos_LEGENT
```
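
The folder mapping above can also be written as a small lookup, which may help when checking where a file saved on BDAP will show up on your PC. This is only a sketch: `to_nextcloud_path`, `NEXTCLOUD_MAPPING` and the user name `alice` are hypothetical and not part of uds-scripts; only the folder pairs come from the table.

```python
from pathlib import PurePosixPath

# BDAP-side roots and their PC-side NextCloud folders, copied from the table.
# "{user}" is filled in per call.
NEXTCLOUD_MAPPING = {
    "/home/{user}/Documents": "~/BDAPCloud/data",
    "/eos/jeodpp/home/users/{user}": "~/BDAPCloud/eos_user_home",
    "/eos/jeodpp/data/projects/LEGENT/transfer": "~/BDAPCloud/eos_LEGENT",
}


def to_nextcloud_path(bdap_path: str, user: str) -> str:
    """Translate a BDAP path into the matching PC-side NextCloud path.

    Hypothetical helper: raises ValueError when the path is outside
    every synced folder.
    """
    path = PurePosixPath(bdap_path)
    for bdap_root, pc_root in NEXTCLOUD_MAPPING.items():
        root = PurePosixPath(bdap_root.format(user=user))
        # Match the root itself or anything below it
        if path == root or root in path.parents:
            return str(PurePosixPath(pc_root) / path.relative_to(root))
    raise ValueError(f"{bdap_path} is not inside a NextCloud-synced folder")


print(to_nextcloud_path("/home/alice/Documents/results/test.parquet", "alice"))
# prints: ~/BDAPCloud/data/results/test.parquet
```

The function only rewrites path strings; it does not touch the file system, so it can be run on either side of the sync.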