# Unified Data Structure db scripts

Code, files & db queries for the UDS data-lake on the BDAP linux platform.

Contains:

* script to import file data into mongo db
* db queries: Python/R & JSON files embedded in Python or R scripts
* sample code to make use of them
* data from these C4 projects & activities:
  * EEA
  * DICE
  * commercial vehicle specs
  * ...

## 1. Installation

1. Read and apply the [first-time `git` setup instructions](./GIT_SETUP.md).
2. Clone this repo inside your **"personal-eos" dir** (`/eos/jeodpp/home/users/$USER`):

   ```bash
   cd /eos/jeodpp/home/users/$USER  # may use $USTORE after you setup the project
   git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts \
       && cd uds-scripts
   ```

3. Launch the `Makefile::setup` & `install` rules with *GNU make*,
   to set up your BDAP linux account for the first time and to install this project, respectively:

   ```bash
   make setup
   make install
   ```

   > **NOTE:** the 1st command modifies the unix rc-files of your BDAP-home;
   > it's better to log out and log in again, to pick up the changes.

   > **HINT:** the 2nd command will create a new conda environment,
   > install dependencies, and install this Python project in "editable" mode
   > (meaning, any code/script changes are immediately reflected).
   > It might take some time to complete.

4. Create a new `secrets.py` file inside the cloned project and paste into it the db credentials
   that will be provided to you by your db administrator.
5. Set up the *conda-env* as your IDE's Python interpreter:
   * **PyCharm:** File --> Settings --> Project --> Python Interpreter --> Press gear symbol
     and select "Add" --> Conda Environment --> Click "Existing environment" --> Select uds-scripts
   * **VSCode:** TODO
6. (optional) Install and configure **nextcloud-agent** on your PC, to sync local files
   with BDAP, like "dropbox" does (read section below).

* For more info on managing Conda environments click
  [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
* For the BDAP PyCharm/conda documentation click
  [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).

### 1.1. Nextcloud mapping of BDAP folders with your PC

The [*nextcloud* BDAP service](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/Document-Sharing)
is the only gateway for big data files into and out of BDAP
(the other gateway is the *gitlab server*, but it's for code and smaller data-files). \
You may selectively upload files through [its web-based service](https://jeodpp.jrc.ec.europa.eu/apps/cloud/),
but it's easier to install [the nextcloud-agent](https://nextcloud.com/install/#install-clients)
on your PC and have it sync files in the background, like "dropbox" does.

We suggest configuring nextcloud-agent to sync your dirs like this:

```text
Name           BDAP path                                       your PC path
=============  =========================================       =============
private:       /home/{user}/Documents                     <--> data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer     <--> eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer  <--> eos_LEGENT
```

> **NOTE:** by default, BDAP maps the 2nd "personal-eos" folder directly to the `{user}` dir,
> thus leaving *no space for your experiments* there, beyond nextcloud's mappings;
> this means wasting network bandwidth & storage on your PC while experimenting.
>
> Please reconfigure *nextcloud-agent* so as to replicate the setup of the "public-eos" dir,
> by creating a `transfer` subdir and limiting nextcloud to sync only that one.

Working with the above dirs is facilitated by these variables, which (after installation)
are defined both in Python and as environment variables:

```bash
export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
```

## 2. Usage

* Activate your conda-env:

  ```bash
  conda activate uds-scripts
  ```

  or launch PyCharm/VSCode.
* Create ANOTHER working folder under `$USTORE/idea-1`, so as to let others view (read-only)
  your intermediate results (or `cd` to your private home-dir, i.e. `~/idea-1`).
* *--(process data)--*
  * run scripts & queries to get the data from the db (and possibly create files)
  * BASH-HISTORY is your friend
  * modify script-files in this repo
  * update db (only for users with db admin credentials)
* copy/move specific files to the folders synced by the *nextcloud-agent*,
* or update the "public-eos" dir `PSTORE` (if you're certain about the results).

### 2.1. Sample code

```python
import pandas as pd

from config import Config
from db import eea_2020_flattened
# NOTE: `save_to_excel` is used below, so it is assumed to live in `utils`
# next to `save_to_parquet`.
from utils import save_to_excel, save_to_parquet

# Load all documents of the collection into a DataFrame.
df = pd.DataFrame.from_records(eea_2020_flattened.find())
...  # process your data
save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')
```

> **TIP:** Don't store your pandas data in CSVs; they are big, slow, and lose precision.
>
> Use excel-files when sharing data; they are also big, but they keep precision and are not very slow.
> Or preferably, store the data in parquet.
>
> It's probably not the best idea to use BDAP's home dirs for experimenting with big-data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage
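
### 2.2. Resolving a working folder (sketch)

The working-folder convention from the Usage list (a folder such as `idea-1` under `USTORE`) could be scripted roughly as follows. This is a minimal sketch, not code from this repo: the helper names `ustore` and `work_dir` are hypothetical, and the fallback path simply mirrors the `export USTORE=...` line documented above, for sessions where the variable is not set yet.

```python
import os
from pathlib import Path


def ustore(env=None) -> Path:
    """Return the "personal-eos" dir (hypothetical helper).

    Reads the USTORE environment variable and falls back to the
    path documented in this README when the variable is missing.
    """
    env = os.environ if env is None else env
    user = env.get("USER", "unknown")
    return Path(env.get("USTORE", f"/eos/jeodpp/home/users/{user}"))


def work_dir(name: str, env=None, create: bool = False) -> Path:
    """Return (and optionally create) a working folder under USTORE."""
    path = ustore(env) / name
    if create:
        # parents=True also creates missing intermediate dirs;
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `work_dir("idea-1", create=True)` would give you a folder to pass to the save helpers instead of writing directly into the top of `USTORE`.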