# Unified Data Structure db scripts

Code, files & db queries for the UDS data-lake on the BDAP linux platform.

**NOTE:** the data are provided separately, due to their size,
preloaded in *eos-dirs* (read the section below about Nextcloud mapping).

## 0. Contents

* `build_eea.py` - script to import EEA's files into mongo db
* `example.py` - sample script to query db and extract docs ("rows") to files
* `src/uds4jrc/` - python project to support the scripts, acting as their library
* data from these C4 projects & activities:
  * EEA - vehicle registrations in EU, reported by Member States & packaged by EEA
  * Fiat500x - campaign with OBD data from JRC "amateur" drivers
  * DICE - TODO
  * commercial vehicle specs - TODO
  * ...

## 1. Installation

> **TIP:** Prefer _Chrome browser_ when logging into BDAP.
>
> _Chrome_ is a much better fit for BDAP's JEODesk service;
> _Firefox_ does not send any of its default keyboard shortcuts to the _Guacamole_-based
> web-page (ie to the terminal or the IDE you run in BDAP).
> For instance, **[Ctrl+U]/[Ctrl+K]** in _Firefox_ do not kill chars left/right of the cursor,
> but trigger _"Show page-source"_ & _"Search in address-bar"_ instead, respectively.

1. Read and apply the [first-time `git@BDAP` setup instructions](./GIT_SETUP.md).

2. Clone this repo inside your **"personal-eos" dir** (`/eos/jeodpp/home/users/$USER`):

   ```bash
   cd /eos/jeodpp/home/users/$USER    # use $UEOS var in the future
   git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git/ \
       && cd uds-scripts
   ```

3. Run the `setup-account.sh` script to set up your BDAP linux account for the first time,
   and create a new conda-environment to install the project:

   ```bash
   ./setup-account.sh all
   ```

   > **NOTE:** the command will modify your BDAP's rc-files to offer:
   > - searching terminal history with UP arrow,
   > - autocompletion for `git` & `conda` commands,
   > - bash & ipython aliases & variables to *eos-dirs*,
   > - a prepared `git` identity, for when committing.
   >
   > It's better that you log out and log in again, to pick up the changes,
   > but this is not a hard requirement for the next steps.
   >
   > If everything went as planned, the 2nd command is just `conda activate ./env`

4. Activate the created *conda-env*:

   ```bash
   conda activate ./env
   ```

   > **TIP:** you have to re-issue this command every time you start a new login session
   > or launch a new bash shell.

5. Create a new file `mysecrets.py` inside the root dir of the cloned project,
   and paste inside it the db credentials provided by your db administrator.

6. Set up the *conda-env* as your IDE's Python interpreter:
   * **PyCharm:** File --> Settings --> Project --> Python Interpreter -->
     Press gear symbol and select "Add" --> Conda Environment -->
     Click "Existing environment" --> Select "uds-scripts/env" path.
   * **VSCode:** Press **[Ctrl+P]**, start typing "select interpreter",
     select "Python: select interpreter" from the drop-down list,
     then navigate to the path of the *conda-env*.

7. (optional) Install and configure **nextcloud-agent** on your PC,
   to sync local files with BDAP, like "dropbox" does (read section below).

* For more info on managing Conda environments click [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
* For the BDAP PyCharm/conda documentation click [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).

### 1.1. Nextcloud mapping of BDAP folders with your PC

The [*nextcloud* BDAP service](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/Document-Sharing)
is the main gateway for big data files into and out of BDAP
(the other gateway is the *gitlab server*, but it is meant for code and smaller data-files).
You may selectively upload files through [its web-based service](https://jeodpp.jrc.ec.europa.eu/apps/cloud/),
but it's easier to install [the nextcloud-agent](https://nextcloud.com/install/#install-clients) on your PC
and have it sync files in the background, like "dropbox" does.

We suggest configuring nextcloud-agent to sync your dirs like this:

```text
Name           BDAP path                                       your PC path
=============  =========================================       =============
private:       /home/{user}/Documents                     <--> data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer     <--> eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer  <--> eos_LEGENT
```

> **NOTE:** by default, BDAP maps the 2nd "personal-eos" folder directly to the `{user}` dir,
> thus leaving *no space for your experiments* there, beyond nextcloud's mappings;
> and this means wasting network bandwidth & storage on your PC while experimenting.
>
> Please reconfigure *nextcloud-agent* so as to replicate the setup of the "public-eos" dir,
> by creating a `transfer` subdir and limiting nextcloud to sync only that one.

Working with the above dirs is made easier by these variables,
defined (after installation) both in Python and in the shell environment:

```bash
export UEOS="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PEOS="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
```

## 2. Usage

* Activate your conda-env:

  ```bash
  cd $UEOS/uds-scripts
  conda activate $UEOS/uds-scripts/env
  ```

  or launch PyCharm/VSCode.

* **Create ANOTHER working folder for your code & data** under `$UEOS/idea-1`,
  so as to let others view (read-only) your intermediate results
  (or `cd` to your private home-dir, ie `~/idea-1`).
* *--(process data)--*
  * run scripts & queries to get the data from the db (and possibly create files)
  * BASH-HISTORY is your friend
  * modify script-files in this repo
  * update db (only for users with db admin credentials)
* copy/move specific files to the folders synced by the *nextcloud-agent*,
* or update the "public-eos" dir `$PEOS` (if you're certain about the results).

### 2.1. Sample code

```python
import pandas as pd

from uds4jrc.config import Config
from uds4jrc.db import eea_2020_flattened
# NOTE: `save_to_excel` is assumed to live next to `save_to_parquet`
from uds4jrc.utils import save_to_excel, save_to_parquet

# Fetch all docs ("rows") of the collection into a dataframe.
df = pd.DataFrame.from_records(eea_2020_flattened.find())
...  # process your data

# Store the results under your "personal-eos" dir.
save_to_parquet(df, Config.UEOS, 'test.parquet')
save_to_excel(df, Config.UEOS, 'test.xlsx')
```

> **TIP:** Don't store your pandas dataframes in CSVs; they are big, slow and lose precision.
>
> Use excel-files when sharing data; they are also big, but they keep precision and are not very slow.
> Or preferably, store them in parquet.
>
> Maybe it's not the best idea to use BDAP's home-dirs for experimenting with big-data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage
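
As a complement to the sample above, here is a minimal sketch of how the *eos-dirs* variables from section 1.1 could be resolved in plain Python when building output paths for a working folder such as `idea-1`. The helpers `eos_dirs()` and `work_file()` are hypothetical names used only for illustration; they are not part of `uds4jrc`, whose `Config` class presumably already exposes these locations. The fallback paths are the ones listed in section 1.1.

```python
import os
from pathlib import PurePosixPath


def eos_dirs(user=None, env=None):
    """Return the four eos-dirs, preferring variables that are already exported.

    Hypothetical helper: falls back to the default paths of section 1.1
    when a variable is missing from the environment.
    """
    env = os.environ if env is None else env
    user = user or env.get("USER", "")
    ueos = PurePosixPath(env.get("UEOS") or f"/eos/jeodpp/home/users/{user}")
    peos = PurePosixPath(env.get("PEOS") or "/eos/jeodpp/data/projects/LEGENT")
    return {
        "UEOS": ueos,
        "UTRANS": PurePosixPath(env.get("UTRANS") or ueos / "transfer"),
        "PEOS": peos,
        "PTRANS": PurePosixPath(env.get("PTRANS") or peos / "transfer"),
    }


def work_file(idea, filename, dirs=None):
    """Build (without creating anything) a path like $UEOS/<idea>/<filename>."""
    dirs = eos_dirs() if dirs is None else dirs
    return dirs["UEOS"] / idea / filename


if __name__ == "__main__":
    # Where an intermediate result of the "idea-1" experiment would go.
    print(work_file("idea-1", "test.parquet"))
```

Keeping experiments under `$UEOS/<idea>` rather than under the `transfer` subdirs means that only the files you deliberately copy to `UTRANS` or `PTRANS` are synced by the *nextcloud-agent*.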