Unified Data Structure db scripts
Code, files & db queries for the UDS data-lake on the BDAP linux platform.
Contains:
- script to import file data into mongo db
- db queries: Python/R code & JSON files embedded in Python or R scripts
- sample code to make use of them
- data from these C4 projects & activities:
  - EEA
  - DICE
  - commercial vehicle specs
  - ...
1. Installation
- Read and apply the first-time git setup instructions.
- Clone this repo inside your "personal-eos" dir (/eos/jeodpp/home/users/$USER):

      cd /eos/jeodpp/home/users/$USER  # may use $USTORE after you set up the project
      git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts \
          && cd uds-scripts

- Launch the Makefile::setup & install rules with GNU make, to set up your BDAP linux account for the first time and to install this project, respectively:

      make setup
      make install

NOTE: the 1st command modifies the unix rc-files of your BDAP-home; log out and log in again to pick up the changes.
HINT: the 2nd command creates a new conda environment, installs the dependencies, and installs this Python project in "editable" mode (meaning any code/script changes are immediately reflected). It might take some time to complete.
- Create a new secrets.py file inside the cloned project and paste the db credentials that will be provided to you by your db administrator.
- Set up the conda-env as your IDE's Python interpreter:
  - PyCharm: File --> Settings --> Project --> Python Interpreter --> Press gear symbol and select “Add” --> Conda Environment --> Click "Existing environment" --> Select uds-scripts
  - VSCode: TODO
- (optional) Install and configure nextcloud-agent on your PC, to sync local files with BDAP, like "dropbox" does (read section below).
- For more info on managing Conda environments click here.
- For the BDAP PyCharm/conda documentation click here.
1.1. Nextcloud mapping of BDAP folders with your PC
The nextcloud BDAP service is the only gateway for big data files into and out of BDAP (the other gateway is the gitlab server, but that is for code and smaller data-files). You may selectively upload files through its web-based service, but it's easier to install the nextcloud-agent on your PC and have it sync files in the background, like "dropbox" does.
We suggest configuring nextcloud-agent to sync your dirs like this:
Name           BDAP path                                        your PC path
=============  =========================================  ====  =============
private:       /home/{user}/Documents                     <-->  data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer     <-->  eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer  <-->  eos_LEGENT
NOTE: by default, BDAP maps the 2nd "personal-eos" folder directly to the {user} dir, thus leaving no space there for your experiments outside of nextcloud's mappings; this means wasting network bandwidth & storage on your PC while experimenting. Please reconfigure nextcloud-agent to replicate the setup of the "public-eos" dir, by creating a transfer subdir and limiting nextcloud to sync only that one.
Working with the above dirs is facilitated by the definitions of these Python and (after installation) ENVIRONMENT variables:
export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
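The same four locations can also be resolved from plain Python. The sketch below is only an illustration: `bdap_dirs` is a hypothetical helper (not part of this repo), and it assumes the environment variables above take precedence over defaults built from the documented paths. The project's own Config class may already cover this.

```python
import os
from pathlib import Path


def bdap_dirs(env=None):
    """Return the four storage dirs, preferring env-vars over computed defaults.

    Hypothetical helper: `env` is any mapping (defaults to os.environ).
    """
    env = os.environ if env is None else env
    user = env.get("USER", "")
    # Defaults mirror the `export` lines documented above.
    defaults = {
        "USTORE": f"/eos/jeodpp/home/users/{user}",
        "UTRANS": f"/eos/jeodpp/home/users/{user}/transfer",
        "PSTORE": "/eos/jeodpp/data/projects/LEGENT",
        "PTRANS": "/eos/jeodpp/data/projects/LEGENT/transfer",
    }
    return {name: Path(env.get(name, default)) for name, default in defaults.items()}
```

Passing a plain dict instead of os.environ makes it easy to try the helper outside BDAP.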
2. Usage
- Activate your conda-env:

      conda activate uds-scripts

  or launch PyCharm/VSCode.
- Create ANOTHER working folder under USTORE/idea-1 so as to let others view (read-only) your intermediate results (or cd to your private home-dir, i.e. ~/idea-1).
- --(process data)--
  - run scripts & queries to get the data from the db (and possibly create files)
  - BASH-HISTORY is your friend
  - modify script-files in this repo
  - update db (only for users with db admin credentials)
  - run scripts & queries to get the data from the db (and possibly create files)
- copy/move specific files to the folders synced by the nextcloud-agent,
- or update the "public-eos" dir PSTORE (if you're certain about the results).
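The first and last steps above (create a working folder, then copy selected results to a synced folder) could be scripted roughly as follows. Both function names are hypothetical, and the sketch uses only the standard library; it does not touch file permissions, which depend on how BDAP is configured.

```python
import shutil
from pathlib import Path


def make_workdir(store_dir, name):
    """Create (if missing) a working folder such as USTORE/idea-1 and return it."""
    workdir = Path(store_dir) / name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def publish(result_file, transfer_dir):
    """Copy one result file into a nextcloud-synced dir and return the new path."""
    transfer_dir = Path(transfer_dir)
    transfer_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps timestamps and returns the path of the copied file
    return Path(shutil.copy2(result_file, transfer_dir))
```

For example, `publish(workdir / "test.parquet", UTRANS)` would place a copy under your personal transfer dir, where the nextcloud-agent picks it up.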
2.1. Sample code
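import pandas as pd

from uds4jrc.config import Config
from uds4jrc.db import eea_2020_flattened
# ASSUMPTION: save_to_excel lives next to save_to_parquet in uds4jrc.utils
from uds4jrc.utils import save_to_excel, save_to_parquet

# Load every document of the collection into a dataframe
df = pd.DataFrame.from_records(eea_2020_flattened.find())
...  # process your data
# Store the results under your "personal-eos" dir
save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')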
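The sample above loads the whole collection. If `eea_2020_flattened` is a pymongo collection (its `.find()` call suggests so, but that is an assumption), the query can be narrowed before the data reaches pandas. `fetch_records` below is a hypothetical helper, and the field names in the usage comment are made up for illustration.

```python
def fetch_records(collection, query=None, fields=None):
    """Return matching documents as a list of dicts, without Mongo's `_id`.

    `collection` is anything with a pymongo-style find(filter, projection).
    """
    projection = {"_id": 0}  # always drop the ObjectId column
    if fields:
        # keep only the requested fields
        projection.update({name: 1 for name in fields})
    return list(collection.find(query or {}, projection))


# Usage (hypothetical field names):
#   records = fetch_records(eea_2020_flattened, {"year": 2020}, ["make", "co2"])
#   df = pd.DataFrame.from_records(records)
```

Filtering and projecting on the db side keeps memory use down when only a slice of the collection is needed.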
TIP: Don't store your pandas dataframes in CSVs: they are big, slow, and lose precision.
Use excel-files when sharing data: they are also big, but they keep precision and are not very slow. Or preferably, store them in parquet.
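A related weakness of CSV, separate from the precision point in the tip, is that it carries no type information, so column types have to be re-guessed on reading. The following sketch, with made-up sample data, shows a string column and a datetime column changing type after a CSV round-trip; `csv_roundtrip` is a hypothetical helper.

```python
import io

import pandas as pd


def csv_roundtrip(df):
    """Write a dataframe to an in-memory CSV and read it back with defaults."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    buf.seek(0)
    return pd.read_csv(buf)


# Made-up sample: ids stored as text (leading zeros matter) and real datetimes
df = pd.DataFrame({
    "vehicle_id": ["007", "042"],
    "registered": pd.to_datetime(["2020-01-31", "2020-02-29"]),
})
back = csv_roundtrip(df)
print(df.dtypes)    # text column and datetime column
print(back.dtypes)  # ids re-read as integers, dates no longer datetimes
```

Parquet stores the column types alongside the data, which avoids this kind of re-guessing.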
It may not be the best idea to use BDAP's home-dirs for experimenting with big data; see: https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage