Unified Data Structure db scripts
Code, files & db queries for the UDS data-lake on the BDAP linux platform.
0. Contents
- script to import file data into MongoDB
- db queries Python/R & JSON files embedded in python or R scripts
- sample code to make use of them
- data from these C4 projects & activities:
- EEA
- DICE
- commercial vehicle specs
- ...
1. Installation
TIP: the Chrome browser is a much better fit for BDAP's JEODesk service; Firefox does not send any of its default keyboard shortcuts to the Guacamole-based web page (i.e. to the terminal or the IDE you run in BDAP). For instance, [Ctrl+U]/[Ctrl+K] do not kill the chars left/right of the cursor, but trigger "Show page-source" & "Search in address-bar" instead, respectively.
-
Read and apply the first-time git@BDAP setup instructions.
-
Clone this repo inside your "personal-eos" dir (/eos/jeodpp/home/users/$USER):
cd /eos/jeodpp/home/users/$USER  # use $USTORE in the future
git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git/ \
  && cd uds-scripts
-
Run the script to set up your BDAP linux account for the first time, and to create a new conda-environment to install the project:
./setup-account.sh all
NOTE: the command will modify your BDAP's rc-files to offer:
- searching terminal history with the UP arrow,
- autocompletion for git & conda commands,
- bash & ipython aliases & variables to eos-dirs,
- git prepared with your identity, for when committing.
It's better to log out and log back in to pick up the changes, but this is not a hard requirement for the next steps.
If everything went as planned, the 2nd command is just
conda activate uds-scripts
-
Activate the created conda-env:
conda activate uds-scripts
NOTE: you have to re-issue this command every time you start a new login session or launch a new bash shell.
-
Create a new file mysecrets.py inside the cloned project and paste in the db credentials that your db administrator will provide.
-
Set up the conda-env as your IDE's Python interpreter:
-
PyCharm: File --> Settings --> Project --> Python Interpreter --> Press gear symbol and select “Add” --> Conda Environment --> Click "Existing environment" --> Select "uds-scripts"
-
VSCode: Press [Ctrl+Shift+P], start typing "select interpreter" and pick "Python: Select Interpreter" from the drop-down list, then navigate to the path of the conda-env.
-
-
(optional) install and configure nextcloud-agent on your PC, to sync local files with BDAP, like "dropbox" does (read section below).
- For more info on managing Conda environments click here.
- For the BDAP PyCharm/conda documentation click here.
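Since the conda-env has to be re-activated in every new shell, a quick check from Python can save some confusion. The sketch below is only an illustration: it relies on the CONDA_DEFAULT_ENV variable that `conda activate` exports, and the helper names `active_conda_env` and `is_expected_env` are hypothetical, not part of this repo. For envs activated by path rather than by name, the variable may hold the full path instead.

```python
import os

EXPECTED_ENV = "uds-scripts"  # env name used in the installation steps


def active_conda_env(environ=None):
    """Return the name of the active conda env, or None if none is active."""
    environ = os.environ if environ is None else environ
    # `conda activate` exports CONDA_DEFAULT_ENV for the activated env.
    return environ.get("CONDA_DEFAULT_ENV") or None


def is_expected_env(environ=None, expected=EXPECTED_ENV):
    """True when the active conda env matches the expected name."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    if is_expected_env():
        print(f"OK: conda env '{EXPECTED_ENV}' is active")
    else:
        print(f"Run 'conda activate {EXPECTED_ENV}' first "
              f"(active env: {active_conda_env()})")
```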
1.1. Nextcloud mapping of BDAP folders with your PC
The nextcloud BDAP service
is the only gateway for big data files into and out of BDAP
(the other gateway is the gitlab server, but it's for code and smaller data-files).
You may selectively upload files through its web-based service,
but it's easier to install the nextcloud-agent
on your PC and have it sync files in the background, like "dropbox" does.
We suggest configuring nextcloud-agent to sync your dirs like this:
Name           BDAP path                                        your PC path
=============  =========================================        =============
private:       /home/{user}/Documents                     <-->  data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer     <-->  eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer  <-->  eos_LEGENT
NOTE: by default, BDAP maps the 2nd "personal-eos" folder directly to the {user} dir, thus leaving no space for your experiments there beyond nextcloud's mappings; this means wasting network bandwidth & storage on your PC while experimenting. Please reconfigure nextcloud-agent so as to replicate the setup of the "public-eos" dir, by creating a transfer subdir and limiting nextcloud to sync only that one.
Working with the above dirs is facilitated by the definitions of these Python and environment variables (available after installation):
export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
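As an illustration of how these variables might be consumed from Python, here is a minimal sketch that reads them from the environment and falls back to the paths listed above when they are not exported. The helper `eos_dirs` is hypothetical and not part of this repo.

```python
import os
from pathlib import PurePosixPath


def eos_dirs(environ=None):
    """Resolve the four storage dirs, preferring exported variables."""
    environ = os.environ if environ is None else environ
    user = environ.get("USER", "")
    # Fallbacks mirror the export lines listed above.
    ustore = PurePosixPath(
        environ.get("USTORE", f"/eos/jeodpp/home/users/{user}"))
    pstore = PurePosixPath(
        environ.get("PSTORE", "/eos/jeodpp/data/projects/LEGENT"))
    return {
        "USTORE": ustore,
        "UTRANS": PurePosixPath(environ.get("UTRANS", ustore / "transfer")),
        "PSTORE": pstore,
        "PTRANS": PurePosixPath(environ.get("PTRANS", pstore / "transfer")),
    }


if __name__ == "__main__":
    for name, path in eos_dirs().items():
        print(f"{name:7s}{path}")
```

PurePosixPath is used because the paths only exist on the BDAP linux side; the sketch never touches the filesystem.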
2. Usage
-
Activate your conda-env:
conda activate uds-scripts
or launch PyCharm/VSCode.
-
Create ANOTHER working folder under $USTORE/idea-1, so as to let others view (read-only) your intermediate results (or cd to your private home-dir, i.e. ~/idea-1).
-
--(process data)--
- run scripts & queries to get the data from the db (and possibly create files)
- BASH-HISTORY is your friend
- modify script-files in this repo
- update db (only for users with db admin credentials)
-
copy/move specific files for/to the folders synced by the nextcloud-agent,
-
or update the "public-eos" dir PSTORE (if you're certain about the results).
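The final copy step of the workflow above could be scripted roughly as follows. This is a sketch under assumptions: `copy_to_transfer` is a hypothetical helper, not something this repo provides, and the target would typically be the UTRANS or PTRANS dir.

```python
import shutil
from pathlib import Path


def copy_to_transfer(files, transfer_dir):
    """Copy the given files into transfer_dir and return the new paths."""
    target = Path(transfer_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, files):
        if not src.is_file():
            raise FileNotFoundError(f"not a regular file: {src}")
        # copy2 also preserves modification times.
        copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied
```

Copying rather than moving keeps the originals in the working folder, so only the selected results end up in the dir that nextcloud syncs.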
2.1. Sample code
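from uds4jrc.db import eea_2020_flattened
# NOTE: save_to_excel is assumed to live next to save_to_parquet
from uds4jrc.utils import save_to_excel, save_to_parquet
from uds4jrc.config import Config
import pandas as pd

# load all documents of the collection into a dataframe
df = pd.DataFrame.from_records(eea_2020_flattened.find())
...  # process your data
# store the results under your "personal-eos" dir
save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')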
TIP: Don't store your pandas dataframes in CSVs: they are big, slow and lose precision.
Use excel-files when sharing data: they are also big, but they keep precision and are not very slow. Or preferably, store them in parquet.
Maybe it's not the best idea to use our BDAP home-dirs for experimenting with big-data: https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage