Unified Data Structure db scripts

Code, files & db queries for the UDS data-lake on the BDAP linux platform.

0. Contents

  • script to import file data into mongo db
  • db queries in Python/R & JSON files embedded in Python or R scripts
  • sample code to make use of them
  • data from these C4 projects & activities:
    • EEA
    • DICE
    • commercial vehicle specs
    • ...

1. Installation

TIP: The Chrome browser is a much better fit for BDAP's JEODesk service; Firefox does not send any of its default keyboard shortcuts to the Guacamole-based web-page (i.e. to the terminal or the IDE you run in BDAP). For instance, [Ctrl+U]/[Ctrl+K] do not kill chars left/right of the cursor, but trigger "Show page-source" & "Search in address-bar" instead, respectively.

  1. Read and apply first-time git@BDAP setup instructions

  2. Clone this repo inside your "personal-eos" dir (/eos/jeodpp/home/users/$USER):

    cd /eos/jeodpp/home/users/$USER  # use $USTORE in the future
    git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git/ \
         && cd uds-scripts
  3. Run the script to set up your BDAP linux account for the first time, and to create a new conda-environment in which to install the project:

    ./setup-account.sh all

    NOTE: the command will modify your BDAP's rc-files to offer:

    • searching terminal history with UP arrow,
    • autocompletion for git & conda commands,
    • bash & ipython aliases & variables to eos-dirs,
    • git configured with your identity, for committing.

    It's better that you logout and re-login, to pick up the changes, but this is not a hard requirement for the next steps.

    If everything went as planned, all that remains is to run conda activate ./env, as described in the next step.

  4. Activate the created conda-env:

    conda activate ./env

    NOTE: you have to re-issue this command every time you start a new login session or launch a new bash shell.

  5. Create a new file mysecrets.py inside the cloned project and paste the db credentials that will be provided to you by your db administrator.

  6. Set up the conda-env as your IDE's Python interpreter:

    • PyCharm: File --> Settings --> Project --> Python Interpreter --> Press gear symbol and select “Add” --> Conda Environment --> Click "Existing environment" --> Select "uds-scripts/env" path.

    • VSCode: Press [Ctrl+Shift+P], start typing "select interpreter", pick "Python: Select Interpreter" from the drop-down list, then navigate to the path of the conda-env.

  7. (optional) install and configure nextcloud-agent on your PC, to sync local files with BDAP, like "dropbox" does (read section below).

  • For more info on managing Conda environments click here.
  • For the BDAP PyCharm/conda documentation click here.

1.1. Nextcloud mapping of BDAP folders with your PC

The nextcloud BDAP service is the main gateway for big data files into and out of BDAP (the other one is the gitlab server, but it's for code and smaller data-files).
You may selectively upload files through its web-based service, but it's easier to install the nextcloud-agent on your PC and have it sync files in the background, like "dropbox" does.

We suggest configuring the nextcloud-agent to sync your dirs like this:

Name           BDAP path                                      your PC path
=========      =========================================      =============
private:       /home/{user}/Documents                    <--> data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer    <--> eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer <--> eos_LEGENT

NOTE: by default, BDAP maps the 2nd "personal-eos" folder directly to the {user} dir, leaving no room for your experiments there outside nextcloud's mappings; this means wasting network bandwidth & storage on your PC while experimenting.

Please reconfigure the nextcloud-agent so as to replicate the setup of the "public-eos" dir, by creating a transfer subdir and limiting nextcloud to sync only that one.

Working with the above dirs is made easier by these variables, defined both in Python and as ENVIRONMENT variables (after installation):

export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
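
From Python, the same four locations can be derived without hard-coding a user name. The following is a minimal sketch, not the project's own uds4jrc.config code: the helper name eos_dirs is hypothetical, and it assumes only the paths listed above, letting an exported variable take precedence over the default when one is set.

```python
import os
from pathlib import PurePosixPath


def eos_dirs(user=None, env=None):
    """Return the four EOS locations as paths (hypothetical helper)."""
    env = os.environ if env is None else env
    user = user or env.get("USER", "")
    defaults = {
        "USTORE": f"/eos/jeodpp/home/users/{user}",
        "UTRANS": f"/eos/jeodpp/home/users/{user}/transfer",
        "PSTORE": "/eos/jeodpp/data/projects/LEGENT",
        "PTRANS": "/eos/jeodpp/data/projects/LEGENT/transfer",
    }
    # An exported variable wins over the built-in default.
    # PurePosixPath is used because BDAP is a linux platform.
    return {name: PurePosixPath(env.get(name, default))
            for name, default in defaults.items()}


# Example with an explicit user name and an empty environment.
dirs = eos_dirs(user="jdoe", env={})
print(dirs["UTRANS"])
```

Calling eos_dirs() with no arguments would read USER and any exported overrides from the current shell environment instead.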

2. Usage

  • activate your conda-env:

    cd $USTORE/uds-scripts
    conda activate $USTORE/uds-scripts/env

    or launch PyCharm/VSCode.

  • Create ANOTHER working folder, e.g. $USTORE/idea-1, so as to let others view (read-only) your intermediate results (or cd to your private home-dir, i.e. ~/idea-1).

  • --(process data)--

    • run scripts & queries to get the data from the db (and possibly create files)
      • BASH-HISTORY is your friend
    • modify script-files in this repo
    • update db (only for users with db admin credentials)
  • copy/move specific files to the folders synced by the nextcloud-agent,

  • or update the "public-eos" dir PSTORE (if you're certain about the results).
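
The folder handling in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a script shipped with this repo: the publish helper and the file name are hypothetical, and the demo runs against a temporary directory standing in for USTORE so that it can be tried anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def publish(result, transfer_dir):
    """Copy one result file into a nextcloud-synced folder (hypothetical helper)."""
    transfer_dir = Path(transfer_dir)
    transfer_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps; it returns the path of the new copy.
    return Path(shutil.copy2(result, transfer_dir))


# Stand-in for $USTORE; on BDAP you would point this at the real directory.
ustore = Path(tempfile.mkdtemp())

# Working folder for one experiment, next to (not inside) the synced "transfer" dir.
workdir = ustore / "idea-1"
workdir.mkdir(exist_ok=True)

result = workdir / "summary.txt"
result.write_text("intermediate result\n")

# Only the selected file goes to the synced folder; the rest stays in idea-1.
copied = publish(result, ustore / "transfer")
print(copied.name)
```

Keeping experiments in their own folder and copying out only selected files means the nextcloud-agent never has to sync intermediate data.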

2.1. Sample code

from uds4jrc.db import eea_2020_flattened
from uds4jrc.utils import save_to_parquet, save_to_excel  # assuming save_to_excel lives next to save_to_parquet
from uds4jrc.config import Config
import pandas as pd

df = pd.DataFrame.from_records(eea_2020_flattened.find())

...  # process your data

save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')

TIP: Don't store your pandas data-frames in CSVs; they are big, slow and lose precision.

Use excel-files when sharing data; they are also big, but they keep precision and are not very slow. Or preferably, store them in parquet.
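
As a small illustration of what a CSV round-trip can drop, here is a sketch that assumes only pandas is installed. The column names are made up, and the float_format argument is used deliberately to show how rounded text output loses precision; the date column loses its dtype regardless of how floats are formatted.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "registered": pd.to_datetime(["2020-01-31", "2020-02-29"]),
        "co2": [0.1 + 0.2, 1 / 3],
    }
)

# Round-trip through CSV text, with floats rounded to 3 decimals on write.
csv_text = df.to_csv(index=False, float_format="%.3f")
back = pd.read_csv(io.StringIO(csv_text))

# The datetime column comes back as plain text unless parse_dates is given.
print(df["registered"].dtype, "->", back["registered"].dtype)

# The rounded floats no longer equal the originals.
print((back["co2"] == df["co2"]).tolist())
```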

It's probably not the best idea to use BDAP's home-dirs for experimenting with big-data: https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage