Kostis ANAGNOSTOPOULOS authored
- refact: rename `secrets.py` --> `mysecrets.py`
- refact: add more, rename & homogenize dir vars with bash ones.
- FEAT: +`setup.cfg` +`pyproject.toml` to package lib-code.
- feat: add project coordinates in new package(`__init__.py`).
- feat: +`LICENSE.txt` of EUPL v1.2

Rationale: refactored as a package so as not to pollute/hide standard-lib modules,
and, by being pip-installable, to facilitate running scripts from dirs anywhere.
0f11755f

Unified Data Structure db scripts

Code, files & db queries for the UDS data-lake on the BDAP linux platform.

Contains:

  • script to import file data into mongo db
  • db queries in Python/R, and JSON files embedded in Python or R scripts
  • sample code to make use of them
  • data from these C4 projects & activities:
    • EEA
    • DICE
    • commercial vehicle specs
    • ...

1. Installation

  1. Read and apply first-time git setup instructions

  2. Clone this repo inside your "personal-eos" dir (/eos/jeodpp/home/users/$USER):

    cd /eos/jeodpp/home/users/$USER  # you may use $USTORE once the project is set up
    git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts \
        && cd uds-scripts

  3. Launch the Makefile::setup & install rules with GNU make, to set up your BDAP linux account for the first time and to install this project, respectively:

    make setup
    make install

    NOTE: the 1st command modifies the unix rc-files in your BDAP-home; it's better to log out and log in again, to pick up the changes.

    HINT: the 2nd command will create a new conda environment, install dependencies, and install this Python project in "editable" mode (meaning, any code/script changes are immediately reflected). It might take some time to complete.

  4. Create a new secrets.py file inside the cloned project and paste the db credentials that will be provided to you by your db administrator.

  5. Set up the conda-env as your IDE's Python interpreter:

    • PyCharm: File --> Settings --> Project --> Python Interpreter --> Press gear symbol and select “Add” --> Conda Environment --> Click "Existing environment" --> Select uds-scripts

    • VSCode: TODO

  6. (optional) install and configure nextcloud-agent on your PC, to sync local files with BDAP, like "dropbox" does (read section below).

  • For more info on managing Conda environments click here.
  • For the bdap pycharm/conda documentation click here
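
A quick way to check that the installation steps above took effect is to ask Python which conda-env is active and whether the package can be imported. This is only a minimal sketch: the package name `uds4jrc` is taken from the sample code further down, `CONDA_DEFAULT_ENV` is the variable that `conda activate` sets, and the helper names `is_importable` and `active_conda_env` are hypothetical, not part of this project.

```python
import importlib.util
import os


def is_importable(package):
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package) is not None


def active_conda_env():
    """Return the name of the active conda-env, or None if none is active."""
    return os.environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    # After `make install` and `conda activate uds-scripts`, one would expect
    # the env name to be "uds-scripts" and the package to be importable.
    print("active conda env:", active_conda_env())
    print("uds4jrc importable:", is_importable("uds4jrc"))
```

If the package is reported as not importable, the most likely cause is that the IDE or shell is using a different interpreter than the one of the conda-env.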

1.1. Nextcloud mapping of BDAP folders with your PC

The nextcloud BDAP service is the only gateway for big data files into and out of BDAP (the other gateway, the gitlab server, is meant for code and smaller data-files).
You may selectively upload files through its web-based service, but it's easier to install the nextcloud-agent on your PC and have it sync files in the background, like "dropbox" does.

We suggest configuring nextcloud-agent to sync your dirs like this:

Name           BDAP path                                      your PC path
=========      =========================================      =============
private:       /home/{user}/Documents                    <--> data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer    <--> eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer <--> eos_LEGENT

NOTE: by default, BDAP maps the 2nd "personal-eos" folder directly to {user} dir, thus leaving no space for your experiments there, beyond nextcloud's mappings; and this means wasting network bandwidth & storage on your PC while experimenting.

Please reconfigure nextcloud-agent so as to replicate the setup of the "public-eos" dir, by creating a transfer subdir and limiting nextcloud to sync only that one.

Working with the above dirs is facilitated by the definitions of these Python and environment variables (available after installation):

export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
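
On the Python side, the same four locations can be derived from the user name. The sketch below is not the project's `Config` class; it is a hypothetical helper (`bdap_dirs`) that mirrors the exports above and lets an already-exported environment variable take precedence over the default path.

```python
import os
from pathlib import PurePosixPath

# Roots copied from the exports above (BDAP linux paths).
EOS_HOME_ROOT = PurePosixPath("/eos/jeodpp/home/users")
PROJECT_ROOT = PurePosixPath("/eos/jeodpp/data/projects/LEGENT")


def bdap_dirs(user, environ=None):
    """Return USTORE/UTRANS/PSTORE/PTRANS as a dict of paths.

    `environ` defaults to the process environment; a variable that is
    already set there overrides the default derived from `user`.
    """
    env = os.environ if environ is None else environ
    ustore = EOS_HOME_ROOT / user
    defaults = {
        "USTORE": ustore,
        "UTRANS": ustore / "transfer",
        "PSTORE": PROJECT_ROOT,
        "PTRANS": PROJECT_ROOT / "transfer",
    }
    return {
        name: PurePosixPath(env.get(name, str(default)))
        for name, default in defaults.items()
    }


if __name__ == "__main__":
    for name, path in bdap_dirs(os.environ.get("USER", "")).items():
        print(f"{name}={path}")
```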

2. Usage

  • activate your conda-env:

    conda activate uds-scripts

    or launch PyCharm/VSCode.

  • Create ANOTHER working folder under USTORE/idea-1 so as to let others view (read-only) your intermediate results (or cd to your private home-dir, i.e. ~/idea-1).

  • --(process data)--

    • run scripts & queries to get the data from the db (and possibly create files)
      • BASH-HISTORY is your friend
    • modify script-files in this repo
    • update db (only for users with db admin credentials)
  • copy/move specific files to the folders synced by the nextcloud-agent,

  • or update the "public-eos" dir PSTORE (if you're certain about the results).
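
The folder handling in the steps above can be sketched in a few lines of Python. The helper names `make_workdir` and `publish` are hypothetical (they are not part of this repo), and the paths in the usage comment simply reuse the USTORE/UTRANS layout described earlier.

```python
import shutil
from pathlib import Path


def make_workdir(store, name):
    """Create (if missing) and return a working folder such as USTORE/idea-1."""
    workdir = Path(store) / name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def publish(files, transfer_dir):
    """Copy selected result files into a folder synced by the nextcloud-agent.

    The originals stay in the working folder; the copies are returned.
    """
    transfer_dir = Path(transfer_dir)
    transfer_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(f, transfer_dir)) for f in files]


# Possible usage, assuming the USTORE/UTRANS layout described above:
#   work = make_workdir("/eos/jeodpp/home/users/<user>", "idea-1")
#   ... run queries, write files into `work` ...
#   publish([work / "test.parquet"], "/eos/jeodpp/home/users/<user>/transfer")
```

Copying rather than moving keeps the intermediate results visible to colleagues in the working folder, while only the selected files are synced to your PC.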

2.1. Sample code

from uds4jrc.db import eea_2020_flattened
from uds4jrc.utils import save_to_parquet, save_to_excel  # assuming save_to_excel also lives in uds4jrc.utils
from uds4jrc.config import Config
import pandas as pd

df = pd.DataFrame.from_records(eea_2020_flattened.find())

...  # process your data

save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')

TIP: Don't store your pandas data in CSVs; they are big, slow and lose precision.

Use excel-files when sharing data; they are also big, but they keep precision and are not very slow. Or preferably, store the data in parquet.

It may not be the best idea to use bdap's home-dirs for experimenting with big-data: https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage