Unified Data Structure db scripts

Contains:

  • script to import file data into the db
  • db queries, either as Python/R & JSON files or embedded in Python or R scripts
  • sample code showing how to use them

Install

  • Read and apply first-time git setup instructions.

  • Clone git repo:

    git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git
  • Create a new secrets.py file and paste the db credentials that will be provided to you by your db admin

  • Run this bash command to append the scripts path to $PATH and $PYTHONPATH:

    source ./install.sh
  • Create and activate a new virtual environment using the environment.yml file:

    conda env create -f environment.yml
    conda activate uds-scripts

    For more info, see the Conda documentation on managing environments and the JEODPP PyCharm/Conda documentation.

  • Set up the Conda virtual environment as your PyCharm interpreter. Follow these steps: File -> Settings -> Project -> Python Interpreter -> press the gear symbol and select "Add" -> Conda Environment -> click "Existing environment" -> select uds-scripts

Usage

  • cd to the directory where you are going to work, inside your home-dir, e.g. /home/{user}/{foo} (this may be a new one)

  • --(process db-data)--

    • run scripts & queries to get the data from the db (and possibly create files)
      • BASH-HISTORY is your friend
    • modify script-files
    • update db (only for users with db admin credentials)
  • copy/move specific files for the NextCloud agent to pick them up, from:

    /home/{user}/Documents/{foo} --> /home/{user}/{foo}/Documents
                                 --> /eos/jeodpp/home/users/{user}/{bar}
                                 --> /eos/jeodpp/data/projects/LEGENT/transfer/{baz}
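The copy step above can also be scripted. This is a minimal sketch using only the Python standard library. `publish_file` is a hypothetical helper, not part of this repo, and the destination is whichever of the synced folders listed above you choose.

```python
import shutil
from pathlib import Path


def publish_file(src, dest_dir):
    """Copy `src` into `dest_dir` (created if missing) and return the new path.

    Hypothetical helper for illustration; not part of uds-scripts.
    """
    src = Path(src)
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 copies the contents and also keeps the file's modification time.
    return Path(shutil.copy2(src, dest_dir / src.name))
```

For example, `publish_file("test.parquet", "~/Documents/{foo}")`, with `{foo}` replaced by your own folder name, would place the file under the folder that the NextCloud agent watches.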

Sample code

from db import eea_2020_flattened
from utils import save_to_parquet, save_to_excel  # assuming save_to_excel also lives in utils
from config import Config
import pandas as pd

df = pd.DataFrame.from_records(eea_2020_flattened.find())

...  # process your data

save_to_parquet(df, Config.PUBLIC_SAVE_PATH, 'test.parquet')
save_to_excel(df, Config.PUBLIC_SAVE_PATH, 'test.xlsx')

TIP: Don't store your pandas DataFrames in CSVs: they are big, slow, and lose precision.

Use Excel files when sharing data: they are also big, but they keep precision and are not very slow. Or, preferably, store them in parquet.
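This is a minimal sketch of one way a CSV round trip loses information. A CSV file does not store column types, so a datetime column comes back as plain text unless the reader is explicitly told to parse it. The column names and values are made up for illustration, and the round trip happens in memory so no files are written.

```python
import io

import pandas as pd

# Made-up sample data: one datetime column and one float column.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2020-01-01", "2020-06-30"]),
        "value": [0.1, 2.5],
    }
)

# Write to an in-memory CSV and read it straight back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# The original column is a datetime column; the one read back is not.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(df["when"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(from_csv["when"])
print(is_datetime_before, is_datetime_after)
```

Parquet stores the column types alongside the data, so this kind of manual re-parsing is not needed there.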

It may not be the best idea to use our BDAP home-dirs for experimenting with big data: https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage

NextCloud mapping of folders

BDAP                                           PC-folder of NextCloud
=========================================      =========================
/home/{user}/Documents                    <--> ~/BDAPCloud/data
/eos/jeodpp/home/users/{user}             <--> ~/BDAPCloud/eos_user_home
/eos/jeodpp/data/projects/LEGENT/transfer <--> ~/BDAPCloud/eos_LEGENT
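
The table above can be expressed as a small lookup that works out where a BDAP file should appear on the PC. This is a sketch under one assumption: that sub-folders are mirrored at the same relative path on both sides. `NEXTCLOUD_MAP` and `to_pc_path` are hypothetical names, not part of this repo.

```python
from pathlib import PurePosixPath

# BDAP folder -> NextCloud folder on the PC, copied from the table above.
# "{user}" is filled in on each call.
NEXTCLOUD_MAP = {
    "/home/{user}/Documents": "~/BDAPCloud/data",
    "/eos/jeodpp/home/users/{user}": "~/BDAPCloud/eos_user_home",
    "/eos/jeodpp/data/projects/LEGENT/transfer": "~/BDAPCloud/eos_LEGENT",
}


def to_pc_path(bdap_path, user):
    """Return the PC-side path for a BDAP path, or None if it is not synced."""
    path = PurePosixPath(bdap_path)
    for bdap_root, pc_root in NEXTCLOUD_MAP.items():
        root = PurePosixPath(bdap_root.format(user=user))
        # Match the synced folder itself or anything below it.
        if path == root or root in path.parents:
            return str(PurePosixPath(pc_root) / path.relative_to(root))
    return None
```

A `None` result means the file sits outside the three synced folders and will not be picked up by the NextCloud agent.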