# Unified Data Structure db scripts

Code, files & db queries for the UDS data-lake on the BDAP linux platform.

Contains:

* script to import file data into mongo db
* db queries (Python/R & JSON files, embedded in Python or R scripts)
* sample code to make use of them
* data from these C4 projects & activities:
  * EEA
  * DICE
  * commercial vehicle specs
  * ...

1. Read and apply [first-time `git` setup instructions](./GIT_SETUP.md)
2. Clone this repo inside your **"personal-eos" dir** (`/eos/jeodpp/home/users/$USER`):
   ```bash
   cd /eos/jeodpp/home/users/$USER  # you may use $USTORE once the project is set up
   git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts \
       && cd uds-scripts
   ```
3. Run the `setup` & `install` rules of the `Makefile` with *GNU make*,
   to set up your BDAP linux account for the first time and to install this project,
   respectively:
   ```bash
   make setup
   make install
   ```

   > **NOTE:** the 1st command modifies the unix rc-files in your BDAP home;
   > it's better to log out and log in again, to pick up the changes.

   > **HINT:** the 2nd command will create a new conda environment,
   > install dependencies, and install this Python project in "editable" mode
   > (meaning, any code/script changes are immediately reflected).
   > It might take some time to complete.

4. Create a new `secrets.py` file inside the cloned project and paste the db credentials
   that will be provided to you by your db administrator.

5. Set up the *conda-env* as your IDE's Python interpreter:

    * **PyCharm:** File --> Settings --> Project --> Python Interpreter
     --> Press gear symbol and select “Add” --> Conda Environment
     --> Click "Existing environment" --> Select uds-scripts

    * **VSCode:** TODO

6. (optional) install and configure **nextcloud-agent** on your PC,
   to sync local files with BDAP, like "dropbox" does (read section below).


* For more info on managing Conda environments click [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
* For the BDAP PyCharm/conda documentation click [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).

## 1.1. Nextcloud mapping of BDAP folders with your PC

The [*nextcloud* BDAP service](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/Document-Sharing)
is the only gateway for big data files into and out of BDAP
(the other gateway is the *gitlab server*, but it's for code and smaller data-files). \
You may selectively upload files through [its web-based service](https://jeodpp.jrc.ec.europa.eu/apps/cloud/),
but it's easier to install [the nextcloud-agent](https://nextcloud.com/install/#install-clients)
on your PC and have it sync files in the background, like "dropbox" does.

We suggest configuring nextcloud-agent to sync your dirs like this:

```text
Name           BDAP path                                      your PC path
=========      =========================================      =============
private:       /home/{user}/Documents                    <--> data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer    <--> eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer <--> eos_LEGENT
```

> **NOTE:** by default, BDAP maps the 2nd "personal-eos" folder directly to the `{user}` dir,
> thus leaving *no space for your experiments* there, outside nextcloud's mappings;
> anything you experiment with gets synced, wasting network bandwidth & storage on your PC.
>
> Please reconfigure *nextcloud-agent* so as to replicate the setup of the "public-eos" dir,
> by creating a `transfer` subdir and limiting nextcloud to sync only that one.

Working with the above dirs is facilitated by the definitions of these Python and
environment variables (available after installation):

```bash
export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
```
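
From Python, the same four locations can be derived the way the exports above do. The following is a minimal sketch, not the project's actual `Config` class: the `bdap_paths` helper and its `project` parameter are hypothetical names, and it assumes that a variable already exported in the environment should win over the computed default.

```python
import os
from pathlib import Path


def bdap_paths(user=None, project="LEGENT"):
    """Return the four storage dirs as a dict of Paths (hypothetical helper).

    A variable already set in the environment takes precedence;
    otherwise the default mirrors the `export` lines above.
    """
    user = user or os.environ.get("USER", "")
    ustore = Path(os.environ.get("USTORE", f"/eos/jeodpp/home/users/{user}"))
    pstore = Path(os.environ.get("PSTORE", f"/eos/jeodpp/data/projects/{project}"))
    return {
        "USTORE": ustore,
        "UTRANS": Path(os.environ.get("UTRANS", ustore / "transfer")),
        "PSTORE": pstore,
        "PTRANS": Path(os.environ.get("PTRANS", pstore / "transfer")),
    }
```

Calling `bdap_paths()["UTRANS"]` then gives the folder that nextcloud syncs, whether or not the shell variables have been exported yet.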

## 2. Usage

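* activate your conda-env (e.g. `conda activate uds-scripts`, assuming the env is named `uds-scripts` as in the PyCharm step above)
* create ANOTHER working folder, e.g. `$USTORE/idea-1`, so as to let others view (read-only)
  your intermediate results (or `cd` to your private home-dir, i.e. `~/idea-1`); then from there: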
  * run scripts & queries to get the data from the db (and possibly create files)
    * BASH-HISTORY is your friend
  * modify script-files in this repo
  * update db (only for users with db admin credentials)
* copy/move specific files to the folders synced by the *nextcloud-agent*,
* or update the "public-eos" dir `PSTORE` (if you're certain about the results).
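
The "copy specific files to a synced folder" step can be sketched in Python. This is only an illustration under stated assumptions: `stage_for_sync` is a hypothetical helper (not part of this repo), and it assumes the target is one of the `transfer` dirs described in section 1.1.

```python
import shutil
from pathlib import Path


def stage_for_sync(files, transfer_dir):
    """Copy selected result files into a nextcloud-synced folder (hypothetical helper).

    Only the listed files are copied, so the rest of the working
    folder never gets synced to your PC.
    """
    transfer_dir = Path(transfer_dir)
    transfer_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, files):
        dst = transfer_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        copied.append(dst)
    return copied
```

For example, `stage_for_sync(["idea-1/test.parquet"], os.environ["UTRANS"])` would place one result file where the nextcloud-agent picks it up, assuming `UTRANS` has been exported as shown earlier.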

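```python
import pandas as pd

from config import Config
from db import eea_2020_flattened
# save_to_excel is assumed to live in utils, alongside save_to_parquet
from utils import save_to_excel, save_to_parquet

# Load all records of the collection into a DataFrame
df = pd.DataFrame.from_records(eea_2020_flattened.find())

...  # process your data

# Store the results in your personal-eos dir
save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')
```
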
> **TIP:** Don't store your pandas dataframes in CSVs: they are big, slow, and lose precision.
> Use excel-files when sharing data: they are also big, but they keep precision and are not very slow. Or preferably, store them in parquet.
>
> Maybe it's not the best idea to use BDAP's home-dirs for experimenting with big data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage