# Unified Data Structure db scripts

Code, files & db queries for the UDS data-lake on the BDAP linux platform.

**NOTE:** the data are provided separately, due to their size,
preloaded in *eos-dirs* (read section below about Nextcloud mapping).

## 0. Contents

* `build_eea.py` - script to import EEA's files into mongo db
* `example.py` - sample script to query db and extract docs ("rows") to files
* `src/uds4jrc/` - python project to support the scripts, acting as their library
* data from these C4 projects & activities:
  * EEA - vehicle registrations in EU, reported by MSs & packaged by EEA (raw data)
  * Fiat500x - campaign with OBD data from JRC "amateur" drivers - TODO
  * RealWorld (Travelcard, Geco Air, Spritmonitor)
  * ATCT 
  * DICE - TODO
  * commercial vehicle specs - TODO
  * ...

## 1. Installation

> **TIP:** Prefer _Chrome browser_ when logging into BDAP.
>
> _Chrome_ is a much better fit for BDAP's JEODesk service;
> _Firefox_ does not send any of its default keyboard shortcuts to the _Guacamole_-based
> web-page (i.e. to the terminal or the IDE you run in BDAP).
> For instance, **[Ctrl+U]/[Ctrl+K]** in _Firefox_ do not kill chars left/right of the cursor,
> but trigger _"Show page-source"_ & _"Search in address-bar"_ instead, respectively.

1. Read and apply [first-time `git@BDAP` setup instructions](./GIT_SETUP.md).

2. Clone this repo inside your **home dir** (`/home/$USER`):

   ```bash
   cd ~
   git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git/ \
        && cd uds-scripts
   ```

3. Run the `setup-account.sh` script to set up your BDAP linux account for the first time,
   and create a new conda-environment to install the project:

   ```bash
   ./setup-account.sh all
   ```

   > **NOTE:** the command will modify your BDAP rc-files to offer:
   > - searching terminal history with UP arrow,
   > - autocompletion for `git` & `conda` commands,
   > - bash & ipython aliases & variables to *eos-dirs*,
   > - a prepared `git` identity, for when committing.
   >
   > It's better that you logout and re-login, to pick up the changes,
   > but this is not a hard requirement for the next steps.
   >
   > If everything went as planned, the next command (step 4) is just `conda activate ./env`.

4. Activate the created *conda-env*:

   ```bash
   conda activate ./env
   ```

   > **TIP:** you have to re-issue this command every time you start a new login session or launch a new bash shell.

5. Create a new file `mysecrets.py` inside the root dir of the cloned project,
   and paste inside the db credentials provided by your db administrator.

6. Set up the *conda-env* as your IDE's Python interpreter:

    * **PyCharm:** File --> Settings --> Project --> Python Interpreter
      --> Press gear symbol and select “Add” --> Conda Environment
      --> Click "Existing environment" --> Select "uds-scripts/env" path.

    * **VSCode:** Press **[Ctrl+Shift+P]**, start typing "select interpreter", pick
      "Python: Select Interpreter" from the drop-down list,
      then navigate to the path of the *conda-env*.

7. (optional) Install and configure **nextcloud-agent** on your PC,
   to sync local files with BDAP, like "dropbox" does (read section below).


* For more info on managing Conda environments click [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
* For the BDAP PyCharm/conda documentation click [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).

### 1.1. Nextcloud mapping of BDAP folders with your PC

The [*nextcloud* BDAP service](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/Document-Sharing)
is the main gateway for big data files into and out of BDAP
(the other one is the *gitlab server*, but it's meant for code and smaller data-files). \
You may selectively upload files through [its web-based service](https://jeodpp.jrc.ec.europa.eu/apps/cloud/),
but it's easier to install [the nextcloud-agent](https://nextcloud.com/install/#install-clients)
on your PC and have it sync files in the background, like "dropbox" does.

We suggest configuring nextcloud-agent to sync your dirs as follows:

```text
Name           BDAP path                                      your PC path
=========      =========================================      =============
private:       /home/{user}/Documents                    <--> data
personal-eos:  /eos/jeodpp/home/users/{user}/transfer    <--> eos_user_home
public-eos:    /eos/jeodpp/data/projects/LEGENT/transfer <--> eos_LEGENT
```

> **NOTE:** by default, BDAP maps the 2nd "personal-eos" folder directly to `{user}` dir,
> thus leaving *no space for your experiments* there, beyond nextcloud's mappings;
> and this means wasting network bandwidth & storage on your PC while experimenting.
>
> Please reconfigure *nextcloud-agent* so as to replicate the setup of the "public-eos" dir,
> by creating a `transfer` subdir and limiting nextcloud to sync only that one.

Working with the above dirs is facilitated by these Python and
environment variables, defined after installation:

```bash
export UEOS="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PEOS="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
```
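
From Python, the same locations can be resolved from the environment, falling back to the paths listed above when a variable is not set. This is only a minimal sketch: `eos_dirs` is a hypothetical helper, not part of `uds4jrc` (the project's own `Config` class may already expose these values).

```python
import getpass
import os
from pathlib import Path


def eos_dirs(env=None, user=None):
    """Return the four eos-dirs as Paths, preferring environment variables.

    NOTE: hypothetical helper for illustration; the defaults simply
    repeat the paths documented in this README.
    """
    env = os.environ if env is None else env
    user = user or env.get("USER") or getpass.getuser()
    ueos = Path(env.get("UEOS", f"/eos/jeodpp/home/users/{user}"))
    peos = Path(env.get("PEOS", "/eos/jeodpp/data/projects/LEGENT"))
    return {
        "UEOS": ueos,
        # The transfer dirs default to a `transfer` subdir of their parent.
        "UTRANS": Path(env.get("UTRANS", ueos / "transfer")),
        "PEOS": peos,
        "PTRANS": Path(env.get("PTRANS", peos / "transfer")),
    }
```

Calling `eos_dirs()["UTRANS"]` inside a configured BDAP session should then give the personal folder synced by nextcloud.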

## 2. Usage

* activate your conda-env:

  ```bash
  cd ~/uds-scripts
  conda activate ~/uds-scripts/env
  ```

  or launch PyCharm/VSCode.

* **Create ANOTHER working folder for your code & data**
  under `$UEOS/idea-1` so as to let others view (read-only)
  your intermediate results.

* *--(process data)--*
  * run scripts & queries to get the data from the db (and possibly create files)
    * BASH-HISTORY is your friend
  * modify script-files in this repo
  * update db (only for users with db admin credentials)

* copy/move specific files to the folders synced by the *nextcloud-agent*,
* or update the "public-eos" dir `$PEOS` (if you're certain about the results).
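
The folder-related steps above can be sketched in Python as follows. Both functions are hypothetical helpers (not part of `uds4jrc`), and the folder name `idea-1` is just the example used above.

```python
import shutil
from pathlib import Path


def make_workdir(ueos, name="idea-1"):
    """Create (if missing) a working folder under the personal eos-dir."""
    workdir = Path(ueos) / name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def publish(result_file, transfer_dir):
    """Copy one result file into a nextcloud-synced transfer folder.

    Returns the path of the copy; the original file stays in place.
    """
    transfer_dir = Path(transfer_dir)
    transfer_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(result_file, transfer_dir))
```

With the environment variables from section 1.1 in place, `make_workdir(os.environ["UEOS"])` would create the working folder, and `publish(some_file, os.environ["UTRANS"])` would hand a single file over to nextcloud without syncing the whole working folder.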

### 2.1. Sample code

```python
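import pandas as pd

from src.uds4jrc.config import Config
from src.uds4jrc.db import eea_2020_flattened
# NOTE: `save_to_excel` is assumed to live next to `save_to_parquet`.
from src.uds4jrc.utils import save_to_excel, save_to_parquet

# Load all documents ("rows") of the collection into a DataFrame.
df = pd.DataFrame.from_records(eea_2020_flattened.find())

...  # process your data

# Store the results under your personal eos-dir.
save_to_parquet(df, Config.UEOS, 'test.parquet')
save_to_excel(df, Config.UEOS, 'test.xlsx')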
```
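
Calling `find()` without arguments pulls the whole collection into memory. Assuming the collection objects behave like `pymongo` collections (not confirmed here), a filter, a projection and a limit can narrow the query first. `fetch_docs` below is a hypothetical helper for illustration, not part of `uds4jrc`.

```python
def fetch_docs(collection, query=None, fields=None, limit=0):
    """Return matching documents as a list of dicts.

    `collection` is assumed to behave like a pymongo Collection,
    i.e. to offer `find(filter, projection)` returning a cursor
    with a `limit(n)` method.
    """
    # Projection: keep only the requested fields and drop Mongo's `_id`.
    projection = {name: 1 for name in fields} if fields else None
    if projection is not None:
        projection["_id"] = 0
    cursor = collection.find(query or {}, projection)
    if limit:
        cursor = cursor.limit(limit)
    return list(cursor)
```

The result can be passed straight to `pd.DataFrame.from_records(...)`, for example `fetch_docs(eea_2020_flattened, {"country": "IT"}, ["co2"], limit=1000)`, where the field names `country` and `co2` are made up for the example and must be replaced with real ones.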

> **TIP:** Don't store your pandas data-frames in CSVs; they are big, slow and lose precision.
>
> Use excel-files when sharing data; they are also big, but keep precision and are not very slow.  Or preferably, store them in parquet.
>
> Maybe it's not the best idea to use our BDAP home-dirs for experimenting with big-data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage

## 3. Data

| Name         |   Type    |                          Path                           |
|:-------------|:---------:|:-------------------------------------------------------:|
| EEA          | raw data  |          /eos/jeodpp/data/projects/LEGENT/eea           |
| f500x        | processed |        /eos/jeodpp/data/projects/LEGENT/fiat500         |
| Geco Air     | raw data  |     /eos/jeodpp/data/projects/LEGENT/realworld/geco     |
| Travelcard   | raw data  |  /eos/jeodpp/data/projects/LEGENT/realworld/travelcard  |
| Spritmonitor | raw data  | /eos/jeodpp/data/projects/LEGENT/realworld/spritmonitor |
| ATCT         | raw data  |          /eos/jeodpp/data/projects/LEGENT/atct          |