Code, files & db queries for the UDS data-lake on the BDAP linux platform.
**NOTE:** the data are provided separately, due to their size,
preloaded in *eos-dirs* (read section below about Nextcloud mapping).
* `build_eea.py` - script to import EEA's files into mongo db
* `example.py` - sample script to query db and extract docs ("rows") to files
* `src/uds4jrc/` - python project to support the scripts, acting as their library
* data from these C4 projects & activities:
* EEA - vehicle registrations in EU, reported by MSs & packaged by EEA (raw data)
* Fiat500x - campaign with OBD data from JRC "amateur" drivers - TODO
* RealWorld (Travelcard, Geco Air, Spritmonitor)
* ATCT
* DICE - TODO
* commercial vehicle specs - TODO
## 1. Installation
> **TIP:** Prefer _Chrome browser_ when logging into BDAP.
>
> _Chrome_ is a much better fit for BDAP's JEODesk service;
> _Firefox_ does not send any of its default keyboard shortcuts to the _Guacamole_-based
> web-page (i.e. to the terminal or the IDE you run in BDAP).
> For instance, **[Ctrl+U]/[Ctrl+K]** in _Firefox_ do not kill chars left/right of the cursor,
> but trigger _"Show page-source"_ & _"Search in address-bar"_ instead, respectively.
1. Read and apply [first-time `git@BDAP` setup instructions](./GIT_SETUP.md).
2. Clone this repo inside your **home dir** (`/home/$USER`):

        git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts.git/ \
            && cd uds-scripts

3. Run the `setup-account.sh` script to set up your BDAP linux account for the first time,
and create a new conda-environment to install the project:
> **NOTE:** the command will modify your BDAP's rc-files to offer:
> - searching terminal history with UP arrow,
> - autocompletion for `git` & `conda` commands,
> - bash & ipython aliases & variables to *eos-dirs*,
> - a prepared `git` identity, for when committing.
>
> It's better that you log out and re-login, to pick up the changes,
> but this is not a hard requirement for the next steps.
>
> If everything went as planned, the 2nd command is just `conda activate ./env`
4. Activate the created *conda-env*:
```bash
conda activate ./env
```
> **TIP:** you have to re-issue this command every time you start a new login session or launch a new bash shell.
5. Create a new file `mysecrets.py` inside the root dir of the cloned project,
and paste inside the db credentials provided by your db administrator.
6. Setup the *conda-env* as your IDE's Python interpreter:
* **PyCharm:** File --> Settings --> Project --> Python Interpreter
--> Press gear symbol and select “Add” --> Conda Environment
--> Click "Existing environment" --> Select "uds-scripts/env" path.
* **VSCode:** Press **[Ctrl+Shift+P]**, start typing "select interpreter",
pick "Python: Select Interpreter" from the drop-down list,
then navigate to the path of the *conda-env*.
7. (optional) install and configure **nextcloud-agent** on your PC,
to sync local files with BDAP, like "dropbox" does (read section below).
* For more info on managing Conda environments click [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
* For the BDAP PyCharm/conda documentation click [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).
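
To check that a shell or IDE really picked up the conda-env from steps 4 and 6, the running interpreter's prefix can be compared with the env dir. This is a minimal sketch: the helper name `uses_env` is hypothetical (not part of `uds4jrc`), and `~/uds-scripts/env` is assumed to be where the env was created.

```python
import sys
from pathlib import Path


def uses_env(env_dir, prefix=sys.prefix):
    """Return True if `prefix` (default: the running interpreter's) is `env_dir`."""
    # Resolve both sides so that `~`, relative paths and symlinks compare equal.
    env_path = Path(env_dir).expanduser().resolve()
    return Path(prefix).expanduser().resolve() == env_path


if __name__ == "__main__":
    # Assumed location of the conda-env created during installation.
    print(uses_env("~/uds-scripts/env"))
```

If this prints `False` inside the IDE's Python console, the interpreter selection from step 6 probably did not take effect.
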
### 1.1. Nextcloud mapping of BDAP folders with your PC
The [*nextcloud* BDAP service](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/Document-Sharing)
is the only gateway for big data files into and out of BDAP
(the *gitlab server* is a second route, but only for code and smaller data-files). \
You may selectively upload files through [its web-based service](https://jeodpp.jrc.ec.europa.eu/apps/cloud/),
but it's easier to install [the nextcloud-agent](https://nextcloud.com/install/#install-clients)
on your PC and have it sync files in the background, like "dropbox" does.
We suggest configuring nextcloud-agent to sync your dirs like this:
```text
Name BDAP path your PC path
========= ========================================= =============
private: /home/{user}/Documents <--> data
personal-eos: /eos/jeodpp/home/users/{user}/transfer <--> eos_user_home
public-eos: /eos/jeodpp/data/projects/LEGENT/transfer <--> eos_LEGENT
```
> **NOTE:** by default, BDAP maps the 2nd "personal-eos" folder directly to `{user}` dir,
> thus leaving *no space for your experiments* there, beyond nextcloud's mappings;
> and this means wasting network bandwidth & storage on your PC while experimenting.
>
> Please reconfigure *nextcloud-agent* so as to replicate the setup of the "public-eos" dir,
> by creating a `transfer` subdir and limiting nextcloud to sync only that one.
Working with the above dirs is made easier by these variables, defined
(after installation) both in Python and in the shell ENVIRONMENT:
```bash
export UEOS="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PEOS="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
```
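
From Python, the same locations can be derived with the standard library alone. The sketch below mirrors the four exports above. The function name `eos_dirs` is illustrative only, and the project's own `Config` class may expose these paths differently.

```python
import getpass
import os
from pathlib import PurePosixPath


def eos_dirs(user=None):
    """Build the four EOS paths for `user`, mirroring the shell exports."""
    # Fall back to $USER, then to the login name, when no user is given.
    user = user or os.environ.get("USER") or getpass.getuser()
    ueos = PurePosixPath("/eos/jeodpp/home/users") / user
    peos = PurePosixPath("/eos/jeodpp/data/projects/LEGENT")
    return {
        "UEOS": ueos,
        "UTRANS": ueos / "transfer",
        "PEOS": peos,
        "PTRANS": peos / "transfer",
    }
```

Calling `eos_dirs()` without arguments uses the current account, while `eos_dirs("some-user")` builds the paths for a colleague's dirs.
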
## 2. Usage
* activate your conda-env:

      cd ~/uds-scripts
      conda activate ~/uds-scripts/env

  or launch PyCharm/VSCode.
* **Create ANOTHER working folder for your code & data**
  under `$UEOS/idea-1` so as to let others view (read-only)
  your intermediate results.
* *--(process data)--*
* run scripts & queries to get the data from the db (and possibly create files)
* BASH-HISTORY is your friend
* modify script-files in this repo
* update db (only for users with db admin credentials)
* copy/move specific files for/to the folders synced by the *nextcloud-agent*,
* or update the "public-eos" dir `$PEOS` (if you're certain about the results).
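
The working-folder step can also be scripted. A minimal sketch, assuming `$UEOS` is exported as described in section 1.1. The helper `make_workdir` is hypothetical (not part of `uds4jrc`), and `idea-1` is just the placeholder name used above.

```python
import os
from pathlib import Path


def make_workdir(name, base=None):
    """Create (if missing) and return the working folder `<base>/<name>`."""
    # Default to $UEOS; fail loudly rather than writing somewhere unexpected.
    base = base or os.environ.get("UEOS")
    if not base:
        raise RuntimeError("UEOS is not set; pass `base` explicitly")
    workdir = Path(base) / name
    # exist_ok makes the call safe to repeat across sessions.
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir
```

For example, `make_workdir("idea-1")` returns the folder to hand to your own save routines. Whether colleagues can actually read it depends on the permissions of the EOS dir, which this sketch does not touch.
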
### 2.1. Sample code
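    import pandas as pd

    from src.uds4jrc.config import Config
    from src.uds4jrc.db import eea_2020_flattened
    # NOTE: `save_to_excel` is assumed to live next to `save_to_parquet`.
    from src.uds4jrc.utils import save_to_excel, save_to_parquet

    # Load every document ("row") of the collection into a DataFrame.
    df = pd.DataFrame.from_records(eea_2020_flattened.find())
    ...  # process your data
    # Store the results under your personal EOS dir.
    save_to_parquet(df, Config.UEOS, 'test.parquet')
    save_to_excel(df, Config.UEOS, 'test.xlsx')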
> **TIP:** Don't store your pandas DataFrames in CSVs: they are big, slow and lose precision.
> Use excel-files when sharing data; they are also big, but keep precision and are not very slow. Or preferably, store them in parquet.
>
> It may not be the best idea to use BDAP's home dirs for experimenting with big data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage
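
One concrete way a CSV round-trip degrades a DataFrame is by dropping type information. The sketch below uses made-up data and only `pandas` itself; the column names are invented for illustration. With default options on both sides, zero-padded codes come back as integers and timestamps come back as plain text.

```python
import io

import pandas as pd

# Made-up sample: zero-padded codes stored as text, plus a timestamp column.
df = pd.DataFrame(
    {
        "code": ["007", "042"],
        "when": pd.to_datetime(["2020-01-01", "2020-06-30"]),
    }
)

# Round-trip through CSV in memory, with default options on both sides.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

# The codes were re-read as integers, so the leading zeros are gone.
print(back["code"].tolist())
# The timestamp column no longer has its datetime dtype.
print(back["when"].dtype == df["when"].dtype)
```

Readers can work around this with `dtype=` and `parse_dates=` arguments to `pd.read_csv`, but every consumer of the file then has to repeat them, whereas a parquet file stores the column types alongside the data.
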
| Name | Type | Path |
|:-------------|:---------:|:-------------------------------------------------------:|
| EEA | raw data | /eos/jeodpp/data/projects/LEGENT/eea |
| f500x | processed | /eos/jeodpp/data/projects/LEGENT/fiat500 |
| Geco Air | raw data | /eos/jeodpp/data/projects/LEGENT/realworld/geco |
| Travelcard | raw data | /eos/jeodpp/data/projects/LEGENT/realworld/travelcard |
| Spritmonitor | raw data | /eos/jeodpp/data/projects/LEGENT/realworld/spritmonitor |
| ATCT | raw data | /eos/jeodpp/data/projects/LEGENT/atct |
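
For scripts, the table above can be mirrored as a small lookup. This is only a sketch: `DATASET_DIRS` and `dataset_dir` are hypothetical names, not part of `uds4jrc`; the paths are copied from the table.

```python
from pathlib import PurePosixPath

PEOS = PurePosixPath("/eos/jeodpp/data/projects/LEGENT")

# Dataset name -> directory, copied from the table above.
DATASET_DIRS = {
    "EEA": PEOS / "eea",
    "f500x": PEOS / "fiat500",
    "Geco Air": PEOS / "realworld/geco",
    "Travelcard": PEOS / "realworld/travelcard",
    "Spritmonitor": PEOS / "realworld/spritmonitor",
    "ATCT": PEOS / "atct",
}


def dataset_dir(name):
    """Return the directory of dataset `name`, matching case-insensitively."""
    lookup = {key.lower(): path for key, path in DATASET_DIRS.items()}
    try:
        return lookup[name.lower()]
    except KeyError:
        known = sorted(DATASET_DIRS)
        raise KeyError(f"unknown dataset {name!r}; known: {known}") from None
```

Keeping the names in one mapping means a relocated dataset needs a change in a single place rather than in every script.
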