Code, files & db queries for the UDS data-lake on the BDAP linux platform.
* script to import file data into mongo db
* db queries in Python/R, and JSON files embedded in Python or R scripts
* sample code to make use of them
* data from the following C4 projects & activities:
  * EEA
  * DICE
  * commercial vehicle specs
  * ...
## 1. Installation
1. Read and apply [first-time `git` setup instructions](./GIT_SETUP.md)
2. Clone this repo inside your **"personal-eos" dir** (`/eos/jeodpp/home/users/$USER`):
```bash
cd /eos/jeodpp/home/users/$USER  # you may use $USTORE once the project is set up
git clone https://jeodpp.jrc.ec.europa.eu/apps/gitlab/use_cases/legent/uds-scripts \
&& cd uds-scripts
```
3. Launch the `Makefile::setup` & `install` rules with *GNU make*,
to set up your BDAP linux account for the first time and to install this project,
respectively:
```bash
make setup
make install
```
> **NOTE:** the 1st command modifies the unix rc-files in your BDAP-home;
> it's better to log out and log in again, to pick up the changes.
>
> **HINT:** the 2nd command will create a new conda environment,
> install dependencies, and install this Python project in "editable" mode
> (meaning, any code/script changes are immediately reflected).
> It might take some time to complete.
4. Create a new `secrets.py` file inside the cloned project and paste the db credentials
that will be provided to you by your db administrator.
5. Set up the *conda-env* as your IDE's Python interpreter:
* **PyCharm:** File --> Settings --> Project --> Python Interpreter
--> Press gear symbol and select “Add” --> Conda Environment
--> Click "Existing environment" --> Select uds-scripts
* **VSCode:** TODO
6. (optional) install and configure **nextcloud-agent** on your PC,
to sync local files with BDAP, like "dropbox" does (read section below).
* For more info on managing Conda environments, click [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
* For the BDAP PyCharm/conda documentation, click [here](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/jeodesk/Set-up-pycharm-with-conda-virtual-environment).
### 1.1. Nextcloud mapping of BDAP folders with your PC
The [*nextcloud* BDAP service](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/Document-Sharing)
is the only gateway for big data files into and out of BDAP
(the *gitlab server* is the other way in and out, but it's meant for code and smaller data-files). \
You may selectively upload files through [its web-based service](https://jeodpp.jrc.ec.europa.eu/apps/cloud/),
but it's easier to install [the nextcloud-agent](https://nextcloud.com/install/#install-clients)
on your PC and have it sync files in the background, like "dropbox" does.
We suggest configuring nextcloud-agent to sync your dirs like this:
```text
Name BDAP path your PC path
========= ========================================= =============
private: /home/{user}/Documents <--> data
personal-eos: /eos/jeodpp/home/users/{user}/transfer <--> eos_user_home
public-eos: /eos/jeodpp/data/projects/LEGENT/transfer <--> eos_LEGENT
```
> **NOTE:** by default, BDAP maps the 2nd "personal-eos" folder directly to the `{user}` dir,
> thus leaving *no space for your experiments* there, beyond nextcloud's mappings;
> and this means wasting network bandwidth & storage on your PC while experimenting.
>
> Please reconfigure *nextcloud-agent* so as to replicate the setup of the "public-eos" dir,
> by creating a `transfer` subdir and limiting nextcloud to sync only that one.
Working with the above dirs is facilitated by these Python and
ENVIRONMENT variables (defined after installation):
```bash
export USTORE="/eos/jeodpp/home/users/${USER}"
export UTRANS="/eos/jeodpp/home/users/${USER}/transfer"
export PSTORE="/eos/jeodpp/data/projects/LEGENT"
export PTRANS="/eos/jeodpp/data/projects/LEGENT/transfer"
```
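On the Python side, the same four locations could be resolved from the environment, falling back to the paths above when the variables are not exported yet. This is only a sketch: `resolve_storage_paths` is a hypothetical helper, not the project's actual `Config` class, and its defaults simply mirror the `export` lines above.

```python
import os
from pathlib import PurePosixPath


def resolve_storage_paths(env=None):
    """Return the four storage dirs, preferring already-exported variables.

    Hypothetical helper (not part of this repo); the fallbacks mirror the
    `export` lines of the installation.
    """
    env = os.environ if env is None else env
    user = env.get("USER", "")
    ustore = PurePosixPath(env.get("USTORE", f"/eos/jeodpp/home/users/{user}"))
    pstore = PurePosixPath(env.get("PSTORE", "/eos/jeodpp/data/projects/LEGENT"))
    return {
        "USTORE": ustore,
        "UTRANS": PurePosixPath(env.get("UTRANS", ustore / "transfer")),
        "PSTORE": pstore,
        "PTRANS": PurePosixPath(env.get("PTRANS", pstore / "transfer")),
    }
```

Calling `resolve_storage_paths()` without arguments reads the real environment; passing a plain dict makes it easy to try out without touching your shell.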
## 2. Usage
* activate your conda-env:
  `conda activate uds-scripts`
  or launch PyCharm/VSCode.
* Create ANOTHER working folder under `$USTORE` (e.g. `$USTORE/idea-1`), so as to let others view (read-only)
  your intermediate results (or `cd` to your private home-dir, i.e. `~/idea-1`).
* *--(process data)--*
* run scripts & queries to get the data from the db (and possibly create files)
* BASH-HISTORY is your friend
* modify script-files in this repo
* update db (only for users with db admin credentials)
* copy/move specific files for/to the folders synced by the *nextcloud-agent*,
* or update the "public-eos" dir `PSTORE` (if you're certain about the results).
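The "working folder that others can view read-only" step could be scripted roughly as follows. This is a sketch under assumptions: `make_work_dir` is a hypothetical helper, and it presumes the storage honours plain POSIX permission bits (which may not hold on EOS, where access can be governed by other mechanisms).

```python
import stat
from pathlib import Path


def make_work_dir(base, name="idea-1"):
    """Create `base/name` if missing and open it read-only to group and others.

    Hypothetical helper; `base` would typically be the value of $USTORE.
    """
    work_dir = Path(base) / name
    work_dir.mkdir(parents=True, exist_ok=True)
    # rwx for the owner, r-x for group and others (i.e. mode 0o755)
    work_dir.chmod(
        stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH
    )
    return work_dir
```

For instance, `make_work_dir(os.environ["USTORE"])` (after `import os`) would create `idea-1` under your personal-eos dir; calling it again is harmless.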
### 2.1. Sample code
```python
import pandas as pd

from config import Config
from db import eea_2020_flattened
# save_to_excel is assumed to live in utils, next to save_to_parquet
from utils import save_to_excel, save_to_parquet

# load the whole collection into a DataFrame
df = pd.DataFrame.from_records(eea_2020_flattened.find())
...  # process your data
save_to_parquet(df, Config.USTORE, 'test.parquet')
save_to_excel(df, Config.USTORE, 'test.xlsx')
```
> **TIP:** Don't store your pandas dataframes in CSVs: they are big, slow, and lose precision.
> Use excel-files when sharing data: they are also big, but they keep precision and are not very slow.
> Or preferably, store them in parquet.
>
> Maybe it's not the best idea to use BDAP's home-dirs for experimenting with big data:
> https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/howto/data/Access-and-transfer-files#storage-places-not-to-be-used-for-data-storage
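The sample code loads a whole collection with a bare `find()`. While experimenting, it may be handier to pull only a few fields or a limited number of documents. The sketch below assumes that `eea_2020_flattened` behaves like a pymongo collection (i.e. `find(filter, projection)` returns a cursor with a `limit()` method); `fetch_records` itself is a hypothetical helper, not something this repo provides.

```python
def fetch_records(collection, query=None, fields=None, limit=0):
    """Fetch documents as a list of dicts, optionally restricted to `fields`.

    `collection` is assumed to be a pymongo-style collection.
    """
    projection = {"_id": 0}  # leave out Mongo's ObjectId column
    projection.update({name: 1 for name in fields or ()})
    cursor = collection.find(query or {}, projection)
    if limit:
        # limit == 0 means "no limit", so only narrow the cursor when asked
        cursor = cursor.limit(limit)
    return list(cursor)
```

The result can be fed straight into `pd.DataFrame.from_records(...)`, e.g. `fetch_records(eea_2020_flattened, limit=1000)` for a quick look at the first thousand documents.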