# DSA Transparency database tools
A set of tools to work with daily or total dumps coming from the DSA Transparency Database.
## Requirements
The Transparency database is a large dataset. As of October 2024, you will need at least:
- 4.1 TB of disk space to store the daily dump files as downloaded from the DSA Transparency Database website;
- 500 GB to store the daily dumps in "chunked" form (see the documentation below);
- 1 GB to store the aggregated dataset with the default aggregation configuration.
Overall, the incoming data volume is in the range of 5 to 10 GB per day, meaning you should have, as a bare minimum, 5 GB of free disk space per daily dump you want to process.
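As a quick check before downloading anything, you can verify the free space on the volume that will host the data (the path below is a placeholder):

```bash
# Show the free space on the disk that will store the dumps and the
# chunked/aggregated files.
df -h /path/to/data
```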
The "DSA Transparency database tools" Python package aims at making working with such a large dataset easier by providing convenience functions to convert from the raw dumps to more efficient data storage as well as scripts to handle the conversion over a sliding time window to reduce the disk space requirements (see documentation below).
## Installation
### With Docker/podman (recommended)
- Install Docker or podman (recommended) on your machine. If using podman, replace `docker` with `podman` in the following commands, or install the `podman-docker` extension to get a compatible CLI. You can also download the Desktop versions of the two (here for Docker and here for podman).
- Then:
  - Automatic way (recommended): use the `dsa_tdb` ==> FIXME.
  - Manual way: download the zip of the repository, unzip it, `cd` into it and run `docker build -t localhost:your_tag .`.
- Run the container mounting the notebooks you want to use under `/notebooks` and the data folder under `/data`, for instance: `docker run --rm -v /path/to/your/nb:/notebooks -v /path/to/data:/data -p 8765:8765 dsa_tdb:latest`.
  - A `/cache` directory can also be mounted; it will be used as the `spark.local.dir` where Spark saves its cache.
  - Otherwise, to use the tools interactively, run `docker run --rm -v /path/to/your/nb:/notebooks -v /path/to/data:/data dsa_tdb:latest /bin/bash` and you will get a shell where the `dsa-tdb-cli` command is available (see the documentation).
NOTE: The default user in the Docker container is `user`, with user and group id `1000`. These can be changed to match your user name and id by specifying `DOCKER_USER=your_user` and `DOCKER_USER_ID=1234` when building, for example: `podman build --build-arg DOCKER_USER=$(id -un) --build-arg DOCKER_USER_ID=$(id -u) -t localhost/dsa-tdb-nb .` from the folder containing the code and Dockerfile.
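As a concrete sketch, building the image from a local checkout and matching the container user to your host user could look like this (the image tag is a placeholder; the `--build-arg` options are only needed if your user id differs from `1000`):

```bash
# Build the image from the folder containing the code and the Dockerfile,
# matching the container user to the current host user.
docker build \
  --build-arg DOCKER_USER=$(id -un) \
  --build-arg DOCKER_USER_ID=$(id -u) \
  -t localhost/dsa-tdb:latest .
```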
#### Ports

The container exposes the following ports; use `-p your_host_port:docker_port` to change the ports used on your host if they are already busy:

- `8765`: the Jupyter Lab home
- `4040`: the Spark status page
- `5555`: Celery's Flower dashboard to check the status
- `8000`: the FastAPI webapp (visit the docs to see the API usage)
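Putting the mounts and the port remapping together, a full run command might look like the following sketch (the paths and the host-side ports are arbitrary placeholders):

```bash
# Run the container with notebooks, data and Spark cache mounted,
# remapping each exposed port to a free port on the host.
docker run --rm \
  -v /path/to/your/nb:/notebooks \
  -v /path/to/data:/data \
  -v /path/to/cache:/cache \
  -p 18765:8765 \
  -p 14040:4040 \
  -p 15555:5555 \
  -p 18000:8000 \
  dsa_tdb:latest
```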
NOTE: If using podman, make sure to set `--userns=keep-id`, either as a command line option, via the environment variable `PODMAN_USERNS=keep-id`, or in the Security -> "Specify user namespace to use" tab of the "Create container" interface. This is needed for the mounted folders to be writable by the container's user.
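For example, the interactive invocation shown above becomes (a sketch; adapt the image tag to the one you built):

```bash
# Keep the host user's id inside the container so that the mounted
# folders stay writable.
podman run --rm --userns=keep-id \
  -v /path/to/your/nb:/notebooks \
  -v /path/to/data:/data \
  dsa_tdb:latest /bin/bash
```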
### From source (with poetry)

- Install poetry `^1.8` on your system, either with `pip install --user "poetry>=1.8"` or another method.
- Download and extract the code folder and `cd` into it.
- Create the venv and install the dependencies with `poetry install` (add `--with dev` if you also want the Jupyter notebook kernel and the developer tools).
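In short, a typical from-source setup could look like this sketch (run from the extracted code folder):

```bash
# Install poetry, create the virtual environment with the developer
# extras, and check that the CLI is available inside it.
pip install --user "poetry>=1.8"
poetry install --with dev
poetry run dsa-tdb-cli --help
```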
## Usage
### CLI
The package installs a command line interface (CLI), exposing the `dsa-tdb-cli` command on your path.
The command has three subcommands:
- `preprocess` will download the specified daily dumps (optionally filtered by platform or time window), verify their `SHA1` checksums and check for new files, then chunk them into smaller csv or parquet files. Optionally, it will delete the original dumps as they are processed (to save disk space), leaving the sha1 files as a proof of work. This allows the `preprocess` step to be run repeatedly (e.g. on a daily basis) so that the files are always in place. The resulting "chunked" files are stored as regular flat csv or parquet files which can be conveniently and efficiently loaded into the data processing pipeline of your choice (Spark, Dask, etc.) without having to go through the complex data structure of the daily dumps (zipped csv files).
- `aggregate` will use a separate configuration file (a template of which is provided in the repo under the Aggregation Configuration Template) to perform the aggregation, that is, to count the number of Statements of Reasons (SoRs) corresponding to a given combination of fields in the database (such as `content_date`, `platform_name`, `category`, etc.):
  - This command considerably reduces the size of the database by aggregating similar rows together: each statement of reasons is a separate row in the chunked data files, and rows that share the same values of the fields defined in the aggregation configuration are represented as a single row with an incremented count in the aggregated files.
  - This command will also write an auxiliary csv file (with the same name as the `out_file_name`) containing the files and dates of the daily dumps used for the aggregation.
  - It will also save a copy of the configuration file used in the same folder as the output file, with the same name and the `configuration.yaml` suffix, for later reference.
  - If the aggregation mode is set to `append` in the configuration, it will load only the files that are not already listed in the (possibly existing) dates auxiliary file and will append the aggregated data to the (possibly already existing) output file. Note that the `append` mode only works if:
    - the schema of the aggregated data is the same as that of the existing file,
    - the input files are in the same relative or global path as found in the dates auxiliary file,
    - and the `parquet` output format is used.

  NOTE: if grouping on the `created_at` column, all the files produced with the `append` mode will have to be aggregated again on the desired keys, as there is no guarantee that all the SoRs from one day are in the corresponding daily dump file.
- `filter` will use a separate configuration file (a template of which is provided in the repo under the Filtering Configuration Template) to filter the raw SoRs, that is, to keep only the ones respecting all the filters set (in an "AND" fashion).
  - This command will also write an auxiliary csv file (with the same name as the `out_file_name`) containing the files and dates of the daily dumps used for the filtering. It will also save a copy of the configuration file used in the same folder as the output file, with the same name and the `configuration.yaml` suffix, for later reference.
  - If the filtering mode is set to `append` in the configuration, it will load only the files that are not already listed in the (possibly existing) dates auxiliary file and will append the filtered data to the (possibly already existing) output file. Note that the `append` mode only works if:
    - the schema of the filtered data is the same as that of the existing file,
    - the input files are in the same relative or global path as found in the dates auxiliary file,
    - and the `parquet` output format is used.
You can see the help and documentation of the CLI by running `dsa-tdb-cli --help` or `dsa-tdb-cli <subcommand> --help`.
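For example, to list the global options and those accepted by each of the three steps:

```bash
# Global help and per-subcommand help for the three processing steps.
dsa-tdb-cli --help
dsa-tdb-cli preprocess --help
dsa-tdb-cli aggregate --help
dsa-tdb-cli filter --help
```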
### Scripts

The `scripts` folder contains some examples of how to use the library. They can also be used as-is, in an automated manner, to ingest and process the data dumps on a daily basis (e.g. with a cron task).
There are two examples:

- `scripts/daily_routine.py` is a script that can be called with the platform name and dump version (`full` or `light`). Without any further arguments, it will:
  - preprocess (download and chunk) all the missing/newest daily dumps from the full version of the daily dumps for all available platforms,
  - aggregate them using the default configuration,
  - (optionally) delete the chunked files to save disk space.
- `scripts/download_platform.py` is a subset of the previous script: it just preprocesses (downloads and chunks) the files for a specific platform and version (`full` or `light`).
NOTE: The daily routine script can be called on a daily basis and it will update the files and dumps with the newest ones (leaving the latest as a checkpoint for the next run).
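As a sketch, scheduling the daily routine with cron could look like the following crontab entry (the installation path and log file are placeholders, and a from-source installation managed with poetry is assumed; check the script's own help for the arguments it accepts):

```bash
# Run the daily routine every day at 06:00, relying on its no-argument
# default behaviour (preprocess all platforms' full dumps, then aggregate
# with the default configuration), and keep a log of each run.
0 6 * * * cd /path/to/dsa_tdb && poetry run python scripts/daily_routine.py >> /path/to/logs/daily_routine.log 2>&1
```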
### Notebooks
An example usage notebook is available in `docs/Example.ipynb`.
## License
`dsa-tdb` is licensed under the European Union Public Licence (EUPL) version 1.2. See the LICENSE for details.
## Documentation
Documentation about the fields and values can be found in the official API documentation.
FIXME: Add a link to the built documentation for the Python package.