Data Ingestion With RenkuLab

This project uses RenkuLab for reproducible sessions and persistent data storage. Code in this repository accesses the public APIs; RenkuLab Data Connectors provide the mounted folders for raw and curated data.

Repository Structure

configs/sources.yml      # source registry
scripts/ingest.py        # first ingestion runner
data/raw/                # local raw snapshots, ignored by Git
data/curated/            # local cleaned MVP outputs, ignored by Git
environment.yml          # Python/Jupyter environment for RenkuLab

data/raw/ and data/curated/ are not committed to Git. In local development, the ingestion script writes there directly. In RenkuLab, data connectors are mounted as sibling folders next to the repository, for example ../govtech-raw and ../govtech-curated. The script and notebook automatically use those Renku mounts when they exist.
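The mount detection described above can be sketched as follows. This is a minimal illustration, not the code in scripts/ingest.py: the function name resolve_data_dir and its parameters are hypothetical, and it assumes that "exists" simply means the sibling folder is present as a directory.

```python
from pathlib import Path


def resolve_data_dir(repo_root: Path, local_name: str, mount_name: str) -> Path:
    """Pick the Renku mount next to the repository if present, else the local folder.

    Hypothetical helper: the real script may detect mounts differently.
    """
    mount = repo_root.parent / mount_name       # e.g. ../govtech-raw
    if mount.is_dir():
        return mount
    return repo_root / "data" / local_name      # e.g. data/raw
```

Under these assumptions, resolve_data_dir(Path.cwd(), "raw", "govtech-raw") returns ../govtech-raw inside a RenkuLab session with the connector mounted and data/raw on a local checkout.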

Local Test

python scripts/ingest.py --list
python scripts/ingest.py --source opendata_swiss

Each successful run writes two files per source:

data/raw/<source_id>/<timestamp>/payload.json
data/raw/<source_id>/<timestamp>/metadata.json

metadata.json records the source, the access path, the format, and the exact retrieval timestamp.
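For orientation, a snapshot writer producing this layout could look like the sketch below. The function name write_snapshot, the metadata key names (source, access_path, format, retrieved_at), and the UTC timestamp format are assumptions made for illustration; the actual script may use different names and formats.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_snapshot(raw_dir, source_id, payload, access_path, fmt):
    """Write payload.json and metadata.json under <raw_dir>/<source_id>/<timestamp>/."""
    retrieved_at = datetime.now(timezone.utc)
    # Assumed folder name format; sorts chronologically as plain text.
    stamp = retrieved_at.strftime("%Y%m%dT%H%M%SZ")
    snapshot_dir = Path(raw_dir) / source_id / stamp
    snapshot_dir.mkdir(parents=True, exist_ok=True)

    (snapshot_dir / "payload.json").write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    metadata = {
        "source": source_id,           # assumed key names
        "access_path": access_path,
        "format": fmt,
        "retrieved_at": retrieved_at.isoformat(),
    }
    (snapshot_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return snapshot_dir
```

Keeping the payload and its metadata in the same timestamped folder means each snapshot can be interpreted on its own, without consulting the source registry.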

RenkuLab Steps

  1. Create or open the project in RenkuLab.
  2. Connect the GitHub repository https://github.com/allsparkswiss-hub/GovTECH.
  3. Select the branch renku-data-connectors.
  4. Create a Data Connector govtech-raw from the Google Drive folder GovTECH-Renku/raw.
  5. Create a Data Connector govtech-curated from the Google Drive folder GovTECH-Renku/curated.
  6. Create a session launcher from the repository and environment.yml.
  7. Start a session and test ingestion:
     python scripts/ingest.py --list
     python scripts/ingest.py --all
     find ../govtech-raw -maxdepth 3 -type f | head

The same python scripts/ingest.py --all command works locally and in RenkuLab. Locally, the default target is data/raw; in a RenkuLab session, the script writes to the mounted connector ../govtech-raw when it exists.
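As a complement to the find command in step 7, the newest snapshot per source can be listed from Python. This sketch is not part of the repository; the helper name latest_snapshots is hypothetical, and it assumes the timestamp folder names sort chronologically as plain text.

```python
from pathlib import Path


def latest_snapshots(raw_dir):
    """Map each source_id folder under raw_dir to its newest timestamp folder."""
    latest = {}
    root = Path(raw_dir)
    if not root.is_dir():
        return latest
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumption: timestamp names sort chronologically as strings.
        runs = sorted(p for p in source_dir.iterdir() if p.is_dir())
        if runs:
            latest[source_dir.name] = runs[-1]
    return latest
```

Calling latest_snapshots("../govtech-raw") in a RenkuLab session, or latest_snapshots("data/raw") locally, gives a quick view of which sources have been ingested at least once.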

MVP Sources

The first MVP run uses thirteen enabled sources:

  • opendata_swiss
  • sfoe_energy_balance_csv
  • meteo_swiss_smn
  • geoadmin_army_nature_landscape
  • geoadmin_civil_protection_meeting_points
  • geoadmin_surface_runoff_hazard
  • geoadmin_nuclear_emergency_zones
  • armasuisse_st_publications
  • aramis_armasuisse_research_projects
  • parliament_affairs
  • lindas
  • fedlex
  • bfs_pxweb

The sources cover energy, climate/weather, population exposure, public geodata, armasuisse technology publications, federal research projects, parliamentary signals, and regulatory signals. The data is initially stored as raw snapshots under data/raw/; curated tables and indicators can later be written to data/curated/.