FrenchAdmin processes French tax data for RAG and GraphRAG in AI applications. The data covers tax law (CGI and its annexes, Assemblée Nationale), doctrine (BOFiP, Bercy) and case law (JADE, Conseil d'État). The pipeline downloads, processes, and embeds the data, then stores it in PostgreSQL with PgVector for vector search and in FalkorDB for knowledge graph relationships.
Key capabilities:
- LEGI: French legislative texts (Code Général des Impôts, LPF, etc.)
- JADE: Judicial decisions from French courts
- BOFiP: Tax guidance documents (Bulletin Officiel des Finances Publiques)
- Cross-reference inference: Automatic linking between JADE/BOFiP and LEGI articles for RAG and graphRAG
1. Install the required apt dependencies:
sudo apt-get update
sudo apt-get install -y $(cat config/requirements-apt-container.txt)
2. Create and activate a virtual environment:
python3 -m venv .venv  # Create the virtual environment
source .venv/bin/activate  # Activate the virtual environment
3. Install the required Python dependencies:
pip install -e .
Installing in development mode (-e) allows you to use the mediatech command and modify the code without reinstalling.
Note: Make sure your environment is properly configured before continuing.
4. Set up the environment variables in a .env file based on the example in .env.example.
5. Export the .env variables:
export $(grep -v '^#' .env | xargs)
6. Start the containers with Docker:
docker compose up -d
7. Verify containers are running:
docker ps
You should see:
- pg: PostgreSQL with PgVector (vector search)
- falkor: FalkorDB (graph database)
IMPORTANT: Clean config/data_history.json
After installation, the mediatech command is available globally and replaces python main.py.
If you encounter issues with the mediatech command, you can still use python main.py instead.
The main.py file is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.
You can use it as follows:
mediatech <command> [options]
or
python main.py <command> [options]
Command examples:
- View help:
mediatech --help
- Create PostgreSQL tables:
mediatech create_tables --model BAAI/bge-m3
- Download all files listed in data_config.json:
mediatech download_files --all
- Download files from the legi source:
mediatech download_files --source legi
- Download and process all files listed in data_config.json:
mediatech download_and_process_files --all --model BAAI/bge-m3
- Process all data:
mediatech process_files --all --model BAAI/bge-m3
- Split a table into subtables based on different criteria (see main.py):
mediatech split_table --source legi
- Export PostgreSQL tables to parquet files:
mediatech export_tables --output data/parquet
- Upload parquet datasets to the Hugging Face repository:
mediatech upload_dataset --input data/parquet/bofip.parquet --dataset-name bofip
- Add full-text search columns for hybrid search (only needed for databases created before this feature; fresh installs get it via create_tables):
mediatech add_fts_columns
- Generate BGE-M3 sparse embeddings for 3-way retrieval fusion (required after process_files):
mediatech add_sparse_embeddings
Run mediatech --help in your terminal to see all available options, or check the code directly in main.py.
If you prefer to use the Python script directly, you can always use:
python main.py <command> [options]
Examples:
python main.py download_files
python main.py create_tables --model BAAI/bge-m3
python main.py process_files --all --model BAAI/bge-m3

pip install -U hf_transfer
pip install -U huggingface_hub
The processing pipeline now exposes optimization switches via environment variables:
export ENABLE_BATCH_EMBEDDING=true
export ENABLE_FAST_DB_INSERT=true
export ENABLE_BATCH_GRAPH_UPSERT=true
export ENABLE_PARALLEL_PROCESSING=false
export ENABLE_PERF_TELEMETRY=true
Tuning variables:
export EMBEDDING_BATCH_MAX_SIZE=64
export FAST_DB_INSERT_PAGE_SIZE=1000
export MAX_WORKERS=4
export BATCH_SIZE_DOCS=32
When telemetry is enabled, each run writes a JSON report in data/perf_reports/.
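A minimal sketch of how such switches could be read on the Python side. The helpers `env_flag`, `env_int`, and `load_pipeline_settings` are hypothetical names, not taken from the project, and the fallback values simply mirror the example exports above; they are not confirmed defaults.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean switch such as ENABLE_BATCH_EMBEDDING (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read a positive integer tuning variable such as MAX_WORKERS (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value


def load_pipeline_settings() -> dict:
    """Collect the switches and tuning variables listed above into one dict."""
    return {
        # Fallbacks mirror the example exports; they are assumptions, not project defaults.
        "batch_embedding": env_flag("ENABLE_BATCH_EMBEDDING", True),
        "fast_db_insert": env_flag("ENABLE_FAST_DB_INSERT", True),
        "batch_graph_upsert": env_flag("ENABLE_BATCH_GRAPH_UPSERT", True),
        "parallel_processing": env_flag("ENABLE_PARALLEL_PROCESSING", False),
        "perf_telemetry": env_flag("ENABLE_PERF_TELEMETRY", True),
        "embedding_batch_max_size": env_int("EMBEDDING_BATCH_MAX_SIZE", 64),
        "fast_db_insert_page_size": env_int("FAST_DB_INSERT_PAGE_SIZE", 1000),
        "max_workers": env_int("MAX_WORKERS", 4),
        "batch_size_docs": env_int("BATCH_SIZE_DOCS", 32),
    }
```

Parsing the values once at startup keeps typos visible early: a non-numeric MAX_WORKERS raises immediately instead of failing in the middle of a run.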
You can run the fixed-sample benchmark helper and enforce a regression gate:
python scripts/benchmark_pipeline.py \
--command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
--runs 3 \
--run-prefix process_legi \
--reports-dir data/perf_reports
Optional baseline gate (fails if runtime degrades by more than 10%):
python scripts/benchmark_pipeline.py \
--command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
--runs 3 \
--run-prefix process_legi \
--baseline data/perf_reports/process_legi_baseline.json \
--regression-threshold 0.10
Using the update.sh Script
The update.sh script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.
To run it, execute the following command from the project root:
./scripts/update.sh
This script will:
- Wait for the PostgreSQL database to be available,
- Create or update the necessary tables in the PostgreSQL database,
- Download public files listed in data_config.json,
- Process and vectorize the data,
- Export the tables in Parquet format,
- Upload the Parquet files to Hugging Face.
- main.py: Main entry point with CLI for pipeline commands.
- pyproject.toml: Python project and dependency configuration.
- Dockerfile: Docker image for containerized execution, installs system dependencies and project packages.
- docker-compose.yml: Multi-container setup: PostgreSQL (PgVector) + FalkorDB.
- .github/: GitHub Actions workflows for CI/CD.
- download_and_processing/: Scripts to download and extract files from DILA (LEGI, JADE) and data.economie.gouv.fr (BOFiP).
- database/: Database management (table creation, data insertion, FalkorDB graph operations).
- docs/: Documentation and tutorials.
  - docs/hugging_face_rag_tutorial.ipynb: RAG tutorial: how to load datasets from Hugging Face and use them in a RAG pipeline.
  - docs/reconstruct_vector_database.ipynb: Tutorial: how to reconstruct a dataset without chunking and embedding from parquet files.
  - docs/fr/: French translations of documentation.
- utils/: Shared utilities (chunking, embedding, HuggingFace, telemetry).
- config/: Project configuration (data sources, embedding models, optimization flags).
- logs/: Log files from script execution.
- scripts/: Shell scripts for pipeline automation.
  - scripts/update.sh: Run the entire data processing pipeline.
  - scripts/periodic_update.sh: Automate pipeline via cron.
  - scripts/backup.sh: Back up PostgreSQL volume and config files.
  - scripts/restore.sh: Restore PostgreSQL volume and config.
  - scripts/initial_deployment.sh: Set up a new server (Docker, dependencies).
  - scripts/containers_deployment.sh: Build and deploy Docker containers.
  - scripts/delete_old_files.sh: Delete old files from logs/ and backups/ directories.
  - scripts/manage_checkpoint.sh: Manage checkpoint files for processing.
  - scripts/write_tchap_message.sh: Send notifications to Tchap (French government chat).
- CROSSREFERENCE.md: Technical specification for JADE/BOFiP → LEGI cross-reference inference (RAG/graphRAG).
Files created/modified:
┌─────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ File │ Purpose │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/cross_reference_manage.py │ 3 new tables, catalog refresh, mention/edge CRUD │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/__init__.py │ Added cross-reference exports │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/database_manage.py │ Wired create_cross_reference_tables() + init_graph_schema() into create_all_tables │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/graph_manage.py │ inject_cross_reference_edges() for APPLIES_TO/INTERPRETS │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/__init__.py │ Package exports │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/normalizer.py │ Primary + loose article number normalization │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/alias_detector.py │ CGI/LPF/CIBS family alias detection │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/extractor.py │ Article token extraction with enumeration support │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/resolver.py │ Full cascade: A→B→C→D→E │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/fuzzy_resolver.py │ rapidfuzz scoped fallback │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/semantic_resolver.py │ Cosine-distance semantic fallback │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/confidence.py │ Confidence scoring with adjustments │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/pipeline.py │ Orchestrator: catalog refresh → doc aggregation → extraction → resolution → edges → graph │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ main.py │ Added infer_crossreferences CLI command │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ pyproject.toml │ Added rapidfuzz, crossreference* to packages │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘
Usage:
python main.py infer_crossreferences --source all
python main.py infer_crossreferences --source jade
python main.py infer_crossreferences --source bofip --debug
Resolution cascade per mention:
A. Exact normalized + temporal + family → 0.99
B. Loose key fallback → 0.92
C. Family-prior deterministic → 0.84
D. Fuzzy scoped (rapidfuzz, ≥96) → 0.74
E. Semantic scoped (cosine <=>) → 0.62
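The cascade above reads as a first-match-wins chain: stages are tried in order A to E, and the first stage that returns a target fixes the base confidence. The sketch below illustrates that control flow only. `resolve_mention` and the stage callables are hypothetical stand-ins, the project's resolver.py may be organized differently, and the adjustments applied by confidence.py are not modelled.

```python
from typing import Callable, Optional

# Base confidence per stage, copied from the cascade listed above.
STAGE_CONFIDENCE = {"A": 0.99, "B": 0.92, "C": 0.84, "D": 0.74, "E": 0.62}

# A stage resolver maps a mention string to a LEGI article id, or None if it cannot.
StageResolver = Callable[[str], Optional[str]]


def resolve_mention(
    mention: str, stages: dict[str, StageResolver]
) -> Optional[tuple[str, str, float]]:
    """Try stages A..E in order and stop at the first one that resolves."""
    for stage in ("A", "B", "C", "D", "E"):
        resolver = stages.get(stage)
        if resolver is None:
            continue  # stage not configured in this run
        target = resolver(mention)
        if target is not None:
            return target, stage, STAGE_CONFIDENCE[stage]
    return None  # unresolved mention


# Toy usage: hypothetical stage functions backed by an in-memory catalog.
catalog = {"200": "legi-article-200"}
stages = {
    "A": lambda mention: catalog.get(mention),
    "B": lambda mention: catalog.get(mention.strip().lower().removeprefix("art. ")),
}
print(resolve_mention("200", stages))
print(resolve_mention("Art. 200", stages))
print(resolve_mention("unknown", stages))
```

Ordering the stages from strictest to loosest means a cheap exact match is never overridden by a fuzzy or semantic guess, which is why the confidence values decrease down the list.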
The project includes a full GraphRAG frontend for querying the knowledge graph, performing vector search, and generating LLM-synthesized answers.
[React Frontend (Vite + TypeScript + Tailwind)]
        |
        v  (REST + SSE)
[FastAPI Backend]
        |
   +----+----+
   |         |
   v         v
[PostgreSQL] [FalkorDB]
 (pgvector)   (Cypher)
- Backend: FastAPI serving REST endpoints + SSE streaming for LLM synthesis
- Frontend: React with react-force-graph-2d for interactive graph visualization
- LLM: OpenRouter (configurable model) for RAG answer generation
| Feature | Description |
|---|---|
| Natural language search | Cosine similarity search across LEGI, JADE, BOFiP with graph augmentation |
| Graph Explorer | Interactive force-directed visualization of the knowledge graph |
| Cross-Reference Browser | Paginated table of inferred JADE/BOFiP → LEGI links with confidence scores |
| Document Detail | Full document view with metadata, chunks, and cross-references |
| LLM Synthesis | Streaming answers grounded in retrieved documents via OpenRouter |
- PostgreSQL + FalkorDB running (via docker compose up -d)
- Node.js 18+ (for frontend build)
- Python 3.10+ with project dependencies installed
mediatech create_tables --model BAAI/bge-m3 # Creates tables + FTS columns/triggers + sparse column
mediatech download_and_process_files --all --model BAAI/bge-m3 # Download, chunk, embed, insert
mediatech add_sparse_embeddings # Generate BGE-M3 sparse lexical weights
mediatech infer_crossreferences --source all --model BAAI/bge-m3 # Link JADE/BOFiP → LEGI
Note: add_fts_columns is NOT needed on fresh installs; the FTS trigger is created by create_tables. Only run it on databases that predate the hybrid search feature.
Note: add_sparse_embeddings must be run after data is loaded. There is no trigger for sparse embeddings (encoding requires the FlagEmbedding model).
# Set required environment variables in .env:
# API_URL=https://openrouter.ai/api/v1
# API_KEY=your_openrouter_api_key
# LLM_MODEL=openrouter/hunter-alpha
docker compose up --build
The web interface will be available at http://localhost:8080.
1. Install Python dependencies:
pip install -e .
This installs FastAPI, Uvicorn, SSE-Starlette, and all other dependencies.
2. Install frontend dependencies and build:
cd web/frontend
npm install
npm run build
cd ../..
3. Start the backend (serves both API and built frontend):
uvicorn web.app:app --reload --port 8080
The app is available at http://localhost:8080.
4. (Optional) Frontend dev mode with hot-reload:
# Terminal 1: backend
uvicorn web.app:app --reload --port 8080
# Terminal 2: frontend dev server (proxies /api to backend)
cd web/frontend
npm run dev
Frontend dev server runs at http://localhost:5173 with API proxy to port 8080.
| Method | Path | Description |
|---|---|---|
| POST | /api/search | GraphRAG search (vector + graph fusion) |
| POST | /api/graph/neighbors | Get N-hop graph neighborhood |
| GET | /api/graph/context/{doc_id} | Get all relationships for a document |
| GET | /api/graph/subgraph?doc_ids=... | Get subgraph connecting documents |
| GET | /api/documents/{source_type}/{doc_id} | Full document detail |
| GET | /api/crossrefs | Paginated cross-reference listing |
| POST | /api/synthesize | LLM RAG synthesis (SSE streaming) |
| GET | /api/health | Health check |
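The search endpoint can also be called from Python. The sketch below only builds the request with the standard library; the payload fields are the ones used in this section's curl example, and `build_search_request` is a hypothetical helper name. The response schema is not documented here, so the sketch makes no assumption about it.

```python
import json
import urllib.request


def build_search_request(
    base_url: str,
    query: str,
    source_types: list[str],
    top_k: int = 10,
    min_confidence: float = 0.6,
) -> urllib.request.Request:
    """Build a POST /api/search request (hypothetical helper, stdlib only)."""
    payload = {
        "query": query,
        "source_types": source_types,
        "top_k": top_k,
        "min_confidence": min_confidence,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "http://localhost:8080",
    "conditions application article 200 CGI",
    ["legi", "jade"],
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` sends the request once the backend is running on port 8080.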
curl -X POST http://localhost:8080/api/search \
-H "Content-Type: application/json" \
-d '{
"query": "conditions application article 200 CGI",
"source_types": ["legi", "jade"],
"top_k": 10,
"min_confidence": 0.6
}'
The GraphRAG web interface uses the same environment variables as the rest of the project:
| Variable | Default | Description |
|---|---|---|
| POSTGRES_HOST | localhost | PostgreSQL host |
| POSTGRES_PORT | 5433 | PostgreSQL port |
| POSTGRES_DB | mediatech | Database name |
| POSTGRES_USER | user | Database user |
| POSTGRES_PASSWORD | password | Database password |
| FALKORDB_HOST | localhost | FalkorDB host |
| FALKORDB_PORT | 6379 | FalkorDB port |
| FALKORDB_GRAPH_NAME | frenchadmin | Graph name |
| EMBEDDING_MODEL | BAAI/bge-m3 | Embedding model (1024-dim) |
| API_URL | https://openrouter.ai/api/v1 | LLM API base URL |
| API_KEY | (none) | OpenRouter API key |
| LLM_MODEL | openrouter/hunter-alpha | LLM model for synthesis |
| WEB_PORT | 8080 | Web server port (Docker) |
| RERANKER_MODEL | BAAI/bge-reranker-v2-m3 | Cross-encoder for relevance reranking |
| RERANKER_MAX_LENGTH | 1024 | Max tokens the reranker sees per document |
| RERANKER_MIN_SCORE | 0.01 | Minimum reranker score to keep a result |
| ENABLE_HYBRID_SEARCH | true | Combine vector + full-text + sparse search via RRF |
| RRF_K | 60 | RRF smoothing constant |
| FTS_WEIGHT | 1.0 | Relative weight of FTS vs vector in RRF |
| FTS_MODE | auto | FTS strategy: auto (AND for ≤3 words, OR for 4+), and, or |
| ENABLE_QUERY_EXPANSION | true | Expand legal acronyms and synonyms before search |
| ENABLE_SPARSE_SEARCH | true | BGE-M3 sparse retrieval (requires add_sparse_embeddings) |
| SPARSE_WEIGHT | 1.0 | Relative weight of sparse retrieval in RRF |
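The FTS_MODE=auto rule in the table above (AND for queries of up to 3 words, OR for 4 or more) can be sketched as a small function. `choose_fts_operator` is a hypothetical name, and counting words by splitting on whitespace is an assumption; the project may tokenize the query differently.

```python
def choose_fts_operator(query: str, mode: str = "auto") -> str:
    """Return "and" or "or" for combining full-text search terms (hypothetical helper)."""
    mode = mode.strip().lower()
    if mode in {"and", "or"}:
        return mode  # explicit modes bypass the word-count heuristic
    if mode != "auto":
        raise ValueError(f"Unsupported FTS_MODE: {mode!r}")
    # Assumption: words are whitespace-separated tokens.
    word_count = len(query.split())
    return "and" if word_count <= 3 else "or"


print(choose_fts_operator("article 200 CGI"))
print(choose_fts_operator("conditions application article 200 CGI"))
```

Short queries stay strict because every word is likely essential, while longer queries switch to OR so that a single missing term does not discard an otherwise relevant chunk.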
1. User enters a natural language query in French
2. Query expansion: legal acronyms (EURL→"entreprise unipersonnelle…") and synonyms (dirigeant→gérant) are appended
3. Three-way hybrid retrieval (oversampled 4×):
   - Dense: query embedded via BAAI/bge-m3 (1024-dim) → cosine similarity search via pgvector HNSW
   - FTS: adaptive AND/OR plainto_tsquery('french', ...) on title-weighted tsvector GIN indexes
   - Sparse: BGE-M3 learned lexical weights → JSONB key overlap + dot product scoring
4. All three result lists are fused using Reciprocal Rank Fusion (RRF)
5. Top results are expanded via FalkorDB graph traversal (APPLIES_TO, INTERPRETS, REFERENCES edges)
6. Cross-encoder reranking (BAAI/bge-reranker-v2-m3) assigns sigmoid-normalized relevance scores
7. Results below RERANKER_MIN_SCORE are filtered out
8. (Optional) Retrieved context is sent to the LLM for a synthesized answer with source citations
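The fusion step can be illustrated with the textbook weighted form of Reciprocal Rank Fusion, where each list contributes weight / (k + rank) to a document's score. This is a sketch under assumptions: the function name and list labels are hypothetical, and whether the project applies FTS_WEIGHT and SPARSE_WEIGHT exactly this way is not confirmed by this document.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    ranked_lists: dict[str, list[str]],
    weights: Optional[dict[str, float]] = None,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked id lists; each list adds weight / (k + rank), with rank starting at 1."""
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for name, doc_ids in ranked_lists.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion(
    {
        "dense": ["a", "b", "c"],
        "fts": ["b", "a"],
        "sparse": ["b", "d"],
    },
    weights={"fts": 1.0, "sparse": 1.0},  # stand-ins for FTS_WEIGHT and SPARSE_WEIGHT
    k=60,  # stand-in for RRF_K
)
print([doc_id for doc_id, _ in fused])  # → ['b', 'a', 'd', 'c']
```

Because only ranks enter the formula, the three retrievers do not need comparable score scales, which is the usual reason for choosing RRF over a weighted sum of raw scores.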
web/
├── app.py                      # FastAPI application + SPA serving
├── dependencies.py             # DB pool + graph connection injection
├── models/
│   └── schemas.py              # Pydantic request/response models
├── routers/
│   ├── search.py               # POST /api/search
│   ├── graph.py                # Graph exploration endpoints
│   ├── documents.py            # Document detail endpoint
│   ├── crossrefs.py            # Cross-reference browser
│   └── synthesize.py           # LLM synthesis (SSE)
├── services/
│   ├── embedding.py            # Query embedding wrapper
│   ├── query_expansion.py      # Legal acronym/synonym expansion
│   ├── sparse_embedding.py     # BGE-M3 sparse lexical weights
│   ├── vector_search.py        # Hybrid search (dense + FTS + sparse + RRF)
│   ├── reranker.py             # Cross-encoder reranking
│   ├── graph_search.py         # FalkorDB Cypher queries
│   ├── retrieval.py            # GraphRAG fusion orchestrator
│   └── synthesis.py            # LLM answer generation
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── components/         # React UI components
        ├── hooks/              # Custom React hooks
        ├── api/client.ts       # API client + SSE helper
        └── types/index.ts      # TypeScript type definitions
Export (PostgreSQL):
pg_dump -h localhost -p 5433 -U user -d mediatech -F c -f mediatech.dump
Restore on another machine:
# Create the database first if it doesn't exist
createdb -h localhost -p 5433 -U user mediatech
pg_restore -h localhost -p 5433 -U user -d mediatech mediatech.dump
Export (FalkorDB):
# Trigger a background save
docker exec falkor redis-cli BGSAVE
# Wait a moment, then copy the dump file
docker cp falkor:/data/dump.rdb ./falkordb_dump.rdb
Restore on another machine:
# Place the dump before starting the container
cp falkordb_dump.rdb /path/to/your/falkordb/data/dump.rdb
docker compose up -d falkor
If both machines use the same docker-compose.yml, the Docker volumes can be archived and copied directly:
Export:
docker compose stop
# Volume mountpoints are usually readable by root only, hence sudo
sudo tar -czf frenchadmin_data.tar.gz \
$(docker volume inspect frenchadmin_pg_data --format '{{.Mountpoint}}') \
$(docker volume inspect frenchadmin_falkor_data --format '{{.Mountpoint}}')
docker compose start
Restore:
# Transfer frenchadmin_data.tar.gz to target machine, then:
docker compose stop
sudo tar -xzf frenchadmin_data.tar.gz -C /
docker compose up -d
Tip: The project also provides scripts/backup.sh and scripts/restore.sh, which automate PostgreSQL volume backup and restoration.
This project is licensed under the MIT License.