TaxFrance

📝 Description

FrenchAdmin processes French tax data for RAG and GraphRAG applications. The data covers tax law (CGI and annexes, Assemblée Nationale), doctrine (BOFiP, Bercy), and case law (JADE, Conseil d'État). The pipeline downloads, processes, embeds, and stores the data in PostgreSQL with PgVector for vector search, and in FalkorDB for knowledge graph relationships.

Key capabilities:

- LEGI: French legislative texts (Code Général des Impôts, LPF, etc.)
- JADE: Judicial decisions from French courts
- BOFiP: Tax guidance documents (Bulletin Officiel des Finances Publiques)
- Cross-reference inference: Automatic linking between JADE/BOFiP and LEGI articles for RAG and GraphRAG

💡 Get Started

</> Use local CLI

Installing Dependencies

Install the required apt dependencies:

sudo apt-get update
sudo apt-get install -y $(cat config/requirements-apt-container.txt)

Create and activate a virtual environment:

python3 -m venv .venv  # Create the virtual environment
source .venv/bin/activate  # Activate the virtual environment

Install the required python dependencies:
```
pip install -e .
```

Installing in development mode (-e) allows you to use the mediatech command and modify the code without reinstalling.

Note: Make sure your environment is properly configured before continuing.

Database Configuration (PostgreSQL + FalkorDB)

Set up the environment variables in a .env file based on the example in .env.example.
Export the .env variables into the current shell (this one-liner assumes the values contain no spaces):
```
export $(grep -v '^#' .env | xargs)
```
Start the containers with Docker:
```
docker compose up -d
```
Verify containers are running:
```
docker ps
```
You should see:
- pg - PostgreSQL with PgVector (vector search)
- falkor - FalkorDB (graph database)

Downloading, Processing and Uploading Data

IMPORTANT: Clean config/data_history.json

Using the `mediatech` Command

After installation, the mediatech command is available in the active virtual environment and replaces python main.py. If you encounter issues with the mediatech command, you can still use python main.py instead.

main.py is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.
You can use it as follows:

mediatech <command> [options]

or

python main.py <command> [options]

Command examples:

View help:
```
mediatech --help
```

Create PostgreSQL tables:

mediatech create_tables --model BAAI/bge-m3

Download all files listed in data_config.json:
```
mediatech download_files --all
```
Download files from the legi source:
```
mediatech download_files --source legi
```

Download and process all files listed in data_config.json:

mediatech download_and_process_files --all --model BAAI/bge-m3

Process all data:

mediatech process_files --all --model BAAI/bge-m3

Split a table into subtables based on different criteria (see main.py):
```
mediatech split_table --source legi
```

Export PostgreSQL tables to parquet files:

mediatech export_tables --output data/parquet

Upload parquet datasets to the Hugging Face repository:

mediatech upload_dataset --input data/parquet/bofip.parquet --dataset-name bofip

Add full-text search columns for hybrid search (only needed for databases created before this feature — fresh installs get it via create_tables):
```
mediatech add_fts_columns
```
Generate BGE-M3 sparse embeddings for 3-way retrieval fusion (required after process_files):
```
mediatech add_sparse_embeddings
```

Run mediatech --help in your terminal to see all available options, or check the code directly in main.py.

Alternative Usage with `python main.py`

If you prefer to use the Python script directly, you can always use:

python main.py <command> [options]

Examples:

python main.py download_files
python main.py create_tables --model BAAI/bge-m3
python main.py process_files --all --model BAAI/bge-m3

Hugging Face download

pip install -U hf_transfer
pip install -U huggingface_hub

Performance and Optimization Flags

The processing pipeline now exposes optimization switches via environment variables:

export ENABLE_BATCH_EMBEDDING=true
export ENABLE_FAST_DB_INSERT=true
export ENABLE_BATCH_GRAPH_UPSERT=true
export ENABLE_PARALLEL_PROCESSING=false
export ENABLE_PERF_TELEMETRY=true

Tuning variables:

export EMBEDDING_BATCH_MAX_SIZE=64
export FAST_DB_INSERT_PAGE_SIZE=1000
export MAX_WORKERS=4
export BATCH_SIZE_DOCS=32

When telemetry is enabled, each run writes a JSON report in data/perf_reports/.
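
As an illustration of how such switches are typically consumed, here is a minimal sketch of reading boolean and integer settings from the environment. The helper names `env_flag` and `env_int`, and the set of accepted truthy strings, are assumptions made for this example; the project's own parsing in config/ may differ.

```python
import os

# Strings treated as "enabled" (an assumption; the project may accept others).
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean switch."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def env_int(name, default, environ=None):
    """Read an integer tuning variable, falling back to a default when unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)


# Example with an explicit mapping instead of the real process environment.
settings = {"ENABLE_BATCH_EMBEDDING": "true", "MAX_WORKERS": "4"}
batch_embedding = env_flag("ENABLE_BATCH_EMBEDDING", environ=settings)
max_workers = env_int("MAX_WORKERS", default=1, environ=settings)
```

Passing an explicit mapping keeps the helpers easy to test without touching the process environment.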

Benchmark and Regression Gate

You can run the fixed-sample benchmark helper and enforce a regression gate:

python scripts/benchmark_pipeline.py \
   --command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
   --runs 3 \
   --run-prefix process_legi \
   --reports-dir data/perf_reports

Optional baseline gate (fails if runtime degrades by more than 10%):

python scripts/benchmark_pipeline.py \
   --command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
   --runs 3 \
   --run-prefix process_legi \
   --baseline data/perf_reports/process_legi_baseline.json \
   --regression-threshold 0.10
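
To make the threshold concrete, the sketch below shows one way a 10% regression gate can be evaluated. It is not the logic of scripts/benchmark_pipeline.py: the function name `runtime_regressed` and the choice of the median over the runs are assumptions for illustration.

```python
from statistics import median


def runtime_regressed(baseline_s, run_times_s, threshold=0.10):
    """Return True when the median runtime exceeds the baseline by more than `threshold`.

    `threshold` is a relative slowdown, so 0.10 means "more than 10% slower".
    """
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    if not run_times_s:
        raise ValueError("at least one run is required")
    current_s = median(run_times_s)
    return (current_s - baseline_s) / baseline_s > threshold


# Three runs against a 100-second baseline: the median (109 s) is 9% slower,
# so the gate passes.
gate_failed = runtime_regressed(100.0, [104.0, 109.0, 131.0])
```

Using the median rather than the mean keeps a single slow outlier run from failing the gate on its own.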

Using the `update.sh` Script

The update.sh script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.
To run it, execute the following command from the project root:

./scripts/update.sh

This script will:

1. Wait for the PostgreSQL database to be available.
2. Create or update the necessary tables in the PostgreSQL database.
3. Download public files listed in data_config.json.
4. Process and vectorize the data.
5. Export the tables in Parquet format.
6. Upload the Parquet files to Hugging Face.
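
The first step, waiting for the database, can be sketched in Python as a simple TCP readiness poll. This is only an illustration of the idea, not what update.sh does internally; `wait_for_port` is a hypothetical helper, and the host and port in the example come from the defaults listed in the Configuration section.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_port("localhost", 5433, timeout_s=60.0)
    print("PostgreSQL port reachable" if ready else "Timed out waiting for PostgreSQL")
```

An open port only shows that the server accepts connections; a stricter check would use PostgreSQL's `pg_isready` utility or run a trivial query.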

🗂️ Project Structure

main.py: Main entry point with CLI for pipeline commands.
pyproject.toml: Python project and dependency configuration.
Dockerfile: Docker image for containerized execution, installs system dependencies and project packages.
docker-compose.yml: Multi-container setup: PostgreSQL (PgVector) + FalkorDB.
.github/: GitHub Actions workflows for CI/CD.
download_and_processing/: Scripts to download and extract files from DILA (LEGI, JADE) and data.economie.gouv.fr (BOFiP).
database/: Database management (table creation, data insertion, FalkorDB graph operations).
docs/: Documentation and tutorials.
- docs/hugging_face_rag_tutorial.ipynb: RAG tutorial: How to load datasets from Hugging Face and use them in a RAG pipeline?
- docs/reconstruct_vector_database.ipynb: Tutorial: How to reconstruct a dataset without chunking and embedding from parquet files?
- docs/fr/: French translations of documentation.
utils/: Shared utilities (chunking, embedding, HuggingFace, telemetry).
config/: Project configuration (data sources, embedding models, optimization flags).
logs/: Log files from script execution.
scripts/: Shell scripts for pipeline automation.
- scripts/update.sh: Run the entire data processing pipeline.
- scripts/periodic_update.sh: Automate pipeline via cron.
- scripts/backup.sh: Back up PostgreSQL volume and config files.
- scripts/restore.sh: Restore PostgreSQL volume and config.
- scripts/initial_deployment.sh: Set up a new server (Docker, dependencies).
- scripts/containers_deployment.sh: Build and deploy Docker containers.
- scripts/delete_old_files.sh: Delete old files from logs/ and backups/ directories.
- scripts/manage_checkpoint.sh: Manage checkpoint files for processing.
- scripts/write_tchap_message.sh: Send notifications to Tchap (French government chat).
CROSSREFERENCE.md: Technical specification for JADE/BOFiP → LEGI cross-reference inference (RAG/graphRAG).

CROSS REFERENCE details

Files created/modified:


┌─────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ File                                │ Purpose                                                                                   │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/cross_reference_manage.py  │ 3 new tables, catalog refresh, mention/edge CRUD                                          │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/__init__.py                │ Added cross-reference exports                                                             │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/database_manage.py         │ Wired create_cross_reference_tables() + init_graph_schema() into create_all_tables        │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/graph_manage.py            │ inject_cross_reference_edges() for APPLIES_TO/INTERPRETS                                  │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/__init__.py          │ Package exports                                                                           │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/normalizer.py        │ Primary + loose article number normalization                                              │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/alias_detector.py    │ CGI/LPF/CIBS family alias detection                                                       │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/extractor.py         │ Article token extraction with enumeration support                                         │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/resolver.py          │ Full cascade: A→B→C→D→E                                                                   │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/fuzzy_resolver.py    │ rapidfuzz scoped fallback                                                                 │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/semantic_resolver.py │ Cosine-distance semantic fallback                                                         │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/confidence.py        │ Confidence scoring with adjustments                                                       │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/pipeline.py          │ Orchestrator: catalog refresh → doc aggregation → extraction → resolution → edges → graph │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ main.py                             │ Added infer_crossreferences CLI command                                                   │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ pyproject.toml                      │ Added rapidfuzz, crossreference* to packages                                              │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Usage:

python main.py infer_crossreferences --source all
python main.py infer_crossreferences --source jade
python main.py infer_crossreferences --source bofip --debug

Resolution cascade per mention:
A. Exact normalized + temporal + family → 0.99
B. Loose key fallback → 0.92
C. Family-prior deterministic → 0.84
D. Fuzzy scoped (rapidfuzz, ≥96) → 0.74
E. Semantic scoped (cosine <=>) → 0.62
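
The cascade above is a first-match-wins chain: each stage is tried in order, and the first one that resolves the mention determines the base confidence. The sketch below illustrates that control flow only. The `Resolution` type, the `resolve_mention` function, the toy catalog, and the two stand-in resolvers are hypothetical and do not mirror the code in crossreference/resolver.py, which also applies confidence adjustments.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Resolution:
    article_id: str
    stage: str
    confidence: float


# A resolver maps a mention to an article id, or None when it cannot decide.
Resolver = Callable[[str], Optional[str]]


def resolve_mention(
    mention: str, stages: Sequence[Tuple[str, float, Resolver]]
) -> Optional[Resolution]:
    """Try each stage in order and return the first successful resolution."""
    for stage, confidence, resolver in stages:
        article_id = resolver(mention)
        if article_id is not None:
            return Resolution(article_id, stage, confidence)
    return None


# Toy catalog and stand-ins for stages A and B (illustrative only).
catalog = {"200": "ART_200", "200bis": "ART_200_BIS"}
stages = [
    ("A", 0.99, lambda m: catalog.get(m)),                           # exact key
    ("B", 0.92, lambda m: catalog.get(m.replace(" ", "").lower())),  # loose key
]

result = resolve_mention("200 bis", stages)  # misses stage A, resolved by stage B
```

Ordering the stages from most to least precise means a cheap exact match is never overridden by a fuzzier, lower-confidence one.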

GraphRAG Web Interface

The project includes a full GraphRAG frontend for querying the knowledge graph, performing vector search, and generating LLM-synthesized answers.

Architecture

[React Frontend (Vite + TypeScript + Tailwind)]
        |
        v  (REST + SSE)
[FastAPI Backend]
        |
   +----+----+
   |         |
   v         v
[PostgreSQL] [FalkorDB]
 (pgvector)   (Cypher)

Backend: FastAPI serving REST endpoints + SSE streaming for LLM synthesis
Frontend: React with react-force-graph-2d for interactive graph visualization
LLM: OpenRouter (configurable model) for RAG answer generation

Features

| Feature | Description |
| --- | --- |
| Natural language search | Cosine similarity search across LEGI, JADE, BOFiP with graph augmentation |
| Graph Explorer | Interactive force-directed visualization of the knowledge graph |
| Cross-Reference Browser | Paginated table of inferred JADE/BOFiP → LEGI links with confidence scores |
| Document Detail | Full document view with metadata, chunks, and cross-references |
| LLM Synthesis | Streaming answers grounded in retrieved documents via OpenRouter |

Prerequisites

PostgreSQL + FalkorDB running (via docker compose up -d)
Node.js 18+ (for frontend build)
Python 3.10+ with project dependencies installed

Fresh install pipeline (run in order)

mediatech create_tables --model BAAI/bge-m3          # Creates tables + FTS columns/triggers + sparse column
mediatech download_and_process_files --all --model BAAI/bge-m3  # Download, chunk, embed, insert
mediatech add_sparse_embeddings                       # Generate BGE-M3 sparse lexical weights
mediatech infer_crossreferences --source all --model BAAI/bge-m3  # Link JADE/BOFiP → LEGI

Note: add_fts_columns is NOT needed on fresh installs — the FTS trigger is created by create_tables. Only run it on databases that predate the hybrid search feature.

Note: add_sparse_embeddings must be run after data is loaded. There is no trigger for sparse embeddings (encoding requires the FlagEmbedding model).
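
For intuition about what these sparse embeddings are used for, the Query Flow section describes scoring by key overlap and dot product. A minimal sketch of that scoring follows, assuming each sparse embedding is a token-to-weight mapping; the function name `sparse_score` and the example tokens are made up for illustration, and the real scoring runs against JSONB columns in PostgreSQL.

```python
def sparse_score(query_weights, doc_weights):
    """Dot product restricted to the tokens present in both sparse vectors."""
    shared_tokens = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[t] * doc_weights[t] for t in shared_tokens)


# Hypothetical token weights for a query and a document chunk.
query = {"impot": 0.5, "revenu": 0.25}
chunk = {"impot": 0.5, "societe": 1.0}

score = sparse_score(query, chunk)  # only "impot" overlaps: 0.5 * 0.5
```

Tokens that appear on only one side contribute nothing, which is why a document with no lexical overlap scores zero regardless of its weights.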

Setup

Option 1: Docker (recommended for production)

# Set required environment variables in .env:
#   API_URL=https://openrouter.ai/api/v1
#   API_KEY=your_openrouter_api_key
#   LLM_MODEL=openrouter/hunter-alpha

docker compose up --build

The web interface will be available at http://localhost:8080.

Option 2: Local development

1. Install Python dependencies:

pip install -e .

This installs FastAPI, Uvicorn, SSE-Starlette, and all other dependencies.

2. Install frontend dependencies and build:

cd web/frontend
npm install
npm run build
cd ../..

3. Start the backend (serves both API and built frontend):

uvicorn web.app:app --reload --port 8080

The app is available at http://localhost:8080.

4. (Optional) Frontend dev mode with hot-reload:

# Terminal 1: backend
uvicorn web.app:app --reload --port 8080

# Terminal 2: frontend dev server (proxies /api to backend)
cd web/frontend
npm run dev

Frontend dev server runs at http://localhost:5173 with API proxy to port 8080.

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/api/search` | GraphRAG search (vector + graph fusion) |
| `POST` | `/api/graph/neighbors` | Get N-hop graph neighborhood |
| `GET` | `/api/graph/context/{doc_id}` | Get all relationships for a document |
| `GET` | `/api/graph/subgraph?doc_ids=...` | Get subgraph connecting documents |
| `GET` | `/api/documents/{source_type}/{doc_id}` | Full document detail |
| `GET` | `/api/crossrefs` | Paginated cross-reference listing |
| `POST` | `/api/synthesize` | LLM RAG synthesis (SSE streaming) |
| `GET` | `/api/health` | Health check |

Search Request Example

curl -X POST http://localhost:8080/api/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "conditions application article 200 CGI",
    "source_types": ["legi", "jade"],
    "top_k": 10,
    "min_confidence": 0.6
  }'
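
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above: the helper names `build_search_request` and `search` are assumptions, and the shape of the JSON response is not documented here, so the result is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request


def build_search_request(base_url, query, source_types, top_k=10, min_confidence=0.6):
    """Build the POST /api/search request with the same fields as the curl example."""
    payload = {
        "query": query,
        "source_types": source_types,
        "top_k": top_k,
        "min_confidence": min_confidence,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url, query, source_types, **kwargs):
    """Send the request and return the decoded JSON body."""
    request = build_search_request(base_url, query, source_types, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    results = search(
        "http://localhost:8080",
        "conditions application article 200 CGI",
        ["legi", "jade"],
    )
    print(json.dumps(results, indent=2, ensure_ascii=False))
```

Separating request construction from sending makes the payload easy to inspect or test without a running backend.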

Configuration

The GraphRAG web interface uses the same environment variables as the rest of the project:

| Variable | Default | Description |
| --- | --- | --- |
| `POSTGRES_HOST` | `localhost` | PostgreSQL host |
| `POSTGRES_PORT` | `5433` | PostgreSQL port |
| `POSTGRES_DB` | `mediatech` | Database name |
| `POSTGRES_USER` | `user` | Database user |
| `POSTGRES_PASSWORD` | `password` | Database password |
| `FALKORDB_HOST` | `localhost` | FalkorDB host |
| `FALKORDB_PORT` | `6379` | FalkorDB port |
| `FALKORDB_GRAPH_NAME` | `frenchadmin` | Graph name |
| `EMBEDDING_MODEL` | `BAAI/bge-m3` | Embedding model (1024-dim) |
| `API_URL` | `https://openrouter.ai/api/v1` | LLM API base URL |
| `API_KEY` | (none) | OpenRouter API key |
| `LLM_MODEL` | `openrouter/hunter-alpha` | LLM model for synthesis |
| `WEB_PORT` | `8080` | Web server port (Docker) |
| `RERANKER_MODEL` | `BAAI/bge-reranker-v2-m3` | Cross-encoder for relevance reranking |
| `RERANKER_MAX_LENGTH` | `1024` | Max tokens the reranker sees per document |
| `RERANKER_MIN_SCORE` | `0.01` | Minimum reranker score to keep a result |
| `ENABLE_HYBRID_SEARCH` | `true` | Combine vector + full-text + sparse search via RRF |
| `RRF_K` | `60` | RRF smoothing constant |
| `FTS_WEIGHT` | `1.0` | Relative weight of FTS vs vector in RRF |
| `FTS_MODE` | `auto` | FTS strategy: `auto` (AND for ≤3 words, OR for 4+), `and`, `or` |
| `ENABLE_QUERY_EXPANSION` | `true` | Expand legal acronyms and synonyms before search |
| `ENABLE_SPARSE_SEARCH` | `true` | BGE-M3 sparse retrieval (requires `add_sparse_embeddings`) |
| `SPARSE_WEIGHT` | `1.0` | Relative weight of sparse retrieval in RRF |
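
The `FTS_MODE` rule in the table can be read as a small decision function. The sketch below encodes only what the table states (AND for queries of up to three words, OR for four or more, with `and` and `or` forcing a strategy); the function name `fts_operator` and whitespace-based word counting are assumptions, not the project's implementation.

```python
def fts_operator(query, mode="auto"):
    """Pick the boolean operator used to combine full-text search terms."""
    mode = mode.lower()
    if mode in ("and", "or"):
        return mode  # explicit override
    if mode != "auto":
        raise ValueError(f"unsupported FTS_MODE: {mode!r}")
    # Short queries stay strict (AND); longer ones are relaxed (OR) to keep recall.
    return "and" if len(query.split()) <= 3 else "or"


short_choice = fts_operator("article 200 CGI")                        # 3 words
long_choice = fts_operator("conditions application article 200 CGI")  # 5 words
```

Requiring every term of a long natural-language question tends to return nothing, which is the usual motivation for switching to OR as queries grow.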

Query Flow

1. User enters a natural language query in French.
2. Query expansion: legal acronyms (EURL→"entreprise unipersonnelle…") and synonyms (dirigeant→gérant) are appended.
3. Three-way hybrid retrieval (oversampled 4×):
   - Dense: query embedded via BAAI/bge-m3 (1024-dim) → cosine similarity search via pgvector HNSW
   - FTS: adaptive AND/OR plainto_tsquery('french', ...) on title-weighted tsvector GIN indexes
   - Sparse: BGE-M3 learned lexical weights → JSONB key overlap + dot product scoring
   - All three result lists are fused using Reciprocal Rank Fusion (RRF)
4. Top results are expanded via FalkorDB graph traversal (APPLIES_TO, INTERPRETS, REFERENCES edges).
5. Cross-encoder reranking (BAAI/bge-reranker-v2-m3) assigns sigmoid-normalized relevance scores.
6. Results below RERANKER_MIN_SCORE are filtered out.
7. (Optional) Retrieved context is sent to the LLM for a synthesized answer with source citations.
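
The fusion step can be sketched as follows. Standard Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the lists in which it appears, with k corresponding to `RRF_K`. The per-list weights below are an assumption about how `FTS_WEIGHT` and `SPARSE_WEIGHT` might enter the formula, and `rrf_fuse` is a hypothetical name rather than the function in web/services/vector_search.py.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def rrf_fuse(
    ranked_lists: Dict[str, List[str]],
    weights: Optional[Dict[str, float]] = None,
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    weights = weights or {}
    scores = defaultdict(float)
    for name, doc_ids in ranked_lists.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(doc_ids, start=1):  # ranks start at 1
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the three retrievers.
fused = rrf_fuse(
    {
        "dense": ["a", "b", "c"],
        "fts": ["b", "a"],
        "sparse": ["b", "d"],
    },
    weights={"fts": 1.0, "sparse": 1.0},
    k=60,
)
top_doc = fused[0][0]  # "b": ranked first or second in all three lists
```

Because RRF uses ranks rather than raw scores, the dense, FTS, and sparse retrievers can be combined without normalizing their very different score scales.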

Project Structure (web/)

web/
├── app.py                    # FastAPI application + SPA serving
├── dependencies.py           # DB pool + graph connection injection
├── models/
│   └── schemas.py            # Pydantic request/response models
├── routers/
│   ├── search.py             # POST /api/search
│   ├── graph.py              # Graph exploration endpoints
│   ├── documents.py          # Document detail endpoint
│   ├── crossrefs.py          # Cross-reference browser
│   └── synthesize.py         # LLM synthesis (SSE)
├── services/
│   ├── embedding.py          # Query embedding wrapper
│   ├── query_expansion.py    # Legal acronym/synonym expansion
│   ├── sparse_embedding.py   # BGE-M3 sparse lexical weights
│   ├── vector_search.py      # Hybrid search (dense + FTS + sparse + RRF)
│   ├── reranker.py           # Cross-encoder reranking
│   ├── graph_search.py       # FalkorDB Cypher queries
│   ├── retrieval.py          # GraphRAG fusion orchestrator
│   └── synthesis.py          # LLM answer generation
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── components/       # React UI components
        ├── hooks/            # Custom React hooks
        ├── api/client.ts     # API client + SSE helper
        └── types/index.ts    # TypeScript type definitions

💾 Exporting & Restoring the Database

PostgreSQL (pg_dump)

Export:

pg_dump -h localhost -p 5433 -U user -d mediatech -F c -f mediatech.dump

Restore on another machine:

# Create the database first if it doesn't exist
createdb -h localhost -p 5433 -U user mediatech
pg_restore -h localhost -p 5433 -U user -d mediatech mediatech.dump

FalkorDB (Redis RDB snapshot)

Export:

# Trigger a background save
docker exec falkor redis-cli BGSAVE
# Wait a moment, then copy the dump file
docker cp falkor:/data/dump.rdb ./falkordb_dump.rdb

Restore on another machine:

# Place the dump before starting the container
cp falkordb_dump.rdb /path/to/your/falkordb/data/dump.rdb
docker compose up -d falkor

Docker volume copy (all-in-one)

If both machines use the same docker-compose.yml:

Export:

docker compose stop
sudo tar -czf frenchadmin_data.tar.gz \
  $(docker volume inspect frenchadmin_pg_data --format '{{.Mountpoint}}') \
  $(docker volume inspect frenchadmin_falkor_data --format '{{.Mountpoint}}')
docker compose start

Restore:

# Transfer frenchadmin_data.tar.gz to target machine, then:
docker compose stop
sudo tar -xzf frenchadmin_data.tar.gz -C /
docker compose up -d

Tip: The project also provides scripts/backup.sh and scripts/restore.sh which automate PostgreSQL volume backup and restoration.

⚖️ License

This project is licensed under the MIT License.
