yawo/frenchadmin

 
 

TaxFrance

📝 Description

FrenchAdmin processes French tax data and builds a GraphRAG knowledge base for AI applications. The data covers tax law (CGI and its annexes, Assemblée Nationale), doctrine (BOFiP, Bercy) and case law (JADE, Conseil d'État). The pipeline downloads, processes, embeds, and stores the data in PostgreSQL with PgVector for vector search, and in FalkorDB for knowledge graph relationships.

Key capabilities:

  • LEGI: French legislative texts (Code Général des Impôts, LPF, etc.)
  • JADE: Judicial decisions from French courts
  • BOFiP: Tax guidance documents (Bulletin Officiel des Finances Publiques)
  • Cross-reference inference: Automatic linking between JADE/BOFiP and LEGI articles for RAG and graphRAG

💡 Get Started

</> Use local CLI

Installing Dependencies

  1. Install the required apt dependencies:

    sudo apt-get update
    sudo apt-get install -y $(cat config/requirements-apt-container.txt)
  2. Create and activate a virtual environment:

    python3 -m venv .venv  # Create the virtual environment
    source .venv/bin/activate  # Activate the virtual environment
  3. Install the required python dependencies:

    pip install -e .

Installing in development mode (-e) allows you to use the mediatech command and modify the code without reinstalling.

Note: Make sure your environment is properly configured before continuing.

Database Configuration (PostgreSQL + FalkorDB)

  1. Set up the environment variables in a .env file based on the example in .env.example.

  2. Export the .env variables:

    export $(grep -v '^#' .env | xargs)
  3. Start the containers with Docker:

    docker compose up -d
  4. Verify containers are running:

    docker ps

    You should see:

    • pg - PostgreSQL with PgVector (vector search)
    • falkor - FalkorDB (graph database)

Downloading, Processing and Uploading Data

IMPORTANT: Clean config/data_history.json

Using the mediatech Command

After installation, the mediatech command is available globally and replaces python main.py.

If you encounter issues with the mediatech command, you can still use python main.py instead.

The main.py file is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.
You can use it as follows:

mediatech <command> [options]

or

python main.py <command> [options]

Command examples:

  • View help:
    mediatech --help
  • Create PostgreSQL tables:
    mediatech create_tables --model BAAI/bge-m3
  • Download all files listed in data_config.json:
    mediatech download_files --all
  • Download files from the legi source:
    mediatech download_files --source legi
  • Download and process all files listed in data_config.json:
    mediatech download_and_process_files --all --model BAAI/bge-m3
  • Process all data:
    mediatech process_files --all --model BAAI/bge-m3
  • Split a table into subtables based on different criteria (see main.py):
    mediatech split_table --source legi
  • Export PostgreSQL tables to parquet files:
    mediatech export_tables --output data/parquet
  • Upload parquet datasets to the Hugging Face repository:
    mediatech upload_dataset --input data/parquet/bofip.parquet --dataset-name bofip
  • Add full-text search columns for hybrid search (only needed for databases created before this feature — fresh installs get it via create_tables):
    mediatech add_fts_columns
  • Generate BGE-M3 sparse embeddings for 3-way retrieval fusion (required after process_files):
    mediatech add_sparse_embeddings

Run mediatech --help in your terminal to see all available options, or check the code directly in main.py.

Alternative Usage with python main.py

If you prefer to use the Python script directly, you can always use:

python main.py <command> [options]

Examples:

python main.py download_files
python main.py create_tables --model BAAI/bge-m3
python main.py process_files --all --model BAAI/bge-m3

Hugging Face download

pip install -U hf_transfer
pip install -U huggingface_hub



Performance and Optimization Flags

The processing pipeline exposes optimization switches via environment variables:

export ENABLE_BATCH_EMBEDDING=true
export ENABLE_FAST_DB_INSERT=true
export ENABLE_BATCH_GRAPH_UPSERT=true
export ENABLE_PARALLEL_PROCESSING=false
export ENABLE_PERF_TELEMETRY=true

Tuning variables:

export EMBEDDING_BATCH_MAX_SIZE=64
export FAST_DB_INSERT_PAGE_SIZE=1000
export MAX_WORKERS=4
export BATCH_SIZE_DOCS=32

When telemetry is enabled, each run writes a JSON report in data/perf_reports/.
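
As a rough illustration of how such switches could be read on the Python side, here is a minimal sketch. The helper names `env_flag` and `env_int` are hypothetical and not taken from the project, and the accepted truthy spellings are an assumption; only the variable names and example values come from the lists above.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as "enabled"; anything else is "disabled".
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read an integer tuning variable, falling back to the default if unset or invalid."""
    raw = os.environ.get(name)
    try:
        return int(raw) if raw is not None else default
    except ValueError:
        return default


if __name__ == "__main__":
    print("batch embedding:", env_flag("ENABLE_BATCH_EMBEDDING"))
    print("embedding batch size:", env_int("EMBEDDING_BATCH_MAX_SIZE", 64))
```

Falling back to a default on an invalid integer keeps a typo in .env from crashing a long run; a stricter variant could raise instead.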

Benchmark and Regression Gate

You can run the fixed-sample benchmark helper and enforce a regression gate:

python scripts/benchmark_pipeline.py \
   --command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
   --runs 3 \
   --run-prefix process_legi \
   --reports-dir data/perf_reports

Optional baseline gate (fails if runtime degrades by more than 10%):

python scripts/benchmark_pipeline.py \
   --command "python main.py process_files --source legi --model louisbrulenaudet/lemone-embed-pro" \
   --runs 3 \
   --run-prefix process_legi \
   --baseline data/perf_reports/process_legi_baseline.json \
   --regression-threshold 0.10
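
The arithmetic behind such a gate can be sketched as follows. This is a minimal illustration, assuming the gate compares the mean runtime of the runs against a single baseline runtime; the function name and the use of the mean are assumptions, not a description of what scripts/benchmark_pipeline.py actually does.

```python
from statistics import mean


def regression_exceeded(run_seconds: list[float], baseline_seconds: float,
                        threshold: float = 0.10) -> bool:
    """Return True when the mean runtime is more than `threshold` slower than the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    # Relative slowdown: 0.15 means the runs were 15% slower than the baseline.
    slowdown = (mean(run_seconds) - baseline_seconds) / baseline_seconds
    return slowdown > threshold


if __name__ == "__main__":
    # Three runs averaging 115 s against a 100 s baseline: 15% slower than baseline.
    print(regression_exceeded([112.0, 115.0, 118.0], baseline_seconds=100.0))
```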

Using the update.sh Script

The update.sh script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.
To run it, execute the following command from the project root:

./scripts/update.sh

This script will:

  • Wait for the PostgreSQL database to be available,
  • Create or update the necessary tables in the PostgreSQL database,
  • Download public files listed in data_config.json,
  • Process and vectorize the data,
  • Export the tables in Parquet format,
  • Upload the Parquet files to Hugging Face.
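
The first step, waiting for the database, can be approximated in Python as shown below. This is a sketch under assumptions: `wait_for_port` is a hypothetical helper, not part of the project, and it only checks that the TCP port accepts connections, which is weaker than checking that PostgreSQL is ready to serve queries (the `pg_isready` utility shipped with PostgreSQL does the latter).

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on host:port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False


# Example call (port 5433 is the POSTGRES_PORT default used in this README):
#   ready = wait_for_port("localhost", 5433, timeout=60.0)
```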

🗂️ Project Structure

Cross-Reference Details

Files created/modified:


┌─────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ File                                │ Purpose                                                                                   │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/cross_reference_manage.py  │ 3 new tables, catalog refresh, mention/edge CRUD                                          │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/__init__.py                │ Added cross-reference exports                                                             │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/database_manage.py         │ Wired create_cross_reference_tables() + init_graph_schema() into create_all_tables        │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ database/graph_manage.py            │ inject_cross_reference_edges() for APPLIES_TO/INTERPRETS                                  │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/__init__.py          │ Package exports                                                                           │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/normalizer.py        │ Primary + loose article number normalization                                              │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/alias_detector.py    │ CGI/LPF/CIBS family alias detection                                                       │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/extractor.py         │ Article token extraction with enumeration support                                         │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/resolver.py          │ Full cascade: A→B→C→D→E                                                                   │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/fuzzy_resolver.py    │ rapidfuzz scoped fallback                                                                 │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/semantic_resolver.py │ Cosine-distance semantic fallback                                                         │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/confidence.py        │ Confidence scoring with adjustments                                                       │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ crossreference/pipeline.py          │ Orchestrator: catalog refresh → doc aggregation → extraction → resolution → edges → graph │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ main.py                             │ Added infer_crossreferences CLI command                                                   │
├─────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ pyproject.toml                      │ Added rapidfuzz, crossreference* to packages                                              │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Usage:

python main.py infer_crossreferences --source all
python main.py infer_crossreferences --source jade
python main.py infer_crossreferences --source bofip --debug

Resolution cascade per mention:
A. Exact normalized + temporal + family → 0.99
B. Loose key fallback → 0.92
C. Family-prior deterministic → 0.84
D. Fuzzy scoped (rapidfuzz, ≥96) → 0.74
E. Semantic scoped (cosine <=>) → 0.62
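
The cascade above amounts to "try each strategy in order and stop at the first hit". The sketch below illustrates that control flow only. The resolver callables are hypothetical placeholders, the returned identifiers are made up, and the confidence adjustments handled by crossreference/confidence.py are deliberately left out; only the step letters and base confidences come from the list above.

```python
from typing import Callable, Optional

# Base confidence per cascade step, as listed above.
CASCADE_CONFIDENCE = {"A": 0.99, "B": 0.92, "C": 0.84, "D": 0.74, "E": 0.62}

# A resolver maps a mention string to a LEGI article id, or None if it cannot resolve it.
Resolver = Callable[[str], Optional[str]]


def resolve_mention(mention: str,
                    resolvers: dict[str, Resolver]) -> Optional[tuple[str, str, float]]:
    """Try steps A..E in order; return (article_id, step, confidence) for the first hit."""
    for step in ("A", "B", "C", "D", "E"):
        resolver = resolvers.get(step)
        if resolver is None:
            continue  # Step not configured in this toy setup.
        article_id = resolver(mention)
        if article_id is not None:
            return article_id, step, CASCADE_CONFIDENCE[step]
    return None


if __name__ == "__main__":
    # Toy resolvers: the exact lookup (A) fails, the loose lookup (B) succeeds.
    toy = {
        "A": lambda m: None,
        "B": lambda m: "LEGIARTI_EXAMPLE" if "200" in m else None,
    }
    print(resolve_mention("article 200 du CGI", toy))
```

Ordering the steps from most to least precise means a cheap, high-confidence match short-circuits the more expensive fuzzy and semantic fallbacks.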

GraphRAG Web Interface

The project includes a full GraphRAG frontend for querying the knowledge graph, performing vector search, and generating LLM-synthesized answers.

Architecture

[React Frontend (Vite + TypeScript + Tailwind)]
        |
        v  (REST + SSE)
[FastAPI Backend]
        |
   +----+----+
   |         |
   v         v
[PostgreSQL] [FalkorDB]
 (pgvector)   (Cypher)

  • Backend: FastAPI serving REST endpoints + SSE streaming for LLM synthesis
  • Frontend: React with react-force-graph-2d for interactive graph visualization
  • LLM: OpenRouter (configurable model) for RAG answer generation

Features

Feature                   Description
Natural language search   Cosine similarity search across LEGI, JADE, BOFiP with graph augmentation
Graph Explorer            Interactive force-directed visualization of the knowledge graph
Cross-Reference Browser   Paginated table of inferred JADE/BOFiP → LEGI links with confidence scores
Document Detail           Full document view with metadata, chunks, and cross-references
LLM Synthesis             Streaming answers grounded in retrieved documents via OpenRouter

Prerequisites

  • PostgreSQL + FalkorDB running (via docker compose up -d)
  • Node.js 18+ (for frontend build)
  • Python 3.10+ with project dependencies installed

Fresh install pipeline (run in order)

mediatech create_tables --model BAAI/bge-m3          # Creates tables + FTS columns/triggers + sparse column
mediatech download_and_process_files --all --model BAAI/bge-m3  # Download, chunk, embed, insert
mediatech add_sparse_embeddings                       # Generate BGE-M3 sparse lexical weights
mediatech infer_crossreferences --source all --model BAAI/bge-m3  # Link JADE/BOFiP → LEGI

Note: add_fts_columns is NOT needed on fresh installs — the FTS trigger is created by create_tables. Only run it on databases that predate the hybrid search feature.

Note: add_sparse_embeddings must be run after data is loaded. There is no trigger for sparse embeddings (encoding requires the FlagEmbedding model).

Setup

Option 1: Docker (recommended for production)

# Set required environment variables in .env:
#   API_URL=https://openrouter.ai/api/v1
#   API_KEY=your_openrouter_api_key
#   LLM_MODEL=openrouter/hunter-alpha

docker compose up --build

The web interface will be available at http://localhost:8080.

Option 2: Local development

1. Install Python dependencies:

pip install -e .

This installs FastAPI, Uvicorn, SSE-Starlette, and all other dependencies.

2. Install frontend dependencies and build:

cd web/frontend
npm install
npm run build
cd ../..

3. Start the backend (serves both API and built frontend):

uvicorn web.app:app --reload --port 8080

The app is available at http://localhost:8080.

4. (Optional) Frontend dev mode with hot-reload:

# Terminal 1: backend
uvicorn web.app:app --reload --port 8080

# Terminal 2: frontend dev server (proxies /api to backend)
cd web/frontend
npm run dev

Frontend dev server runs at http://localhost:5173 with API proxy to port 8080.

API Endpoints

Method  Path                                    Description
POST    /api/search                             GraphRAG search (vector + graph fusion)
POST    /api/graph/neighbors                    Get N-hop graph neighborhood
GET     /api/graph/context/{doc_id}             Get all relationships for a document
GET     /api/graph/subgraph?doc_ids=...         Get subgraph connecting documents
GET     /api/documents/{source_type}/{doc_id}   Full document detail
GET     /api/crossrefs                          Paginated cross-reference listing
POST    /api/synthesize                         LLM RAG synthesis (SSE streaming)
GET     /api/health                             Health check

Search Request Example

curl -X POST http://localhost:8080/api/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "conditions application article 200 CGI",
    "source_types": ["legi", "jade"],
    "top_k": 10,
    "min_confidence": 0.6
  }'
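
The same request can be issued from Python using only the standard library. This is a sketch: the helper names are hypothetical, the payload fields mirror the curl example above, and the shape of the JSON response is not documented here, so the result is returned as-is.

```python
import json
import urllib.request


def build_search_request(query: str, source_types: list[str], top_k: int = 10,
                         min_confidence: float = 0.6,
                         base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST /api/search request with the same JSON body as the curl example."""
    payload = {
        "query": query,
        "source_types": source_types,
        "top_k": top_k,
        "min_confidence": min_confidence,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, source_types: list[str], **kwargs):
    """Send the request and decode the JSON response (requires the backend to be running)."""
    request = build_search_request(query, source_types, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example call, with the backend started on port 8080:
#   results = search("conditions application article 200 CGI", ["legi", "jade"])
```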

Configuration

The GraphRAG web interface uses the same environment variables as the rest of the project:

Variable                Default                       Description
POSTGRES_HOST           localhost                     PostgreSQL host
POSTGRES_PORT           5433                          PostgreSQL port
POSTGRES_DB             mediatech                     Database name
POSTGRES_USER           user                          Database user
POSTGRES_PASSWORD       password                      Database password
FALKORDB_HOST           localhost                     FalkorDB host
FALKORDB_PORT           6379                          FalkorDB port
FALKORDB_GRAPH_NAME     frenchadmin                   Graph name
EMBEDDING_MODEL         BAAI/bge-m3                   Embedding model (1024-dim)
API_URL                 https://openrouter.ai/api/v1  LLM API base URL
API_KEY                 (none)                        OpenRouter API key
LLM_MODEL               openrouter/hunter-alpha       LLM model for synthesis
WEB_PORT                8080                          Web server port (Docker)
RERANKER_MODEL          BAAI/bge-reranker-v2-m3       Cross-encoder for relevance reranking
RERANKER_MAX_LENGTH     1024                          Max tokens the reranker sees per document
RERANKER_MIN_SCORE      0.01                          Minimum reranker score to keep a result
ENABLE_HYBRID_SEARCH    true                          Combine vector + full-text + sparse search via RRF
RRF_K                   60                            RRF smoothing constant
FTS_WEIGHT              1.0                           Relative weight of FTS vs vector in RRF
FTS_MODE                auto                          FTS strategy: auto (AND for ≤3 words, OR for 4+), and, or
ENABLE_QUERY_EXPANSION  true                          Expand legal acronyms and synonyms before search
ENABLE_SPARSE_SEARCH    true                          BGE-M3 sparse retrieval (requires add_sparse_embeddings)
SPARSE_WEIGHT           1.0                           Relative weight of sparse retrieval in RRF
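
To make the FTS_MODE rule concrete, the sketch below picks the boolean operator for a query. The function name is hypothetical, and counting words by splitting on whitespace is an assumption; the project may tokenize differently or count words after query expansion.

```python
def choose_fts_operator(query: str, mode: str = "auto") -> str:
    """Return "and" or "or" following the FTS_MODE rule: auto means AND for <=3 words, OR for 4+."""
    mode = mode.lower()
    if mode in ("and", "or"):
        return mode  # Explicit modes bypass the word-count heuristic.
    if mode != "auto":
        raise ValueError(f"unsupported FTS_MODE: {mode!r}")
    # Assumption: a "word" is a whitespace-separated token.
    return "and" if len(query.split()) <= 3 else "or"


if __name__ == "__main__":
    print(choose_fts_operator("article 200 CGI"))
    print(choose_fts_operator("conditions application article 200 CGI"))
```

Short queries tend to be precise, so requiring every term keeps results tight; longer queries would often match nothing under AND, which is presumably why auto switches to OR.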

Query Flow

  1. User enters a natural language query in French
  2. Query expansion: legal acronyms (EURL→"entreprise unipersonnelle…") and synonyms (dirigeant→gérant) are appended
  3. 3-way hybrid retrieval (oversampled 4×):
    • Dense: query embedded via BAAI/bge-m3 (1024-dim) → cosine similarity search via pgvector HNSW
    • FTS: adaptive AND/OR plainto_tsquery('french', ...) on title-weighted tsvector GIN indexes
    • Sparse: BGE-M3 learned lexical weights → JSONB key overlap + dot product scoring
    • All three result lists fused using Reciprocal Rank Fusion (RRF)
  4. Top results are expanded via FalkorDB graph traversal (APPLIES_TO, INTERPRETS, REFERENCES edges)
  5. Cross-encoder reranking (BAAI/bge-reranker-v2-m3) assigns sigmoid-normalized relevance scores
  6. Results below RERANKER_MIN_SCORE are filtered out
  7. (Optional) Retrieved context is sent to LLM for synthesized answer with source citations
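
The fusion in step 3 can be illustrated with a small Reciprocal Rank Fusion sketch. It uses the standard formula, score(d) = Σ weight / (k + rank), with ranks starting at 1 and k playing the role of RRF_K. How the project applies FTS_WEIGHT and SPARSE_WEIGHT internally is not shown in this README, so the per-list multiplier here is an assumption, as are the function name and the toy document ids.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(ranked_lists: dict[str, list[str]],
                           weights: Optional[dict[str, float]] = None,
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists into one list sorted by descending RRF score."""
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for name, doc_ids in ranked_lists.items():
        weight = weights.get(name, 1.0)  # Lists without an explicit weight count fully.
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    fused = reciprocal_rank_fusion(
        {
            "dense": ["a", "b", "c"],
            "fts": ["b", "a", "d"],
            "sparse": ["b", "c", "a"],
        },
        weights={"fts": 1.0, "sparse": 1.0},
        k=60,
    )
    print([doc_id for doc_id, _ in fused])
```

Because RRF only uses ranks, the three retrievers do not need comparable score scales, which is convenient when mixing cosine similarity, ts_rank-style scores, and sparse dot products.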

Project Structure (web/)

web/
├── app.py                    # FastAPI application + SPA serving
├── dependencies.py           # DB pool + graph connection injection
├── models/
│   └── schemas.py            # Pydantic request/response models
├── routers/
│   ├── search.py             # POST /api/search
│   ├── graph.py              # Graph exploration endpoints
│   ├── documents.py          # Document detail endpoint
│   ├── crossrefs.py          # Cross-reference browser
│   └── synthesize.py         # LLM synthesis (SSE)
├── services/
│   ├── embedding.py          # Query embedding wrapper
│   ├── query_expansion.py    # Legal acronym/synonym expansion
│   ├── sparse_embedding.py   # BGE-M3 sparse lexical weights
│   ├── vector_search.py      # Hybrid search (dense + FTS + sparse + RRF)
│   ├── reranker.py           # Cross-encoder reranking
│   ├── graph_search.py       # FalkorDB Cypher queries
│   ├── retrieval.py          # GraphRAG fusion orchestrator
│   └── synthesis.py          # LLM answer generation
└── frontend/
    ├── package.json
    ├── vite.config.ts
    └── src/
        ├── components/       # React UI components
        ├── hooks/            # Custom React hooks
        ├── api/client.ts     # API client + SSE helper
        └── types/index.ts    # TypeScript type definitions

💾 Exporting & Restoring the Database

PostgreSQL (pg_dump)

Export:

pg_dump -h localhost -p 5433 -U user -d mediatech -F c -f mediatech.dump

Restore on another machine:

# Create the database first if it doesn't exist
createdb -h localhost -p 5433 -U user mediatech
pg_restore -h localhost -p 5433 -U user -d mediatech mediatech.dump

FalkorDB (Redis RDB snapshot)

Export:

# Trigger a background save
docker exec falkor redis-cli BGSAVE
# Wait a moment, then copy the dump file
docker cp falkor:/data/dump.rdb ./falkordb_dump.rdb

Restore on another machine:

# Place the dump before starting the container
cp falkordb_dump.rdb /path/to/your/falkordb/data/dump.rdb
docker compose up -d falkor

Docker volume copy (all-in-one)

If both machines use the same docker-compose.yml:

Export:

docker compose stop
# Volume mountpoints are normally owned by root, so tar usually needs sudo here
sudo tar -czf frenchadmin_data.tar.gz \
  $(docker volume inspect frenchadmin_pg_data --format '{{.Mountpoint}}') \
  $(docker volume inspect frenchadmin_falkor_data --format '{{.Mountpoint}}')
docker compose start

Restore:

# Transfer frenchadmin_data.tar.gz to target machine, then:
docker compose stop
sudo tar -xzf frenchadmin_data.tar.gz -C /
docker compose up -d

Tip: The project also provides scripts/backup.sh and scripts/restore.sh which automate PostgreSQL volume backup and restoration.

⚖️ License

This project is licensed under the MIT License.

About

Collection of public datasets from the French administration, vectorized and ready to use in AI projects.
