Named entity recognition for conversational text. Extracts 6 sync entity types from multi-turn conversations, optimized for low-latency inference across 75 languages.
| Component | State | Notes |
|---|---|---|
| Data pipeline | Complete | 5.0M train / 266K val records, 85K silver-en, 4.2K silver-ml (46 langs) |
| Model harness | Validated | Local smoke + Modal GPU + W&B e2e passed (report) |
| Autoresearch | Validated | launch.py orchestration tested, 7 bugs fixed during validation |
| Dataset loading | Optimized | Lazy sidecar index + on-demand tokenization (~3 min startup vs ~20 min eager) |
| Training | Ready | H100 pilot passed: 18-22 steps/sec, 12.8GB VRAM, buckets match targets |

The 6 sync types are extracted as BIO spans (43 labels total):

| Type | Subtypes |
|---|---|
| PERSON | — |
| PLACE | — |
| ORG | — |
| DATE | Day, Week, Month, Season, Year, Decade, Relative, Range |
| EVENT | Life, General |
| RELATIONSHIP_REF | Family, Romantic, Friend, Professional, Acquaintance |
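The 43-label figure can be reproduced from the table above under one assumption: each bare parent type (for example `DATE`) is a class of its own alongside its subtypes. That gives 21 classes, each with a `B-` and an `I-` tag, plus `O`. The sketch below is illustrative only; the `PARENT/Subtype` naming is hypothetical, and the real label strings live in `config.py`.

```python
# Canonical types and subtypes, copied from the table above.
SUBTYPES = {
    "PERSON": [],
    "PLACE": [],
    "ORG": [],
    "DATE": ["Day", "Week", "Month", "Season", "Year", "Decade", "Relative", "Range"],
    "EVENT": ["Life", "General"],
    "RELATIONSHIP_REF": ["Family", "Romantic", "Friend", "Professional", "Acquaintance"],
}


def build_bio_labels(subtypes):
    """Return ["O", "B-X", "I-X", ...] for every parent type and subtype."""
    classes = []
    for parent, subs in subtypes.items():
        # Assumption: the bare parent type is also a class of its own.
        classes.append(parent)
        # Hypothetical naming scheme for subtype classes.
        classes.extend(f"{parent}/{sub}" for sub in subs)
    labels = ["O"]
    for cls in classes:
        labels += [f"B-{cls}", f"I-{cls}"]
    return labels


LABELS = build_bio_labels(SUBTYPES)
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
print(len(LABELS))  # 43 = 21 classes * 2 tags + "O"
```

If the real scheme counts labels differently, the total of 43 still constrains it: it must yield 21 tagged classes.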

3 async types (EMOTION, GOAL, ACTIVITY) are claim-level Dreamer outputs, not span extraction.

- Backbone: mGTE 306M (`Alibaba-NLP/gte-multilingual-mlm-base`), 12 layers, 768 hidden, 8K context, 75 languages
- Head: Dropout + Linear(768, 43) token classifier
- Training: HF Trainer with MixtureSampler (75% gold / 20% silver-en / 5% silver-ml)
- Eval: Exact char-span `(type, start, end)` F1; token-level is debug only
- Inference: BF16 train → CPU INT8 + GPU FP16
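To make the eval criterion concrete, here is a minimal sketch of exact char-span F1 for a single conversation: a prediction counts only if its type, start offset, and end offset all match a gold span. The function name and the example offsets are hypothetical; this is not the code in `eval.py`.

```python
def char_span_f1(gold, pred):
    """Micro precision/recall/F1 over exact (type, start, end) matches."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical spans: the predicted DATE span ends one character early,
# so it earns no credit under exact matching.
gold = [("PERSON", 0, 5), ("DATE", 16, 25), ("PLACE", 29, 34)]
pred = [("PERSON", 0, 5), ("DATE", 16, 24), ("PLACE", 29, 34)]

precision, recall, f1 = char_span_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

When scoring a whole dataset, each span would also need a record identifier in the tuple so that identical offsets in different conversations do not collide.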

| Path | Contents |
|---|---|
| `model/` | NER model, training, eval (7 files, ~1100 LOC) |
| `research/` | Autoresearch harness (7 files, ~770 LOC) |
| `configs/` | Schema, type mappings, dataset inventory, training mix |
| `scripts/` | Data pipeline: converters, annotation, validation |
| `lib/` | Shared modules (dedup, windower, span tools, schema) |
| `task8/` | JP-RP + ChatHaruhi batch annotation pipeline |
| `task9/` | Silver data QA/validation pipeline |
| `task9_5/` | Silver quality & coverage upgrade |
| `docs/` | Design docs, task plans, research analysis |
| `tests/` | Test suite |
| `data/` | Gitignored; managed locally + HF LFS |


| File | Purpose |
|---|---|
| `config.py` | 43 BIO labels, type normalization, hyperparams, source tiers |
| `ner_model.py` | `NerModel`: AutoModel encoder + dropout + linear head |
| `ner_dataset.py` | Lazy sidecar index, on-demand tokenization, BIO alignment, collate |
| `train.py` | `NerTrainer` with token accounting, `TokenBudgetCallback`, `AutoresearchCallback` |
| `eval.py` | Char-span F1, token-level callback, REL_REF hard-neg analysis, `run_full_eval()` |
| `verify_split.py` | Train/val split leakage audit |

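Char-span evaluation needs token-level BIO tags converted back into character spans. The sketch below shows one common way to do that, assuming each token carries a `(start, end)` character offset such as a fast tokenizer's offset mapping. The function name and the handling of stray `I-` tags are assumptions, not the behavior of `ner_dataset.py` or `eval.py`.

```python
def bio_to_char_spans(tags, offsets):
    """Merge per-token BIO tags into (type, start, end) character spans."""
    spans = []
    current = None  # open span as [type, start, end]
    for tag, (start, end) in zip(tags, offsets):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], start, end]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = end  # extend the open span
        else:
            # "O", or an I- tag with no matching open span: close any open
            # span and drop the token (assumption: stray I- tags are ignored).
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


text = "Anna moved to New York"
tags = ["B-PERSON", "O", "O", "B-PLACE", "I-PLACE"]
offsets = [(0, 4), (5, 10), (11, 13), (14, 17), (18, 22)]

spans = bio_to_char_spans(tags, offsets)
print(spans)  # [('PERSON', 0, 4), ('PLACE', 14, 22)]
```

Other policies for stray `I-` tags exist, such as promoting them to `B-`; which one is used affects exact-match scores, so it is worth pinning down in the eval code.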
Autonomous hyperparameter search adapted from karpathy/autoresearch. Runs experiments on Modal H100, tracks via W&B, uses git worktree isolation.
- Editable surface: `research/train.py` (EXPERIMENT dict only, nothing else)
- Composite score: weighted REL_REF F0.5 (0.40) + macro F1 (0.25) + hard-neg precision (0.20) + multilingual (0.10) + latency (0.05)
- Budget: 3.2M tokens/experiment, 60 min Modal timeout, ~$0.85/run on H100
- Stopping: no improvement for 5 consecutive experiments
See research/README.md for the full protocol.
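As a rough illustration of the scoring and stopping rules listed above, the sketch below combines the five components with the stated weights and applies a patience-of-5 stop. It assumes every component is already normalized to [0, 1] with higher being better; how latency and the multilingual term are normalized is not specified here. The metric keys and function names are hypothetical, and `research/README.md` remains the authoritative protocol.

```python
# Weights as listed above; they sum to 1.0.
WEIGHTS = {
    "rel_ref_f05": 0.40,
    "macro_f1": 0.25,
    "hard_neg_precision": 0.20,
    "multilingual": 0.10,
    "latency": 0.05,
}


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def composite_score(metrics):
    """Weighted sum of components, each assumed to lie in [0, 1]."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def should_stop(scores, patience=5):
    """True once the last `patience` experiments all failed to beat the earlier best."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


# Hypothetical metric values for one experiment.
metrics = {
    "rel_ref_f05": f_beta(precision=0.8, recall=0.5),
    "macro_f1": 0.70,
    "hard_neg_precision": 0.90,
    "multilingual": 0.60,
    "latency": 1.00,
}
print(round(composite_score(metrics), 4))
print(should_stop([0.50, 0.60, 0.60, 0.59, 0.58, 0.60, 0.57]))  # True
```

With F0.5 carrying the largest weight, a configuration that trades REL_REF recall for precision can outscore one with a higher plain F1.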

    # Preparation (once)
    python3 -m research.prepare

    # Manual single experiment
    python3 -m research.launch

    # Autonomous loop (requires OpenProse)
    prose run research/autoresearch.prose

| Tier | Sources | Weight |
|---|---|---|
| Gold | Existing labeled data (~4.9M train / ~262K val) | 1.0 |
| T1 | mentalchat, therapy, personachat, prosocial | 0.9 |
| T2 | reddit_confessions | 0.7 |
| T3 | pippa, opencharacter, roleplay_hieu, synthetic_persona_chat | 0.5 |
| Multilingual | 46 languages, synthetic seed (4,960 convos) | 0.8 |

    # Full conversion + dedup + split
    python3 scripts/convert_all.py

    # Silver annotation (Task 9)
    python3 scripts/task9/annotate.py

    # Batch annotation (Task 8), current pipeline
    python3 scripts/task8/annotate_batch.py

Note: `scripts/task8/label_roleplay.py` is legacy (initial `provider="spark"` pipeline). The current Task 8 pipeline uses `annotate_batch.py` with GPT-5.4 and DeepSeek V3.2.

| File | Purpose |
|---|---|
| `schema.json` | JSONL record format (JSON Schema draft 2020-12) |
| `type_mapping_train.json` | Source labels → canonical types for training |
| `type_mapping_eval.json` | Source labels → canonical types for eval |
| `dataset_inventory.json` | All ingested datasets with metadata |
| `training_mix.json` | Dataset mixing configuration |
| `zero_shot_holdout_types.json` | Types excluded from training |

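`training_mix.json` is described only as the dataset mixing configuration, and its schema is not shown here. Assuming it ultimately yields the 75% gold / 20% silver-en / 5% silver-ml proportions quoted for the training sampler, the effect of such a mix can be sketched with weighted sampling over source names. The dict layout and function name below are hypothetical; this is not the repository's `MixtureSampler`.

```python
import random
from collections import Counter

# Hypothetical in-memory form of the mix; proportions taken from the
# training description (75% gold / 20% silver-en / 5% silver-ml).
MIX = {"gold": 0.75, "silver-en": 0.20, "silver-ml": 0.05}


def sample_sources(mix, n, seed=0):
    """Draw n source names with probability proportional to the mix weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(MIX, 10_000)
counts = Counter(draws)
for name in MIX:
    print(f"{name}: {counts[name] / len(draws):.3f}")
```

A real sampler would then pick a record index within the chosen source; drawing the source first keeps the small silver-ml pool from being swamped by the much larger gold set.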

    pip install -r requirements.txt

Key dependencies: torch, transformers, accelerate, wandb, datasets, huggingface_hub

- Training: Modal (H100 $3.95/hr, B200 $6.25/hr)
- Tracking: Weights & Biases (`oneiron-dev/ner-sft`)
- Inference: Vast.ai Serverless (3-region: US/EU/Asia)