FastMCP fleet server for diffusion language models — batch-speed local inference when autoregression is the wrong tool.
dLLM is the acronym du jour (after LWM, VLA). This repo catches it. DiffusionGemma first; Qwen-Diffusion and friends next. Gotta catch 'em all.
| Today (v0.1) | Next (v0.2+) |
|---|---|
| Fleet narrative, PRD, routing doctrine | dllm_generate MCP tool |
| Goliath Windows scaffold | FastMCP sidecar → llama-diffusion-cli |
| Port reservation 10834 / 10835 | Health, status, batch endpoints |
| HLE / catch-them-all assessment | Provider registry for multiple dLLMs |
Not a replacement for local-llm-mcp (autoregressive chat) or cloud Gemini. Complement — route batch and frontier-reasoning workloads here; route streaming chat there.
DiffusionGemma changed the calculus:
- ~200–400 tok/s on Goliath (RTX 4090) vs ~40–60 for 32B AR
- 11.0% on HLE with no tools, ahead of its own Gemma 4 AR twin (8.7%) on the exam CAIS built to stump AI.¹
- No streaming — 256-token blocks; wrong for chat, right for synthetic data and Ednaficator
Different runtime (llama-diffusion-cli, not Ollama). Different output shape. Different sweet spot. Own repo.
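To make the throughput bullets concrete, here is a back-of-envelope sketch. The tok/s ranges and the 256-token block size are taken from the list above; the helper names, the example batch size, and the assumption that a request is rounded up to whole blocks are illustrative only, not measured behaviour.

```python
import math

BLOCK_TOKENS = 256  # block size quoted in the list above

# Throughput ranges quoted above, in tokens per second: (low, high).
DLLM_TOK_S = (200, 400)
AR_TOK_S = (40, 60)


def blocks_needed(n_tokens: int) -> int:
    """Whole blocks needed for a request (assumption: output is rounded up)."""
    return math.ceil(n_tokens / BLOCK_TOKENS)


def seconds_range(n_tokens: int, tok_s: tuple[int, int]) -> tuple[float, float]:
    """(best, worst) generation time in seconds, ignoring model load and prompt processing."""
    low, high = tok_s
    return (n_tokens / high, n_tokens / low)


if __name__ == "__main__":
    n = 100_000  # hypothetical synthetic-data batch, in output tokens
    print("blocks:", blocks_needed(n))
    print("dLLM seconds:", seconds_range(n, DLLM_TOK_S))
    print("AR seconds:", seconds_range(n, AR_TOK_S))
```

At the quoted rates, a 100,000-token batch works out to roughly 4 to 8 minutes on the dLLM path versus roughly 28 to 42 minutes on a 32B AR model, which is the gap the routing doctrine is built around.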

      MCP agents                        Fleet inference party
    ┌─────────────┐
    │ Cursor /    │     ┌──────────────┐  ┌─────────────────┐  ┌──────────────┐
    │ meta_mcp /  │────►│ google-ai    │  │ local-llm-mcp   │  │ diffusion-   │
    │ Ednaficator │     │ -mcp (cloud) │  │ (AR stream)     │  │ llm-mcp      │
    └─────────────┘     └──────────────┘  └─────────────────┘  │ (dLLM batch) │
           │                   │                   │           └──────────────┘
           │             quality/chat        default local            │
           └─────────────────── task-based routing ───────────────────┘

| Component | Choice |
|---|---|
| First model | DiffusionGemma 26B-A4B (Apache 2.0) |
| Quant | Q4_K_M GGUF (~16–18 GB VRAM) |
| Runtime | llama.cpp diffusion PR → llama-diffusion-cli |
| Hardware | Goliath: RTX 4090 24 GB, 64 GB RAM, Windows 11 |
| Ollama / LM Studio | Not yet; sidecar first |
| Doc | What's inside |
|---|---|
| HLE & CAIS | Humanity's Last Exam — anti-AI essentialism, goalpost moving, why we route on it |
| PRD | Product requirements, phases, acceptance criteria |
| Narrative | HLE surprise, AI of the Gaps, Chollet, fleet thesis |
| Vision | Mission, catch-them-all, success criteria |
| Architecture | Ports, planned FastMCP layout, MCP tools |
| Fleet integration | Routing rules, handoffs to sibling MCPs |
| Windows scaffold | Build DiffusionGemma on Goliath |
| Research | Papers, links, acronym lineage |
Assessment archive: mcp-central-docs/projects/diffusiongemma (deep dive — HLE tables, Chollet essay, prognosis).
| Send here (diffusion-llm-mcp) | Send elsewhere |
|---|---|
| `batch=true`, throughput priority | `stream=true` → local-llm-mcp |
| Synthetic data, Ednaficator | Quality-critical chat → google-ai-mcp |
| Code infill, constrained format | Default local chat → AR |
| `paradigm=dllm` | Multimodal vision (VRAM TBD) → cloud |
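The routing table can be read as a small decision function. The sketch below is one possible reading, not the repo's implementation: the `Task` fields, the `kind` labels, and the rule precedence (streaming first, then vision, then dLLM signals) are all assumptions. It maps "cloud" to google-ai-mcp because the fleet diagram labels that server as the cloud one.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical request descriptor; field names mirror the table, not a real API."""
    stream: bool = False
    batch: bool = False
    paradigm: str = ""       # "dllm" forces the diffusion path
    kind: str = "chat"       # hypothetical labels, see DLLM_KINDS
    quality_critical: bool = False


# Workloads the table sends to diffusion-llm-mcp (labels are illustrative).
DLLM_KINDS = {"synthetic_data", "ednaficator", "code_infill", "constrained_format"}


def route(task: Task) -> str:
    """Return the name of the MCP server that should handle the task."""
    if task.stream:
        # dLLM output arrives in blocks, so streaming always goes to the AR server.
        return "local-llm-mcp"
    if task.kind == "multimodal_vision":
        # Table says "cloud"; assumed here to mean google-ai-mcp.
        return "google-ai-mcp"
    if task.paradigm == "dllm" or task.batch or task.kind in DLLM_KINDS:
        return "diffusion-llm-mcp"
    if task.kind == "chat" and task.quality_critical:
        return "google-ai-mcp"
    return "local-llm-mcp"  # default local chat -> AR


if __name__ == "__main__":
    print(route(Task(batch=True)))
    print(route(Task(stream=True, paradigm="dllm")))
    print(route(Task(quality_critical=True)))
```

Putting the `stream` check first encodes the "no streaming" constraint as a hard rule: a caller that asks for both `stream=true` and `paradigm=dllm` gets the AR server rather than an error. A real router might prefer to reject that combination.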
| Version | Milestone |
|---|---|
| v0.1.0 ✅ | Docs, PRD, ports, GitHub publish |
| v0.2.0 | FastMCP sidecar + dllm_generate / dllm_status |
| v0.3.0 | Provider registry, optional web UI, Ednaficator e2e |
| v1.0.0 | Ollama-native path when upstream merges; multi-model |
See CHANGELOG.md.
No Python server yet; run the model directly:

    # Full guide: docs/WINDOWS_SCAFFOLD.md
    .\llama-diffusion-cli.exe -m diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 512

Smoke checklist in docs/WINDOWS_SCAFFOLD.md.
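Until the sidecar exists, the command above can be wrapped in a few lines of Python. This is a minimal sketch, not the planned `dllm_generate` tool: `build_command` and `run_model` are hypothetical names, only the flags that appear in the command above are used, and the parameter names `n_predict` and `gpu_layers` assume `-n` and `-ngl` carry their usual llama.cpp meanings in llama-diffusion-cli.

```python
import subprocess
from pathlib import Path


def build_command(exe: str, model: str, n_predict: int = 512, gpu_layers: int = 99) -> list[str]:
    """Return the argv list for the README command.

    Only flags shown in the README are emitted. The parameter names are an
    assumption about what -n and -ngl mean for this binary.
    """
    return [exe, "-m", model, "-ngl", str(gpu_layers), "-cnv", "-n", str(n_predict)]


def run_model(exe: str, model: str, **kwargs: int) -> int:
    """Run the CLI in the current console and return its exit code."""
    for path in (exe, model):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    # Arguments are passed as a list (no shell=True), so paths with spaces are safe.
    return subprocess.run(build_command(exe, model, **kwargs), check=False).returncode


if __name__ == "__main__":
    print(build_command(r".\llama-diffusion-cli.exe", "diffusiongemma-26B-A4B-it-Q4_K_M.gguf"))
```

Because the command includes `-cnv`, `run_model` inherits the console's stdin and stdout rather than capturing output. A batch-oriented wrapper would likely need a different invocation, which this sketch does not guess at.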
Part of the sandraschi MCP fleet — same patterns as devices-mcp, worldlabs-mcp, ros-mcp:
- FastMCP 3.2 + FastAPI (when implemented)
- Adjacent frontend/backend ports
- just + uv + ruff + pytest
- Provider plugin registry for new models
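The provider registry is not implemented yet, so the following is only a sketch of the shape it could take. Every name here (`Provider`, `register_provider`, `get_provider`, the field names) is hypothetical; the single entry reuses the model file and runtime named elsewhere in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical description of one dLLM backend."""
    name: str
    runtime: str      # executable that serves the model
    model_file: str   # GGUF file passed to the runtime


_REGISTRY: dict[str, Provider] = {}


def register_provider(provider: Provider) -> None:
    """Add a provider; refuse silent overwrites of an existing name."""
    if provider.name in _REGISTRY:
        raise ValueError(f"provider already registered: {provider.name}")
    _REGISTRY[provider.name] = provider


def get_provider(name: str) -> Provider:
    """Look up a provider by name, with a readable error if it is missing."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"unknown provider: {name}") from None


# Specimen #1, using the runtime and model file named in this README.
register_provider(
    Provider(
        name="diffusiongemma",
        runtime="llama-diffusion-cli",
        model_file="diffusiongemma-26B-A4B-it-Q4_K_M.gguf",
    )
)

if __name__ == "__main__":
    print(get_provider("diffusiongemma"))
```

Adding a second model would then be a single `register_provider` call, which is the property the catch-them-all doctrine needs from the registry.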
Doctrine: Catch them all — AR, dLLM, VLA, LWM each get a repo. No pivoting on hype.
Phase 0 is narrative-heavy by design. Issues welcome on scaffold failures, routing proposals, and PRC dLLM sightings. Code contributions open at v0.2.0 sidecar milestone.
MIT — sandraschi, 2026.
Acronym du jour: dLLM · DiffusionGemma is specimen #1 · HLE is why we route here · The gap closes from the tall grass
Footnotes
1. Humanity's Last Exam (HLE): CAIS + Scale AI's anti-saturation frontier benchmark. Expert questions filtered to defeat frontier LLMs at design time. DiffusionGemma's no-tools win is the routing signal for this repo. Full treatment: human essentialism, benchmark priesthood, goalpost moving, and why MMLU is the wrong scoreboard.