diffusion-llm-mcp

FastMCP fleet server for diffusion language models — batch-speed local inference when autoregression is the wrong tool.

dLLM is the acronym du jour (after LWM, VLA). This repo catches it. DiffusionGemma first; Qwen-Diffusion and friends next. Gotta catch 'em all.

What it does (and will do)

| Today (v0.1) | Next (v0.2+) |
| --- | --- |
| Fleet narrative, PRD, routing doctrine | `dllm_generate` MCP tool |
| Goliath Windows scaffold | FastMCP sidecar → `llama-diffusion-cli` |
| Port reservation 10834 / 10835 | Health, status, batch endpoints |
| HLE / catch-them-all assessment | Provider registry for multiple dLLMs |

This is not a replacement for local-llm-mcp (autoregressive chat) or cloud Gemini. It complements them: route batch and frontier-reasoning workloads here; route streaming chat to them.

Why this exists

DiffusionGemma changed the calculus:

~200–400 tok/s on Goliath (RTX 4090) vs ~40–60 tok/s for a 32B autoregressive (AR) model
11.0% HLE no-tools — beats its own Gemma 4 AR twin (8.7%) on the exam CAIS built to stump AI¹

No streaming — 256-token blocks; wrong for chat, right for synthetic data and Ednaficator
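To make the throughput and block-size figures above concrete, here is a minimal Python sketch of the arithmetic. It assumes output arrives in whole 256-token blocks and treats the quoted tok/s ranges as given; the function names `blocks_needed` and `seconds_range` are hypothetical and not part of this repo.

```python
import math

BLOCK_TOKENS = 256  # block size quoted in the list above


def blocks_needed(n_tokens: int) -> int:
    """Whole 256-token blocks required to cover n_tokens (assumes whole-block output)."""
    if n_tokens <= 0:
        return 0
    return math.ceil(n_tokens / BLOCK_TOKENS)


def seconds_range(n_tokens: int, slow_tps: float, fast_tps: float) -> tuple[float, float]:
    """(best, worst) wall-clock seconds for n_tokens across a throughput range."""
    return (n_tokens / fast_tps, n_tokens / slow_tps)


# Compare the two quoted ranges for a 512-token generation.
dllm_best, dllm_worst = seconds_range(512, slow_tps=200, fast_tps=400)
ar_best, ar_worst = seconds_range(512, slow_tps=40, fast_tps=60)
print(f"blocks: {blocks_needed(512)}")
print(f"dLLM: {dllm_best:.2f}-{dllm_worst:.2f} s, AR: {ar_best:.2f}-{ar_worst:.2f} s")
```

Under these assumptions a 512-token request is two blocks, and the caller sees nothing until a block completes, which is why the same numbers that favor batch jobs work against interactive chat.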

Different runtime (llama-diffusion-cli, not Ollama). Different output shape. Different sweet spot. Own repo.

  MCP agents                    Fleet inference party
 ┌─────────────┐
 │ Cursor /    │     ┌──────────────┐  ┌─────────────────┐  ┌──────────────┐
 │ meta_mcp /  │────►│ google-ai    │  │ local-llm-mcp   │  │ diffusion-   │
 │ Ednaficator │     │ -mcp (cloud) │  │ (AR stream)     │  │ llm-mcp      │
 └─────────────┘     └──────────────┘  └─────────────────┘  │ (dLLM batch) │
       │                    │                  │             └──────────────┘
       │              quality/chat         default local            │
       └────────────────── task-based routing ──────────────────────┘

Quick facts


| Item | Detail |
| --- | --- |
| First model | DiffusionGemma 26B-A4B (Apache 2.0) |
| Quant | Q4_K_M GGUF (~16–18 GB VRAM) |
| Runtime | llama.cpp diffusion PR → `llama-diffusion-cli` |
| Hardware | Goliath: RTX 4090 24 GB, 64 GB RAM, Windows 11 |
| Ollama / LM Studio | Not yet; sidecar first |

Documentation

| Doc | What's inside |
| --- | --- |
| HLE & CAIS | Humanity's Last Exam: anti-AI essentialism, goalpost moving, why we route on it |
| PRD | Product requirements, phases, acceptance criteria |
| Narrative | HLE surprise, AI of the Gaps, Chollet, fleet thesis |
| Vision | Mission, catch-them-all, success criteria |
| Architecture | Ports, planned FastMCP layout, MCP tools |
| Fleet integration | Routing rules, handoffs to sibling MCPs |
| Windows scaffold | Build DiffusionGemma on Goliath |
| Research | Papers, links, acronym lineage |

Assessment archive: mcp-central-docs/projects/diffusiongemma (deep dive — HLE tables, Chollet essay, prognosis).

Routing cheat sheet

| Send here (`diffusion-llm-mcp`) | Send elsewhere |
| --- | --- |
| `batch=true`, throughput priority | `stream=true` → `local-llm-mcp` |
| Synthetic data, Ednaficator | Quality-critical chat → `google-ai-mcp` |
| Code infill, constrained format | Default local chat → AR |
| `paradigm=dllm` | Multimodal vision (VRAM TBD) → cloud |
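The cheat sheet can be read as a small decision function. The sketch below is one possible encoding in Python, not the fleet's actual router. The `Request` fields other than `batch`, `stream`, and `paradigm`, the task labels, the rule precedence (streaming wins over everything), and the mapping of "cloud" to `google-ai-mcp` are all assumptions made for illustration.

```python
from dataclasses import dataclass

DLLM = "diffusion-llm-mcp"
LOCAL_AR = "local-llm-mcp"
CLOUD = "google-ai-mcp"  # assumption: "cloud" in the table means this server

# Hypothetical task labels for the workloads named in the cheat sheet.
DLLM_TASKS = {"synthetic_data", "ednaficator", "code_infill", "constrained_format"}


@dataclass
class Request:
    task: str = "chat"
    batch: bool = False
    stream: bool = False
    paradigm: str = ""
    quality_critical: bool = False  # hypothetical flag
    multimodal: bool = False        # hypothetical flag


def route(req: Request) -> str:
    """Pick a target MCP server; the rule order is an assumed precedence."""
    if req.stream:
        # A dLLM emits whole blocks, so streaming requests go to the AR server.
        return LOCAL_AR
    if req.multimodal:
        return CLOUD
    if req.paradigm == "dllm" or req.batch or req.task in DLLM_TASKS:
        return DLLM
    if req.quality_critical:
        return CLOUD
    return LOCAL_AR  # default local chat


print(route(Request(task="synthetic_data", batch=True)))
```

Putting the `stream` check first means a request that asks for both `stream=true` and `paradigm=dllm` falls back to `local-llm-mcp` rather than failing; a stricter router could reject that combination instead.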

Roadmap

| Version | Milestone |
| --- | --- |
| v0.1.0 ✅ | Docs, PRD, ports, GitHub publish |
| v0.2.0 | FastMCP sidecar + `dllm_generate` / `dllm_status` |
| v0.3.0 | Provider registry, optional web UI, Ednaficator e2e |
| v1.0.0 | Ollama-native path when upstream merges; multi-model |

See CHANGELOG.md.
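The v0.2.0 milestone names a `dllm_status` tool, and ports 10834 / 10835 are already reserved. As a rough idea of what a status payload could contain, here is a standard-library sketch. The function names and the payload shape are hypothetical, the FastMCP registration is deliberately left out, and the bind-based port check is only an approximation of "is something already listening".

```python
import shutil
import socket

RESERVED_PORTS = (10834, 10835)   # from the port reservation above
CLI_NAME = "llama-diffusion-cli"  # runtime named in this README


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind the port locally, False if the bind fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def dllm_status_sketch() -> dict:
    """Hypothetical status payload: runtime discovery plus port availability."""
    return {
        "runtime": CLI_NAME,
        "runtime_on_path": shutil.which(CLI_NAME) is not None,
        "ports_free": {port: port_is_free(port) for port in RESERVED_PORTS},
    }


print(dllm_status_sketch())
```

Which of the two ports serves the frontend and which the backend is not stated here, so the sketch reports both without assigning roles.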

Local setup (manual, Phase 0)

No Python server yet — run the model directly:

# Full guide: docs/WINDOWS_SCAFFOLD.md
.\llama-diffusion-cli.exe -m diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 512

Smoke checklist in docs/WINDOWS_SCAFFOLD.md.
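Until the sidecar exists, the manual command can be wrapped in a few lines of Python for scripted smoke runs. This is a sketch under assumptions: it reuses only the flags shown in the command above (`-m`, `-ngl`, `-cnv`, `-n`) with the same values as defaults, and the helper names `build_command` and `run_model` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(
    binary: Path,
    model: Path,
    gpu_layers: int = 99,
    n_tokens: int = 512,
    conversation: bool = True,
) -> list[str]:
    """Assemble the argument list using only the flags from the manual command."""
    cmd = [str(binary), "-m", str(model), "-ngl", str(gpu_layers)]
    if conversation:
        cmd.append("-cnv")
    cmd += ["-n", str(n_tokens)]
    return cmd


def run_model(binary: Path, model: Path) -> int:
    """Run the CLI in the foreground and return its exit code."""
    for required in (binary, model):
        if not required.is_file():
            raise FileNotFoundError(f"missing file: {required}")
    # Passing a list (no shell=True) avoids quoting problems with Windows paths.
    return subprocess.run(build_command(binary, model), check=False).returncode


print(build_command(Path("llama-diffusion-cli.exe"),
                    Path("diffusiongemma-26B-A4B-it-Q4_K_M.gguf")))
```

Keeping command construction separate from execution makes the argument list testable on a machine without the model or a GPU.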

Fleet context

Part of the sandraschi MCP fleet — same patterns as devices-mcp, worldlabs-mcp, ros-mcp:

FastMCP 3.2 + FastAPI (when implemented)
Adjacent frontend/backend ports
just + uv + ruff + pytest
Provider plugin registry for new models
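The provider plugin registry in the list above is planned, not implemented. One common way to build such a registry in Python is a decorator that maps a model name to a generate function, sketched below. Every name here (`register_provider`, `get_provider`, the `(prompt, max_tokens)` signature, the stub provider) is hypothetical and only illustrates the pattern.

```python
from typing import Callable, Dict, List

# A provider is any callable taking (prompt, max_tokens) and returning text.
Generator = Callable[[str, int], str]

_PROVIDERS: Dict[str, Generator] = {}


def register_provider(name: str) -> Callable[[Generator], Generator]:
    """Decorator that records a generate function under a model name."""
    def decorator(fn: Generator) -> Generator:
        if name in _PROVIDERS:
            raise ValueError(f"provider already registered: {name}")
        _PROVIDERS[name] = fn
        return fn
    return decorator


def get_provider(name: str) -> Generator:
    """Look up a provider, listing the known names when the lookup fails."""
    try:
        return _PROVIDERS[name]
    except KeyError:
        raise KeyError(f"unknown provider {name!r}; known: {sorted(_PROVIDERS)}") from None


def list_providers() -> List[str]:
    return sorted(_PROVIDERS)


@register_provider("diffusiongemma")
def diffusiongemma_stub(prompt: str, max_tokens: int) -> str:
    # Placeholder only: a real provider would invoke the diffusion runtime.
    return f"[stub, max_tokens={max_tokens}] {prompt}"


print(list_providers())
```

With this shape, adding a second dLLM means writing one decorated function rather than touching the routing or server code.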

Doctrine: Catch them all — AR, dLLM, VLA, LWM each get a repo. No pivoting on hype.

Contributing

Phase 0 is narrative-heavy by design. Issues are welcome on scaffold failures, routing proposals, and PRC dLLM sightings. Code contributions open at the v0.2.0 sidecar milestone.

License

MIT — sandraschi, 2026.

Acronym du jour: dLLM · DiffusionGemma is specimen #1 · HLE is why we route here · The gap closes from the tall grass

¹ Humanity's Last Exam (HLE): CAIS + Scale AI's anti-saturation frontier benchmark. Expert questions filtered to defeat frontier LLMs at design time. DiffusionGemma's no-tools win is the routing signal for this repo. Full treatment: human essentialism, benchmark priesthood, goalpost moving, and why MMLU is the wrong scoreboard.
