sandraschi/diffusion-llm-mcp

diffusion-llm-mcp


FastMCP fleet server for diffusion language models — batch-speed local inference when autoregression is the wrong tool.

dLLM is the acronym du jour (after LWM, VLA). This repo catches it. DiffusionGemma first; Qwen-Diffusion and friends next. Gotta catch 'em all.


What it does (and will do)

| Today (v0.1) | Next (v0.2+) |
| --- | --- |
| Fleet narrative, PRD, routing doctrine | dllm_generate MCP tool |
| Goliath Windows scaffold | FastMCP sidecar → llama-diffusion-cli |
| Port reservation 10834 / 10835 | Health, status, batch endpoints |
| HLE / catch-them-all assessment | Provider registry for multiple dLLMs |

Not a replacement for local-llm-mcp (autoregressive chat) or cloud Gemini. Complement — route batch and frontier-reasoning workloads here; route streaming chat there.


Why this exists

DiffusionGemma changed the calculus:

  • ~200–400 tok/s on Goliath (RTX 4090) vs ~40–60 for 32B AR
  • 11.0% on HLE with no tools, beating its own Gemma 4 AR twin (8.7%) on the exam CAIS built to stump AI [1]
  • No streaming — 256-token blocks; wrong for chat, right for synthetic data and Ednaficator
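
To make the throughput gap concrete, here is a small back-of-the-envelope sketch. It uses only the rough figures quoted above (they are this README's estimates, not measurements), and it assumes output is produced in whole 256-token blocks, so a request is rounded up to the next block. The function names are hypothetical.

```python
import math

BLOCK_TOKENS = 256  # block size quoted in the list above


def blocks_needed(n_tokens: int) -> int:
    """Whole blocks needed to cover n_tokens (assumes rounding up)."""
    return math.ceil(n_tokens / BLOCK_TOKENS)


def seconds_for(n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds at a steady token rate (ignores load and warm-up time)."""
    return n_tokens / tok_per_s


if __name__ == "__main__":
    n = 100_000  # e.g. one synthetic-data batch
    print("blocks:", blocks_needed(n))
    print("dLLM s:", seconds_for(n, 400), "to", seconds_for(n, 200))
    print("AR s:  ", round(seconds_for(n, 60)), "to", seconds_for(n, 40))
```

At these rates a 100,000-token batch takes roughly 4 to 8 minutes on the diffusion path versus roughly 28 to 42 minutes on a 32B AR model, which is the whole argument for routing batch work here.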

Different runtime (llama-diffusion-cli, not Ollama). Different output shape. Different sweet spot. Own repo.

  MCP agents                    Fleet inference party
 ┌─────────────┐
 │ Cursor /    │     ┌──────────────┐  ┌─────────────────┐  ┌──────────────┐
 │ meta_mcp /  │────►│ google-ai    │  │ local-llm-mcp   │  │ diffusion-   │
 │ Ednaficator │     │ -mcp (cloud) │  │ (AR stream)     │  │ llm-mcp      │
 └─────────────┘     └──────────────┘  └─────────────────┘  │ (dLLM batch) │
       │                    │                  │             └──────────────┘
       │              quality/chat         default local            │
       └────────────────── task-based routing ──────────────────────┘

Quick facts

| | |
| --- | --- |
| First model | DiffusionGemma 26B-A4B (Apache 2.0) |
| Quant | Q4_K_M GGUF (~16–18 GB VRAM) |
| Runtime | llama.cpp diffusion PR → llama-diffusion-cli |
| Hardware | Goliath: RTX 4090 24 GB, 64 GB RAM, Windows 11 |
| Ollama / LM Studio | Not yet; sidecar first |

Documentation

| Doc | What's inside |
| --- | --- |
| HLE & CAIS | Humanity's Last Exam: anti-AI essentialism, goalpost moving, why we route on it |
| PRD | Product requirements, phases, acceptance criteria |
| Narrative | HLE surprise, AI of the Gaps, Chollet, fleet thesis |
| Vision | Mission, catch-them-all, success criteria |
| Architecture | Ports, planned FastMCP layout, MCP tools |
| Fleet integration | Routing rules, handoffs to sibling MCPs |
| Windows scaffold | Build DiffusionGemma on Goliath |
| Research | Papers, links, acronym lineage |

Assessment archive: mcp-central-docs/projects/diffusiongemma (deep dive — HLE tables, Chollet essay, prognosis).


Routing cheat sheet

| Send here (diffusion-llm-mcp) | Send elsewhere |
| --- | --- |
| batch=true, throughput priority | stream=true → local-llm-mcp |
| Synthetic data, Ednaficator | Quality-critical chat → google-ai-mcp |
| Code infill, constrained format | Default local chat → AR |
| paradigm=dllm | Multimodal vision (VRAM TBD) → cloud |
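
The cheat sheet above can be read as a small decision function. The sketch below is illustrative only: the `Task` fields, the `route` function, and the order in which the rules are checked are assumptions, since no router exists in this repo at v0.1. It also assumes that "cloud" means google-ai-mcp and that "AR" means local-llm-mcp, as the fleet diagram suggests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical request descriptor; field names mirror the cheat sheet."""
    stream: bool = False
    batch: bool = False
    paradigm: Optional[str] = None
    quality_critical: bool = False
    multimodal: bool = False


def route(task: Task) -> str:
    """Return the name of the fleet server that should handle the task."""
    if task.multimodal:
        return "google-ai-mcp"      # multimodal vision goes to cloud (VRAM TBD locally)
    if task.quality_critical:
        return "google-ai-mcp"      # quality-critical chat
    if task.stream:
        return "local-llm-mcp"      # dLLM output is block-based, not streamed
    if task.batch or task.paradigm == "dllm":
        return "diffusion-llm-mcp"  # throughput-priority batch work
    return "local-llm-mcp"          # default local chat stays autoregressive


if __name__ == "__main__":
    print(route(Task(batch=True)))
    print(route(Task(stream=True, batch=True)))
```

Checking `stream` before `batch` is a deliberate choice in this sketch: a request that needs token-by-token output cannot be served by a block-based model, however large the batch.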

Roadmap

| Version | Milestone |
| --- | --- |
| v0.1.0 | Docs, PRD, ports, GitHub publish |
| v0.2.0 | FastMCP sidecar + dllm_generate / dllm_status |
| v0.3.0 | Provider registry, optional web UI, Ednaficator e2e |
| v1.0.0 | Ollama-native path when upstream merges; multi-model |

See CHANGELOG.md.


Local setup (manual, Phase 0)

No Python server yet — run the model directly:

# Full guide: docs/WINDOWS_SCAFFOLD.md
.\llama-diffusion-cli.exe -m diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 512

Smoke checklist in docs/WINDOWS_SCAFFOLD.md.
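
Until the sidecar lands, the same command can be assembled from Python, which is roughly what a v0.2 wrapper would need to do first. This is a minimal sketch, not the planned implementation: `build_command` and `run_interactive` are hypothetical names, and the flags are copied verbatim from the command above with nothing else assumed about the CLI.

```python
import subprocess
from pathlib import Path


def build_command(exe: Path, model: Path, n_tokens: int = 512,
                  gpu_layers: int = 99) -> list[str]:
    """Assemble the argument list using only the flags shown in this README."""
    return [
        str(exe),
        "-m", str(model),
        "-ngl", str(gpu_layers),
        "-cnv",
        "-n", str(n_tokens),
    ]


def run_interactive(exe: Path, model: Path) -> int:
    """Launch the CLI attached to the current terminal and return its exit code."""
    return subprocess.run(build_command(exe, model)).returncode


if __name__ == "__main__":
    cmd = build_command(Path("llama-diffusion-cli.exe"),
                        Path("diffusiongemma-26B-A4B-it-Q4_K_M.gguf"))
    print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with Windows paths that contain spaces.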


Fleet context

Part of the sandraschi MCP fleet — same patterns as devices-mcp, worldlabs-mcp, ros-mcp:

  • FastMCP 3.2 + FastAPI (when implemented)
  • Adjacent frontend/backend ports
  • just + uv + ruff + pytest
  • Provider plugin registry for new models

Doctrine: Catch them all — AR, dLLM, VLA, LWM each get a repo. No pivoting on hype.


Contributing

Phase 0 is narrative-heavy by design. Issues welcome on scaffold failures, routing proposals, and PRC dLLM sightings. Code contributions open at v0.2.0 sidecar milestone.


License

MIT — sandraschi, 2026.


Acronym du jour: dLLM · DiffusionGemma is specimen #1 · HLE is why we route here · The gap closes from the tall grass

Footnotes

  1. Humanity's Last Exam (HLE) — CAIS + Scale AI's anti-saturation frontier benchmark. Expert questions filtered to defeat frontier LLMs at design time. DiffusionGemma's no-tools win is the routing signal for this repo. Full treatment: human essentialism, benchmark priesthood, goalpost moving, and why MMLU is the wrong scoreboard.
