One field instrument for MCP server reviews. Point it at a client's MCP
server (or an offline tools/list dump) and walk away with one graded
report that answers three questions:
- What does this surface cost? Per-turn context tax, per tool, before any work.
- Does the surface confuse the model? Wrong-tool selection, spurious firing on off-domain tasks.
- Can the surface be smaller? Which tools merge, which should be MCP resources, and whether the real fix is consolidation or just-in-time loading.
Many sensors, one voice: wrapped tools contribute measurements only; the grading engine owns all interpretation.
See real output: example reports - full mcp-xray
audits of two production MCP servers (OrionBelt Semantic Layer & Analytics),
rendered exactly as the tool emits them.
Built by RALFORION d.o.o. - the team behind the OrionBelt Semantic Layer. See Professional review & commercial use.
pip install -e . # core (offline static + consolidation half)
pip install -e ".[api]" # + authoritative token counting & LLM behavioral probes
pip install -e ".[live]" # + stdio / http / sse transports
pip install -e ".[dev]" # everything + pytest
Note on naming. The -e . commands above install from a local clone (run git clone first). On PyPI the distribution is published as mcp-xray-audit (the bare mcp-xray name belongs to an unrelated Jira Xray project). The import package (mcp_xray) and the CLI command (mcp-xray) are unchanged.
The static + consolidation half runs keyless and offline from a tools/list
dump - no API key, no live server.
# Offline: static hygiene + consolidation, rendered as the client artifact
mcp-xray analyze --tools-json dump.json
# Authoritative token numbers (must match the client's production model)
mcp-xray analyze --tools-json dump.json --token-backend api --model claude-sonnet-4-6
# Live server, full audit including behavioral probe
ANTHROPIC_API_KEY=... mcp-xray analyze --stdio "gmail-mcp serve" --llm --model claude-sonnet-4-6
# Authed HTTP/SSE server -> pass a bearer token (repeatable --header). Prefer the
# MCP_XRAY_HTTP_HEADER env var so the token stays out of ps/shell history.
mcp-xray analyze --http https://server.example/mcp --header "Authorization: Bearer $TOKEN"
# With the client's labeled golden queries -> labeled selection accuracy
mcp-xray analyze --stdio "gmail-mcp serve" --llm --model claude-sonnet-4-6 --queries golden.yaml
# Phase-swapped surface (tool list changes by journey phase) -> per-phase audit
mcp-xray analyze --phases phases.yaml
# Just the capability-reduction analysis
mcp-xray consolidate --tools-json dump.json
# Validate a proposed merge: tokens + selection accuracy, before vs after
mcp-xray validate --before base.json --after merged.json --queries golden.yaml --model claude-sonnet-4-6
# Persist a run, re-render markdown later (fingerprinted for drift)
mcp-xray analyze --tools-json dump.json --out runs/2026-05-31/
mcp-xray report --run runs/2026-05-31/
Each run folder is self-contained and replayable: alongside report.json/
report.md, analyze writes the run's input under <run>/dumps/ (a phased
run's phases.yaml + per-phase tools-json, or a flat run's tools.json). So
you can re-grade or re-probe a past version offline - no live server, no
re-capture - e.g. mcp-xray analyze --phases runs/<version>/dumps/phases.yaml.
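The run folders above are described as fingerprinted for drift. The source does not say how the fingerprint is computed, so the following is only a minimal sketch of one plausible approach: hash a canonical serialization of the tool list so that reordering tools does not change the fingerprint but any edit to a name, description, or schema does. The function name `surface_fingerprint` and the choice of SHA-256 are assumptions for illustration, not mcp-xray's actual scheme.

```python
import hashlib
import json


def surface_fingerprint(tools: list[dict]) -> str:
    """Hypothetical drift fingerprint: SHA-256 over canonical JSON.

    Tools are sorted by name and keys are sorted, so the digest depends
    only on content, not on listing order or key order.
    """
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"name": "create_label", "description": "Create a label"}
b = {"name": "search_threads", "description": "Search threads"}
b_edited = {"name": "search_threads", "description": "Search email threads"}

same_order_independent = surface_fingerprint([a, b]) == surface_fingerprint([b, a])
drifted = surface_fingerprint([a, b]) != surface_fingerprint([a, b_edited])
print(same_order_independent, drifted)
```

Comparing the digest stored with an old run against one computed from a fresh dump would show whether the surface has changed since that audit.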
Per-probe deep-dives live in docs/.
| Probe | Owned? | Needs | Emits |
|---|---|---|---|
| static_hygiene | owned (authoritative) | inventory | per-tool token cost (leave-one-out), hidden injectors, schema smells - see docs/static-hygiene-probe.md |
| consolidate | owned | inventory | merge candidates, resource candidates, JIT framing - see docs/consolidation-probe.md & merge-candidates.md |
| noise | owned | LLM + key | selection accuracy / confusability proxy / distraction - see docs/behavioral-probe.md |
| mcp_checkup, token_analyzer | wrapped (v0.2) | external bin + config | token cost, duplicates - measurements only |
Skipped probes drop their weight and are reported "not measured," never
scored zero. The authoritative per-tool token figure is computed in-house via
the Anthropic count_tokens endpoint; the offline backend is a flagged
ESTIMATE and never the headline number.
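The leave-one-out idea behind the per-tool token figure can be sketched as follows: count the tokens of the whole surface, then recount with each tool removed, and attribute the difference to that tool. This is an illustrative sketch only. The counter is pluggable; the default `estimate_tokens` is a crude stand-in (roughly four characters per token, an assumption), not the authoritative count_tokens backend, and the function names are hypothetical.

```python
import json
from typing import Callable

Tool = dict
Counter = Callable[[list], int]


def estimate_tokens(tools: list) -> int:
    """Crude offline ESTIMATE (assumed ~4 chars per token of serialized JSON)."""
    return len(json.dumps(tools, sort_keys=True)) // 4


def leave_one_out(tools: list, count: Counter = estimate_tokens) -> dict:
    """Cost of each tool = tokens(full surface) - tokens(surface without it)."""
    full = count(tools)
    return {
        tool["name"]: full - count(tools[:i] + tools[i + 1:])
        for i, tool in enumerate(tools)
    }


tools = [
    {"name": "ping", "description": "Ping."},
    {
        "name": "search_threads",
        "description": "Search email threads by sender, subject, label, "
                       "date range, or free text, returning matching threads.",
    },
]
costs = leave_one_out(tools)
print(costs)  # the verbose tool carries the larger share
```

Swapping `count` for a function that calls an authoritative tokenizer keeps the same attribution logic while replacing the estimate.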
Wrapped sensors (mcp_checkup, token_analyzer) run when you pass --client-config <path> and their binary is installed; otherwise they're reported "not measured." They contribute measurements only - never grades.
Five weighted dimensions roll to a 0–100 score and letter grade:
context efficiency (30%), selection robustness (25%), surface redundancy (15%),
schema hygiene (15%), description quality (15%). Full roll-up math in
docs/grading.md.
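As a rough illustration of how the five weights could combine, the sketch below computes a weighted average over the dimensions that were actually measured. It assumes that a skipped dimension's weight is dropped and the remaining weights are renormalized, which is one reading of "drop their weight"; the authoritative math is in docs/grading.md. The function name and the dimension keys are hypothetical, and letter-grade cutoffs are deliberately left out because the source does not state them.

```python
from typing import Optional

# Weights as listed above; the dictionary keys are illustrative names.
WEIGHTS = {
    "context_efficiency": 0.30,
    "selection_robustness": 0.25,
    "surface_redundancy": 0.15,
    "schema_hygiene": 0.15,
    "description_quality": 0.15,
}


def roll_up(scores: dict) -> Optional[float]:
    """Weighted 0-100 score over measured dimensions.

    A value of None means "not measured": its weight is dropped and the
    rest are renormalized (assumption), rather than scoring it as zero.
    """
    measured = {k: v for k, v in scores.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in measured)
    if total_weight == 0:
        return None
    return sum(WEIGHTS[k] * v for k, v in measured.items()) / total_weight


offline_run = {
    "context_efficiency": 80,
    "selection_robustness": None,  # behavioral probe skipped (no API key)
    "surface_redundancy": 80,
    "schema_hygiene": 80,
    "description_quality": 80,
}
print(round(roll_up(offline_run), 1))  # 80.0, not 60.0: the skipped probe is not a zero
```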
tools-json accepts a full MCP result ({"tools": [...], "instructions": "..."}),
a bare list, or a {"result": {"tools": [...]}} envelope.
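The three accepted shapes can be normalized to a plain tool list with a few checks. The following is a minimal sketch, not the tool's own loader; `extract_tools` is a hypothetical name and the sample tool is illustrative.

```python
from typing import Any


def extract_tools(payload: Any) -> list:
    """Return the tool list from any of the three accepted dump shapes."""
    if isinstance(payload, list):  # bare list
        return payload
    if isinstance(payload, dict):
        if "tools" in payload:  # full MCP result
            return payload["tools"]
        result = payload.get("result")
        if isinstance(result, dict) and "tools" in result:  # envelope
            return result["tools"]
    raise ValueError("unrecognised tools-json shape")


tool = {
    "name": "create_label",
    "description": "Create a label",
    "inputSchema": {"type": "object"},
}
shapes = [
    [tool],
    {"tools": [tool], "instructions": "..."},
    {"result": {"tools": [tool]}},
]
names = [[t["name"] for t in extract_tools(shape)] for shape in shapes]
print(names)  # [['create_label'], ['create_label'], ['create_label']]
```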
golden queries (--queries):
queries:
  - query: "create a new label called Work"
    expected_tools: [create_label]
  - query: "find emails from my boss"
    expected_tools: [search_threads]
call-manifest (--call-manifest, safe result-size probing - operator
asserts these are read-only/sandbox calls). On a live, non-phased run
(--stdio/--http/--sse) each listed tool is called once and its result size
(chars + bytes) is measured and reported, since tool outputs cost context
too. Offline or phased runs warn and skip (no server to call). mcp-xray never
calls a tool without a manifest - see docs/safe-calls.md:
calls:
  - tool: list_labels
    args: {}
Some servers don't expose one static toolset - they swap the tool list by
journey phase (e.g. a "design" phase before a model is loaded, a "run" phase
after). A single tools/list snapshot can't see a swap, so point mcp-xray at a
phases manifest - one tools-json dump per phase:
# phases.yaml
phases:
  design: design.json # tools visible before a model is loaded
  run: run.json # tools visible once a model is loaded
mcp-xray analyze --phases phases.yaml
The phased report:
- Headline tax = the worst phase, not the union - the model only ever carries one phase at a time, so it's not charged for tools it never co-loads.
- Per-phase surface table + carried tools (those visible in more than one phase = the cross-phase cost).
- Union analysis - every distinct tool still gets schema-hygiene + consolidation review.
- Progressive loading is credited, not flagged - ≥2 distinct phases means the server already does the JIT pattern the tool would otherwise recommend.
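The worst-phase headline and the carried-tools idea from the list above can be sketched in a few lines. This is an illustration under stated assumptions: per-tool token costs are taken as given, the tool names other than load_model and all numbers are made up, and `phase_summary` is a hypothetical helper rather than part of mcp-xray.

```python
def phase_summary(phases: dict) -> dict:
    """phases maps phase name -> {tool name: token cost}."""
    totals = {name: sum(tools.values()) for name, tools in phases.items()}
    worst = max(totals, key=totals.get)

    # Count in how many phases each tool is visible.
    seen = {}
    for tools in phases.values():
        for tool in tools:
            seen[tool] = seen.get(tool, 0) + 1
    carried = sorted(t for t, n in seen.items() if n > 1)

    # Union of all distinct tools, for comparison with the headline.
    union = {}
    for tools in phases.values():
        union.update(tools)

    return {
        "headline_phase": worst,
        "headline_tokens": totals[worst],
        "carried": carried,
        "union_tokens": sum(union.values()),
    }


summary = phase_summary({
    "design": {"list_models": 120, "load_model": 200, "describe_schema": 300},
    "run": {"list_models": 120, "run_query": 450, "explain_query": 260},
})
print(summary["headline_phase"], summary["headline_tokens"])  # run 830
print(summary["carried"], summary["union_tokens"])            # ['list_models'] 1330
```

Here the model is charged 830 tokens (the heavier phase), not the 1330 of the union, because the two phases are never loaded together.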
Capture the per-phase dumps with mcp-xray dump while the server is in each
phase - or automate the walk with capture-phases, which drives the journey in
a single session:
# capture.yaml - first phase captured before any call; later phases issue their
# 'advance' tool calls (the ONLY calls made - never inferred), then re-list.
phases:
  - name: design
  - name: run
    advance:
      - tool: load_model
        args: { model_id: "<id>" }
mcp-xray capture-phases --stdio "my-server --multi-model" \
  --capture capture.yaml --out-dir dumps/phases
mcp-xray analyze --phases dumps/phases/phases.yaml
The tool (src/) is generic. Anything specific to a particular MCP server
you're reviewing - captured dumps, phase manifests, golden queries, run outputs -
lives under profiles/<server>/, one directory per server. profiles/ is
git-ignored: engagement data stays local and is never committed. Suggested
per-server layout:
profiles/<server>/
  dumps/ # captured tools/list snapshots (mcp-xray dump)
  phases.yaml # phase manifest (for phase-swapped surfaces)
  golden.yaml # labeled selection queries (--queries)
  call-manifest.yaml # operator-confirmed safe calls (--call-manifest)
  runs/ # report.json + report.md per audit (fingerprinted)
Generic, server-neutral example fixtures live in tests/fixtures/ (e.g. the
synthetic "Acme Catalog" phased server) - those are part of the product and are
committed.
pytest # static + consolidation paths are fully testable offline
tests/contracts/ pins one frozen-fixture test per wrapped adapter so a silent
upstream format change fails in CI, not in front of a client.
v1.4.0 - production instrument. Everything through the behavioral harness is shipped:
- Offline core - static hygiene (authoritative tokens + smells), consolidation (merge/resource candidates, JIT framing), grading, and rendered report. Keyless, runs from a tools/list dump.
- Wrapped sensors - mcp_checkup + token_analyzer adapters with pinned versions and contract tests; measurements only, reconciled against the authoritative count.
- Behavioral - noise probe (selection accuracy / confusability / distraction), resumable (--resume); before/after validate loop; safe result-size probing via call-manifest.
- Phased surfaces - phase-swapped (bucketed) toolsets, capture-phases automation, worst-phase headline tax.
- Replayable runs - self-contained, fingerprinted run folders you can re-grade or re-probe offline.
Remaining roadmap: trace co-occurrence (signal from client call logs + composite-tool proposals).
mcp-xray gives you the grade. Acting on it - prioritising the findings,
remodelling a confusing surface, wiring the validate gate into CI so a
regression can't merge - is what RALFORION does for a
living.
- MCP surface review - we run the full audit against your live servers and hand back a prioritised remediation plan (not just a score). Good first step if your tool surface is large, phase-swapped, or quietly burning context.
- Commercial / embedded use - the BSL 1.1 license lets you use mcp-xray for any internal purpose, including production. Embedding it in a commercial product, or offering it as part of a paid service, needs a commercial license - reach us via ralforion.com.
Copyright 2026 RALFORION d.o.o.
Licensed under the Business Source License 1.1. The Licensed Work will convert to Apache License 2.0 on 2030-06-09.
By contributing to this project, you agree to the Contributor License Agreement.
For commercial licensing inquiries, contact: licensing@ralforion.com