hyperdxio · brandon-pereira · May 20, 2026 · May 20, 2026 · May 20, 2026 · May 22, 2026
diff --git a/.changeset/add-hdx-eval-framework.md b/.changeset/add-hdx-eval-framework.md
@@ -0,0 +1,15 @@
+---
+'@hyperdx/hdx-eval': minor
+---
+
+feat: add AI eval framework for benchmarking MCP servers
+
+New `@hyperdx/hdx-eval` package for benchmarking AI agents against
+observability MCP servers. Generates deterministic synthetic telemetry
+with planted anomalies, spawns Claude Code as an SRE agent, records full
+trajectories, and grades answers using programmatic checks and an
+LLM-as-judge.
+
+Includes 5 scenarios (error-root-cause, latency-spike, noisy-signals,
+segmented-regression, service-health-check), MCP-agnostic N-way
+comparison, blinded judging, and a web viewer for browsing results.
diff --git a/.opencode/commands/eval-summary.md b/.opencode/commands/eval-summary.md
@@ -0,0 +1,157 @@
+---
+description: Analyze HyperDX eval benchmark results and produce a summary report
+---
+
+You are an eval benchmark analyst for HyperDX. Your job is to review eval run
+results and produce a structured summary comparing HyperDX MCP performance
+against the ClickHouse baseline MCP.
+
+## Input
+
+The user will provide either:
+
+- Raw eval output (pasted terminal logs from running the eval harness)
+- A batch directory path under `packages/hdx-eval/runs/`
+- A `$ARGUMENTS` referencing a specific batch timestamp (e.g.
+  `2026-05-21T16-34-35-417Z`)
+
+If the user provides a batch timestamp or directory path, locate the run files
+under `packages/hdx-eval/runs/<timestamp>/`. If they paste terminal output,
+extract the batch path from the "Results written under" line.
+
+If no batch is specified, list `packages/hdx-eval/runs/` and use the **most
+recent** batch directory (sorted by timestamp).
+
+## Analysis Steps
+
+Perform these steps in order. Use the Task tool with parallel subagents where
+possible to speed up analysis.
+
+### 1. Read Summary Files
+
+Read `_summary.md` and `_summary.json` from the batch directory. These contain
+aggregated metrics across all runs.
+
+### 2. Analyze Run Transcripts
+
+Use **parallel Task subagents** — one per MCP (hyperdx, clickhouse). Each
+subagent reads all run JSON files for its MCP
+(`<scenario>/<mcp>/0.json`, `1.json`, `2.json`, etc.) and extracts:
+
+- **Tool call sequence** — ordered list of tool names called with index numbers
+- **Tool errors** — any failed tool calls, including error messages and index
+- **Backtracking patterns** — sequences where the model retried, searched for
+  unavailable tools, or re-ran queries with different parameters
+- **Final answer** — whether the run produced one (final_answer vs max_turns)
+- **Wasted tool calls** — calls that didn't contribute to the final answer (e.g.
+  ToolSearch for tools that don't exist in the sandbox, queries on empty tables,
+  dead-end hypothesis investigations)
+- **Turn budget usage** — how many calls were used vs budget, and whether the
+  model showed awareness of remaining turns
+- **Interesting observations** — anything unusual, surprising, or noteworthy
+  about the run (clever investigation strategies, unexpected tool behavior,
+  novel failure modes, etc.)
+
+### 3. Read Grade Files and Ground Truth
+
+Use a **single Task subagent** to read all `*.grade.json` files, the scenario's
+`ground-truth.json` from
+`packages/hdx-eval/src/scenarios/<scenario>/ground-truth.json`, and the system
+prompt at `packages/hdx-eval/src/harness/systemPrompt.ts`. Extract:
+
+- Per-run combined, programmatic, and judge scores
+- Per-check pass/fail with weights
+- Judge reasoning/rationale per criterion
+- Ground truth expected answers, checks, and distractors
+- System prompt text and any variant-specific guidance
+
+## Default Output
+
+By default, produce the **Quick Stats** output below. Then prompt the user
+asking if they want to see more.
+
+### Quick Stats
+
+Four bullets covering:
+
+- **Accuracy** — combined score comparison with relative percentage
+- **Consistency** — completion rate (final_answer vs max_turns) and score
+  variance across runs
+- **Speed** — wall clock, tool calls, and token usage with relative percentages
+- **Overall** — one-line headline with the key deltas
+
+Then a **Notable Findings** section: 3-5 short bullets highlighting the most
+interesting or surprising things you found in the transcripts. These should be
+specific, concrete observations — not generic patterns. Examples:
+
+- "HyperDX Run 1 scored 79.8% despite 2 SQL errors — it recovered in 2 calls"
+- "ClickHouse Run 2 was the single best run (89.8%) but the other 2 CH runs
+  both hit max_turns"
+- "The feature_flag distractor tripped HyperDX Run 1 because event_deltas
+  surfaced it as a top-changing attribute"
+
+Finally, end with:
+
+> Want to dig deeper? I can show:
+> - **Detailed stats** — full per-run breakdown, per-check wins, and failure
+>   analysis
+> - **Recommendations** — specific tool/prompt changes to improve HyperDX scores
+> - **Both**
+
+### Detailed Stats (on request)
+
+When the user asks for detailed stats, produce:
+
+#### Section 1: Top-Line Stats
+
+A table of metrics where HyperDX beat ClickHouse. For each, compute the
+percentage improvement (relative, not absolute). Focus on:
+
+- Combined score delta
+- Completion rate (final_answer vs max_turns)
+- Programmatic accuracy delta
+- Judge criteria where HyperDX scored notably higher
+- Tool call efficiency (fewer calls = better)
+
+Format as a markdown table with columns: Metric, HyperDX, ClickHouse, Delta,
+Relative.
+
+#### Section 2: Per-Check Wins
+
+A table of the programmatic checks where HyperDX outperformed ClickHouse by the
+largest margin. Include both positive checks (identified the right pattern) and
+negative checks (avoided false positives). Only include checks with a gap of
+20pp or more.
+
+For each check with a large gap, add a brief explanation of **why** the gap
+exists.
+
+#### Section 3: Failure Analysis
+
+For any run (either MCP) that scored below 50% combined or hit max_turns,
+explain what went wrong. Reference specific tool call sequences and errors.
+Contrast with how the other MCP handled the same challenge.
+
+### Recommendations (on request)
+
+When the user asks for recommendations, analyze the run transcripts and produce
+concrete improvement opportunities for the HyperDX MCP tools or system prompt.
+For each opportunity:
+
+1. **Problem** — What happened in the runs (cite specific tool call sequences)
+2. **Impact** — How many tool calls / seconds / score points this costs
+3. **Fix** — A specific, actionable change to the tool, system prompt, or eval
+   harness
+
+Do NOT use a pre-built checklist of patterns to look for. Derive all findings
+directly from the transcript data for this specific batch. Each recommendation
+should be grounded in something that actually happened in the runs.
+
+## Guidelines
+
+- Be precise with numbers. Always compute relative deltas (not just absolute).
+- Cite specific run IDs and tool call numbers when describing patterns.
+- Focus on actionable insights, not just observations.
+- If the eval only has one scenario, that's fine — analyze it thoroughly. If
+  there are multiple scenarios, analyze each separately and then synthesize.
+- Keep the tone technical and direct. No filler.
diff --git a/AGENTS.md b/AGENTS.md
@@ -11,7 +11,7 @@ schema-agnostic design, and correlation across all telemetry types in one place.
 
 ## Architecture (WHAT)
 
-This is a **monorepo** with five packages:
+This is a **monorepo** with six packages:
 
 - `packages/app` - Next.js frontend (TypeScript, Mantine UI, TanStack Query)
 - `packages/api` - Express backend (Node.js 22+, MongoDB for metadata,
@@ -26,6 +26,10 @@ This is a **monorepo** with five packages:
 - `packages/otel-collector` - Custom-built OpenTelemetry Collector (Go, OCB).
   See its [`README.md`](packages/otel-collector/README.md) for architecture,
   included components, and upgrade procedures.
+- `packages/hdx-eval` - AI eval framework for benchmarking MCP servers against
+  observability scenarios. Generates deterministic synthetic telemetry, spawns
+  agents, and grades with programmatic checks + LLM-as-judge. See its
+  [`README.md`](packages/hdx-eval/README.md) for setup and usage.
 
 **Data flow**: Apps → OpenTelemetry Collector → ClickHouse (telemetry data) /
 MongoDB (configuration/metadata)

diff --git a/knip.json b/knip.json
@@ -26,6 +26,10 @@
     "packages/common-utils": {
       "entry": ["src/**/*.ts", "!src/__tests__/**", "!src/**/*.test.*"],
       "project": ["src/**/*.ts"]
+    },
+    "packages/hdx-eval": {
+      "project": ["src/**/*.ts"],
+      "ignoreDependencies": ["tsx"]
     }
   },
   "ignore": ["scripts/dev-portal/**", ".github/scripts/**"],

diff --git a/packages/hdx-eval/.gitignore b/packages/hdx-eval/.gitignore
@@ -0,0 +1,2 @@
+/runs/
+eval.config.json