Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
7bd6d60
initial commit
brandon-pereira May 20, 2026
d93aa98
stream seeding results
brandon-pereira May 20, 2026
37f9abe
general setup fixes
brandon-pereira May 20, 2026
e07e4c7
general eval improvements
brandon-pereira May 22, 2026
2fdc6d7
improvements to framework
brandon-pereira May 22, 2026
019db69
add eval skill
brandon-pereira May 29, 2026
3256e47
fix: track src/runs/ files and anchor gitignore to root runs/ only
brandon-pereira Jun 1, 2026
a4df2b3
chore: remove ablation scripts, experiment artifacts, and notes
brandon-pereira Jun 1, 2026
b74bf07
refactor: DRY scenario generators and compact rubric format (-772 lines)
brandon-pereira Jun 1, 2026
ac37fe1
feat: MCP-agnostic eval framework with N-way comparison support
brandon-pereira Jun 3, 2026
8701385
chore: fix knip errors and add changeset for hdx-eval
brandon-pereira Jun 3, 2026
c941b24
fix: address code review findings in hdx-eval
brandon-pereira Jun 4, 2026
9ec01b3
fix: code review fixes — severity bug, fetch timeouts, identifier val…
brandon-pereira Jun 4, 2026
a3bd92b
feat(hdx-eval): persist anchorTime in config, default to no-reseed
brandon-pereira Jun 4, 2026
0c1cf5c
feat(hdx-eval): auto-seed on first run when scenario data is missing
brandon-pereira Jun 5, 2026
5337706
chore: fix knip — unexport file-internal helpers in templates.ts
brandon-pereira Jun 5, 2026
51e4257
docs(hdx-eval): use HDX_DEV_SLOT in quick start and dual-slot setup
brandon-pereira Jun 5, 2026
30ecd94
Merge branch 'main' into brandon/ai-evals
brandon-pereira Jun 5, 2026
9c8e626
Merge branch 'main' into brandon/ai-evals
brandon-pereira Jun 5, 2026
adb76ea
Address PR review feedback (#2414)
brandon-pereira Jun 5, 2026
ad7c588
Exclude .timing.json sidecars from listRunFiles (#2414)
brandon-pereira Jun 5, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions .changeset/add-hdx-eval-framework.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
---
'@hyperdx/hdx-eval': minor
---

feat: add AI eval framework for benchmarking MCP servers

New `@hyperdx/hdx-eval` package for benchmarking AI agents against
observability MCP servers. Generates deterministic synthetic telemetry
with planted anomalies, spawns Claude Code as an SRE agent, records full
trajectories, and grades answers using programmatic checks and an
LLM-as-judge.

Includes 5 scenarios (error-root-cause, latency-spike, noisy-signals,
segmented-regression, service-health-check), MCP-agnostic N-way
comparison, blinded judging, and a web viewer for browsing results.
157 changes: 157 additions & 0 deletions .opencode/commands/eval-summary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,157 @@
---
description: Analyze HyperDX eval benchmark results and produce a summary report
---

You are an eval benchmark analyst for HyperDX. Your job is to review eval run
results and produce a structured summary comparing HyperDX MCP performance
against the ClickHouse baseline MCP.

## Input

The user will provide either:

- Raw eval output (pasted terminal logs from running the eval harness)
- A batch directory path under `packages/hdx-eval/runs/`
- A `$ARGUMENTS` referencing a specific batch timestamp (e.g.
`2026-05-21T16-34-35-417Z`)

If the user provides a batch timestamp or directory path, locate the run files
under `packages/hdx-eval/runs/<timestamp>/`. If they paste terminal output,
extract the batch path from the "Results written under" line.

If no batch is specified, list `packages/hdx-eval/runs/` and use the **most
recent** batch directory (sorted by timestamp).

## Analysis Steps

Perform these steps in order. Use the Task tool with parallel subagents where
possible to speed up analysis.

### 1. Read Summary Files

Read `_summary.md` and `_summary.json` from the batch directory. These contain
aggregated metrics across all runs.

### 2. Analyze Run Transcripts

Use **parallel Task subagents** — one per MCP (hyperdx, clickhouse). Each
subagent reads all run JSON files for its MCP
(`<scenario>/<mcp>/0.json`, `1.json`, `2.json`, etc.) and extracts:

- **Tool call sequence** — ordered list of tool names called with index numbers
- **Tool errors** — any failed tool calls, including error messages and index
- **Backtracking patterns** — sequences where the model retried, searched for
unavailable tools, or re-ran queries with different parameters
- **Final answer** — whether the run produced one (final_answer vs max_turns)
- **Wasted tool calls** — calls that didn't contribute to the final answer (e.g.
ToolSearch for tools that don't exist in the sandbox, queries on empty tables,
dead-end hypothesis investigations)
- **Turn budget usage** — how many calls were used vs budget, and whether the
model showed awareness of remaining turns
- **Interesting observations** — anything unusual, surprising, or noteworthy
about the run (clever investigation strategies, unexpected tool behavior,
novel failure modes, etc.)

### 3. Read Grade Files and Ground Truth

Use a **single Task subagent** to read all `*.grade.json` files, the scenario's
`ground-truth.json` from
`packages/hdx-eval/src/scenarios/<scenario>/ground-truth.json`, and the system
prompt at `packages/hdx-eval/src/harness/systemPrompt.ts`. Extract:

- Per-run combined, programmatic, and judge scores
- Per-check pass/fail with weights
- Judge reasoning/rationale per criterion
- Ground truth expected answers, checks, and distractors
- System prompt text and any variant-specific guidance

## Default Output

By default, produce the **Quick Stats** output below. Then prompt the user
asking if they want to see more.

### Quick Stats

Four bullets covering:

- **Accuracy** — combined score comparison with relative percentage
- **Consistency** — completion rate (final_answer vs max_turns) and score
variance across runs
- **Speed** — wall clock, tool calls, and token usage with relative percentages
- **Overall** — one-line headline with the key deltas

Then a **Notable Findings** section: 3-5 short bullets highlighting the most
interesting or surprising things you found in the transcripts. These should be
specific, concrete observations — not generic patterns. Examples:

- "HyperDX Run 1 scored 79.8% despite 2 SQL errors — it recovered in 2 calls"
- "ClickHouse Run 2 was the single best run (89.8%) but the other 2 CH runs
both hit max_turns"
- "The feature_flag distractor tripped HyperDX Run 1 because event_deltas
surfaced it as a top-changing attribute"

Finally, end with:

> Want to dig deeper? I can show:
> - **Detailed stats** — full per-run breakdown, per-check wins, and failure
> analysis
> - **Recommendations** — specific tool/prompt changes to improve HyperDX scores
> - **Both**
### Detailed Stats (on request)

When the user asks for detailed stats, produce:

#### Section 1: Top-Line Stats

A table of metrics where HyperDX beat ClickHouse. For each, compute the
percentage improvement (relative, not absolute). Focus on:

- Combined score delta
- Completion rate (final_answer vs max_turns)
- Programmatic accuracy delta
- Judge criteria where HyperDX scored notably higher
- Tool call efficiency (fewer calls = better)

Format as a markdown table with columns: Metric, HyperDX, ClickHouse, Delta,
Relative.

#### Section 2: Per-Check Wins

A table of the programmatic checks where HyperDX outperformed ClickHouse by the
largest margin. Include both positive checks (identified the right pattern) and
negative checks (avoided false positives). Only include checks with a gap of
20pp or more.

For each check with a large gap, add a brief explanation of **why** the gap
exists.

#### Section 3: Failure Analysis

For any run (either MCP) that scored below 50% combined or hit max_turns,
explain what went wrong. Reference specific tool call sequences and errors.
Contrast with how the other MCP handled the same challenge.

### Recommendations (on request)

When the user asks for recommendations, analyze the run transcripts and produce
concrete improvement opportunities for the HyperDX MCP tools or system prompt.
For each opportunity:

1. **Problem** — What happened in the runs (cite specific tool call sequences)
2. **Impact** — How many tool calls / seconds / score points this costs
3. **Fix** — A specific, actionable change to the tool, system prompt, or eval
harness

Do NOT use a pre-built checklist of patterns to look for. Derive all findings
directly from the transcript data for this specific batch. Each recommendation
should be grounded in something that actually happened in the runs.

## Guidelines

- Be precise with numbers. Always compute relative deltas (not just absolute).
- Cite specific run IDs and tool call numbers when describing patterns.
- Focus on actionable insights, not just observations.
- If the eval only has one scenario, that's fine — analyze it thoroughly. If
there are multiple scenarios, analyze each separately and then synthesize.
- Keep the tone technical and direct. No filler.
6 changes: 5 additions & 1 deletion AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ schema-agnostic design, and correlation across all telemetry types in one place.

## Architecture (WHAT)

This is a **monorepo** with five packages:
This is a **monorepo** with six packages:

- `packages/app` - Next.js frontend (TypeScript, Mantine UI, TanStack Query)
- `packages/api` - Express backend (Node.js 22+, MongoDB for metadata,
Expand All @@ -26,6 +26,10 @@ This is a **monorepo** with five packages:
- `packages/otel-collector` - Custom-built OpenTelemetry Collector (Go, OCB).
See its [`README.md`](packages/otel-collector/README.md) for architecture,
included components, and upgrade procedures.
- `packages/hdx-eval` - AI eval framework for benchmarking MCP servers against
observability scenarios. Generates deterministic synthetic telemetry, spawns
agents, and grades with programmatic checks + LLM-as-judge. See its
[`README.md`](packages/hdx-eval/README.md) for setup and usage.

**Data flow**: Apps → OpenTelemetry Collector → ClickHouse (telemetry data) /
MongoDB (configuration/metadata)
Expand Down
4 changes: 4 additions & 0 deletions knip.json
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,10 @@
"packages/common-utils": {
"entry": ["src/**/*.ts", "!src/__tests__/**", "!src/**/*.test.*"],
"project": ["src/**/*.ts"]
},
"packages/hdx-eval": {
"project": ["src/**/*.ts"],
"ignoreDependencies": ["tsx"]
}
},
"ignore": ["scripts/dev-portal/**", ".github/scripts/**"],
Expand Down
2 changes: 2 additions & 0 deletions packages/hdx-eval/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
/runs/
eval.config.json
Loading
Loading