Skip to content

feat(app): Query Stats debug drawer#2390

Open
MikeShi42 wants to merge 3 commits into
mainfrom
mikeshi/query-stats-drawer
Open

feat(app): Query Stats debug drawer#2390
MikeShi42 wants to merge 3 commits into
mainfrom
mikeshi/query-stats-drawer

Conversation

@MikeShi42
Copy link
Copy Markdown
Contributor

Summary

Adds an in-app "Query Stats" debug drawer that captures every ClickHouse query the app dispatches and surfaces them in a bottom drawer (opened from the user menu). Built to help engineers and support quickly see what queries a page is actually running, how slow they are, and what they look like with params filled in.

What's in it

  • Instrumented ClickHouse client (InstrumentedClickhouseClient) wraps the existing browser ClickhouseClient, captures every query() call into an in-memory ring buffer (max 200 events), and auto-assigns a query_id when callers don't supply one. Capture is wrapped in defensive try/catch so a broken store can never throw into a production query path.
  • Module-store + useSyncExternalStore for state (no Jotai/Zustand). Stable snapshot reference until a change happens.
  • Drawer UI: per-row status icon, duration colored at 2s / 10s thresholds, SQL preview with params interpolated client-side for readability. Expanded row shows the interpolated SQL (clearly labeled as preview-only), the parameterized SQL as sent to ClickHouse, the params object, query_id, connection, and a one-click "Run EXPLAIN" (EXPLAIN PLAN indexes = 1, returned as TabSeparatedRaw).
  • Filters: "This page only" (reactive on client-side nav via router.asPath) and a toggle to show EXPLAIN events.
  • Transient open state — closed on every page load, opened via the user menu item. Not persisted.
  • Error boundary around the drawer so a panel crash hides the panel rather than the whole app.

Notes for reviewers

  • InstrumentedClickhouseClient is now the default constructor in getClickhouseClient for both local and proxy modes. Behavior is identical to the bare ClickhouseClient aside from instrumentation.
  • The captured event.pathname is window.location.pathname at dispatch time, so events stay attributed to the page that initiated them even if the user navigates before the query resolves.
  • The interpolated SQL is not safe to copy-paste back (no string quoting); the UI labels it accordingly and shows the raw parameterized form alongside.
  • Path filter is by exact URL, not route pattern — /dashboards/abc and /dashboards/def are treated as separate "pages". This is intentional for now; route-pattern grouping is a future iteration.
  • React Query key attribution is not wired up yet (a future iteration could thread QueryFunctionContext.queryKey through QueryInputs).
  • EXPLAIN PLAN indexes = 1 is MergeTree-specific; failures render gracefully in the UI as an error message.

Test plan

  • Open user menu → click "Query Stats" → drawer opens at the bottom
  • Interact with a page (search, chart, dashboard) and verify rows appear in real time with status, duration color, and readable SQL
  • Expand a row, confirm SQL is interpolated and labeled as preview-only, parameterized SQL is shown beneath, params object is rendered, and "Run EXPLAIN" produces formatted plan text
  • Click inside the expanded SQL/Params text — row should not collapse; text selection should work
  • Toggle "This page only" off → events from other pages appear. Navigate to a different page → filter re-applies reactively without needing a new query to fire
  • Trigger an error query (e.g. invalid table) and confirm the row renders red with the error message
  • Reload the page → drawer is closed by default
  • Verify make ci-lint and make ci-unit pass

Captures every ClickHouse query the app dispatches into an in-memory
ring buffer (max 200 events) and surfaces them in a bottom drawer
opened from the user menu. Each row shows status, duration with a
2s/10s color threshold, and SQL with params interpolated client-side
for readability. Expanded view exposes the raw parameterized SQL as
sent to ClickHouse, params, query_id, connection, and a one-click
"Run EXPLAIN". Filter to current-page events (reactive on client-side
nav) and toggle visibility of EXPLAIN events. Capture is wrapped in
defensive error handling and an error boundary so instrumentation
can never break a production query path.
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Jun 1, 2026

🦋 Changeset detected

Latest commit: 8174d06

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/app Minor
@hyperdx/api Minor
@hyperdx/otel-collector Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment Jun 1, 2026 7:00am
hyperdx-storybook Ready Ready Preview, Comment Jun 1, 2026 7:00am

Request Review

@github-actions github-actions Bot added the review/tier-3 Standard — full human review required label Jun 1, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 728 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 7
  • Production lines changed: 728 (+ 243 in test files, excluded from tier calculation)
  • Branch: mikeshi/query-stats-drawer
  • Author: MikeShi42

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

E2E Test Results

1 test failed • 190 passed • 3 skipped • 1178s

Status Count
✅ Passed 190
❌ Failed 1
⚠️ Flaky 4
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

Deep Review

✅ No critical issues found.

No P0/P1 issues after re-grading. The instrumentation path is defensively wrapped, the store guards against SSR, and the drawer is gated behind an error boundary. Several P2 items below are worth addressing before merge — most concentrate on the drawer's render cost and the missing UI test coverage for ~450 new lines of React code.

🟡 P2 -- recommended

  • packages/app/src/queryStats/QueryStatsDrawer.tsx:125 -- QueryRow calls useClickhouseClient() per row, registering an api.useMe() subscription and constructing a fresh InstrumentedClickhouseClient for each of up to 200 rows on every render of the drawer.
    • Fix: Hoist useClickhouseClient() into QueryStatsDrawer (or lazy-instantiate the client inside runExplain) and pass it to QueryRow via prop so the cost is paid once per render, not per row.
    • performance, adversarial
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:127 -- runExplain awaits a network call and then calls setExplainState with no mounted/cancel check, so closing the drawer, clicking Clear, or filtering the row out mid-flight produces a React unmounted-state-update warning and leaves the request in flight.
    • Fix: Bind an AbortController (or mounted ref) to the row, pass abort_signal into the clickhouseClient.query call, and skip the setExplainState write when cancelled.
    • correctness, reliability, frontend-races
  • packages/app/src/components/AppNav/AppNav.tsx:273 -- queryStatsEverOpened latches true permanently, so after the first open the drawer subtree stays mounted with opened={false} and useQueryEvents keeps re-rendering the entire QueryStatsDrawer (with all QueryRow children) on every captured query for the rest of the session.
    • Fix: Either unmount the subtree while queryStatsOpen is false, or short-circuit visible computation and child mapping inside QueryStatsDrawer when opened === false so the store subscription is effectively idle when the panel is hidden.
    • performance, reliability, frontend-races
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:131 -- Clicking Run EXPLAIN on an event whose sql was clipped at MAX_STRING_LEN sends EXPLAIN PLAN indexes = 1 <truncated sql>\n…[truncated] to ClickHouse, which always fails to parse — canExplain only inspects the prefix.
    • Fix: Store an isTruncated flag on the event (or export TRUNCATION_MARK from the store), and disable the Run EXPLAIN button with an explanatory tooltip when the captured SQL was truncated.
    • correctness, adversarial
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:309 -- ~415 lines of new UI logic (filters, route-change listener, expand/collapse, EXPLAIN happy/error paths, empty-state, clearQueryEvents button) ship with zero render-level tests.
    • Fix: Add a React Testing Library spec that covers the pathFilter/showExplain toggles (including routeChangeComplete reactivity), the Clear button, expand→hydrated/parameterized/params blocks, and both branches of runExplain (resolved text vs caught error).
  • packages/app/src/queryStats/QueryStatsErrorBoundary.tsx:22 -- The boundary's documented contract is that changing resetKey clears the error state and re-mounts children; componentDidUpdate is the only escape from hasError and has no test coverage.
    • Fix: Add a spec that mounts with a throwing child, asserts the rendered output is null, changes resetKey, and asserts the (non-throwing replacement) child renders again.
  • packages/app/src/queryStats/__tests__/InstrumentedClickhouseClient.test.ts:12 -- The base-class spy reaches one level up the prototype chain via Object.getPrototypeOf(ClickhouseClient.prototype); if @hyperdx/common-utils ever moves query onto a different rung, the spy silently misses super.query and tests pass without exercising the real path.
    • Fix: Spy on the concrete base class by name (import the upstream parent and spy on its prototype with a structural assertion), or inject a fake transport into InstrumentedClickhouseClient so the test stops depending on prototype-chain shape.
    • testing, kieran-typescript
  • packages/app/src/queryStats/queryStatsStore.ts:30 -- The MAX_STRING_LEN truncation branch (the stated memory-safety guarantee for the ring buffer) is unexercised; tests only feed short strings.
    • Fix: Add unit tests that append/update an event with sql and error longer than MAX_STRING_LEN, then assert the stored value ends with TRUNCATION_MARK and is capped at the expected length.
🔵 P3 nitpicks (19)
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:334 -- explainCount is computed from the unfiltered events array while errorCount uses visible, so the EXPLAINs (N) chip badge does not reflect the active pathFilter.
    • Fix: Derive explainCount from the same events.filter(e => pathFilter ? e.pathname === currentPath : true) predicate used for visible before counting kind === 'explain'.
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:95 -- The catch branch classifies an event as cancelled purely from props.abort_signal?.aborted, so a server-side error that arrives after the caller's signal has aborted gets labeled as a cancellation and hides the real failure in the drawer.
    • Fix: Inspect the rejection itself first (e.g. error instanceof DOMException && error.name === 'AbortError') and only fall back to signal.aborted when the error doesn't carry abort identity.
    • correctness, reliability
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:56 -- The wrapper now always auto-injects a query_id for callers that previously omitted it, so ClickHouse's server-side query-id minting is bypassed across the entire app.
    • Fix: Confirm nothing downstream (audit log routing, distributed tracing, retry-on-READONLY reuse, server-side replace_running_query semantics) depended on the prior behavior; document the contract change in a comment if intentional.
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:30 -- currentPathname() is a generic SSR-safe utility that lives inside the ClickHouse client subclass module purely so the drawer and the capture path import the same reference.
    • Fix: Move currentPathname() into its own module (e.g. queryStats/pathname.ts) or into queryStatsStore.ts, so the drawer no longer reaches into the instrumentation class file for an unrelated util.
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:12 -- The module-level warnedOnce flag silences every subsequent instrumentation failure for the tab's lifetime, even when later failures have a different root cause than the first.
    • Fix: Replace the boolean with a counter that warns on the first failure and at exponentially backed-off intervals after that, or scope the once-flag per call site so each distinct safeWarn message warns once.
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:1 -- The new file is ~415 lines and the project standards cap individual files at ~300.
    • Fix: Extract QueryRow, StatusIcon, and the format/hydrate helpers (formatDuration, durationColor, collapseWhitespace, hydrateSql, canExplain) into sibling files under queryStats/ so each module stays under the documented limit.
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:55 -- Status icons and duration colors hard-code raw Mantine palette tokens (var(--mantine-color-yellow-5), teal-5, gray-5, red-5, teal-4, yellow-4, orange-4, dimmed, default-border) instead of project semantic tokens.
    • Fix: Reroute through the semantic token layer in packages/app/src/theme/semanticColorsGrouped.ts (e.g. text-success, text-warning, text-danger, border, text-muted) so the drawer matches the rest of the product across themes.
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:154 -- The error-row background tint is an inline rgba(255, 99, 99, 0.05) literal.
    • Fix: Replace with a semantic CSS custom property (var(--color-bg-danger-subtle) or equivalent) sourced from the theme so dark/light variants render correctly.
  • packages/app/src/queryStats/queryStatsStore.ts:121 -- __resetQueryStatsForTests and __resetInstrumentationWarnForTests are exported from production modules; the NODE_ENV guard short-circuits at call time but the symbols and bodies still ship in the production bundle.
    • Fix: Move the helpers into a dedicated *.test-utils.ts module excluded from the production build, or wrap the export itself with a build-time NODE_ENV check so dead-code elimination drops them.
    • reliability, kieran-typescript
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:77 -- If the pre-capture try block throws before queryId = queryId ?? generateQueryId() completes, captured stays false and nextProps = props is forwarded to super.query without an auto-injected id, silently degrading the wrapper's "always assigns a query_id" contract.
    • Fix: Compute const queryId = props.queryId ?? generateQueryId(); once outside the try block (and likewise const eventId = generateQueryId();) so the assignment is unconditional and the guarantee holds on every code path.
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:131 -- EXPLAIN PLAN indexes = 1 ${event.sql} does not strip trailing SETTINGS/FORMAT clauses produced by the existing chart query renderer, so EXPLAIN on those captures lands at the parser as a different statement than the user expects.
    • Fix: Wrap the inner SQL in parentheses (EXPLAIN PLAN indexes = 1 (\n${event.sql}\n)) or reuse the existing extractSettingsClauseFromEnd helper to split off the trailing clause before composition.
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:317 -- Capture-side records window.location.pathname synchronously while the drawer's currentPath only advances on routeChangeComplete, so queries dispatched during the in-flight nav window can briefly disappear from "This page only" before reappearing on the next route event.
    • Fix: Subscribe to routeChangeStart as well (or read router.asPath directly at render time), so the filter advances on the same edge the capture side sees.
  • packages/app/src/components/AppNav/AppNav.tsx:540 -- On first toggle of queryStatsOpen, the effect-driven setQueryStatsEverOpened(true) causes the drawer to mount after a first render where it was still unmounted, which may skip Mantine's enter animation since opened is already true at mount time.
    • Fix: Set queryStatsEverOpened synchronously inside the toggle handler (instead of via useEffect) so the first render that flips opened to true already has the drawer mounted with opened transitioning false→true.
  • packages/app/src/queryStats/QueryStatsDrawer.tsx:130 -- runExplain calls clickhouseClient.query without an abort_signal, so a slow EXPLAIN against the cluster blocks the row's button state until the server-side max_execution_time fires, with no user-facing cancel.
    • Fix: Wire an AbortController whose signal is passed into the query call and aborted on row unmount or a "Cancel EXPLAIN" button; reuse the same controller introduced for the unmounted-setState fix above.
  • packages/app/src/queryStats/queryStatsStore.ts:62 -- The !isBrowser() short-circuit paths in appendQueryEvent, updateQueryEvent, and clearQueryEvents are never executed by the existing jsdom-only test suite.
    • Fix: Add a test that deletes global.window (or runs under the node environment block) and asserts the three mutators are no-ops without throwing, locking the SSR-safety contract the wrapper depends on.
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:87 -- Only the appendQueryEvent-throws path is tested; the two inner safeWarn arms covering updateQueryEvent failures on the success and error completion paths are uncovered.
    • Fix: Add specs that mock updateQueryEvent to throw on each branch and assert the wrapper still returns the resolved value or rethrows the original super.query error rather than the secondary store failure.
  • packages/app/src/queryStats/InstrumentedClickhouseClient.ts:37 -- currentPathname() has three declared behaviors (undefined window, normal pathname, throwing accessor) and none are tested.
    • Fix: Add focused unit tests for each branch by stubbing globalThis.window so the SSR-safety contract used by the drawer's filter is locked in.
  • packages/app/src/queryStats/__tests__/InstrumentedClickhouseClient.test.ts:113 -- The abort test only covers the pre-aborted case (controller.abort() before invocation); the realistic mid-flight abort and the "signal supplied but never aborted while super.query rejects" case are not differentiated.
    • Fix: Add a spec that aborts the controller between the super.query call and the rejection, and another that supplies an unfired signal alongside a non-abort rejection, asserting the resulting status field is cancelled vs error correctly.
  • packages/app/src/queryStats/queryStatsStore.ts:115 -- useQueryEvents is the React surface the drawer consumes; the wiring (re-render on append/update/clear, listener cleanup on unmount) has no test.
    • Fix: Add a renderHook test that mounts, appends events, asserts re-renders, unmounts, and asserts the listener Set is drained.

Reviewers (12): correctness, testing, maintainability, project-standards, agent-native, learnings, security, performance, reliability, adversarial, kieran-typescript, frontend-races.

Testing gaps:

  • No render-level coverage for QueryStatsDrawer (filters, route-change reactivity, expand/collapse, Run EXPLAIN happy/error, empty-state).
  • No coverage of QueryStatsErrorBoundary reset-on-resetKey semantics.
  • MAX_STRING_LEN truncation, the isBrowser() SSR guard, and the inner safeWarn paths are unexercised.
  • Only a pre-aborted abort scenario is tested; mid-flight abort and "signal supplied but unfired" branches are not differentiated.

- Separate internal event id from ClickHouse query_id so duplicate
  caller-supplied query_ids no longer conflate events in the store
- Shallow-clone captured query params so post-call mutation by the
  caller cannot rewrite the stored event
- Gate drawer mount until first open so the global event subscription
  doesn't re-render an unused panel app-wide on every captured query
- Truncate captured SQL/error strings at 32KB so chart configs with
  multi-KB SQL can't pin tens of MB in the buffer
- No-op store mutators on the server so SSR renders don't accumulate
  across requests in the same Node process
- Use window.location.pathname as the single source of truth for
  capture-side pathname and drawer-side filter; re-render on Next's
  routeChangeComplete
- Hide Run EXPLAIN for non-SELECT/WITH statements and disable while
  loading to avoid concurrent EXPLAIN races
- Reset error boundary when the drawer is reopened after a crash
- Memoize collapseWhitespace result so multi-KB SQL isn't re-scanned
  twice per row render
- Fix .spin -> .spin-animate so the pending indicator actually spins
- Guard test-only reset exports behind NODE_ENV !== 'production'
- Add tests for event-id/queryId separation, duplicate-queryId
  isolation, and params shallow-clone
@MikeShi42
Copy link
Copy Markdown
Contributor Author

Pushed 8174d06d addressing the deep-review feedback. Summary of what I changed vs deliberately skipped:

Addressed (P2):

  • Drawer only mounts after first open — global useQueryEvents subscription no longer re-renders an unused panel app-wide on every captured query. Stays mounted after first open so close animation still plays. (AppNav.tsx)
  • Internal event.id is now minted independently of queryId, so duplicate caller-supplied query_ids don't conflate events in the store. (InstrumentedClickhouseClient.ts)
  • params are shallow-cloned on capture so post-call mutation by the caller can't rewrite the stored event.
  • appendQueryEvent/updateQueryEvent/clearQueryEvents are no-ops on the server (SSR safety).
  • Captured sql/error truncated to 32KB (with a …[truncated] marker) to bound retained memory.
  • Single source of truth for the captured pathname (currentPathname() is now exported from the wrapper and reused by the drawer filter). Drawer re-renders on router.events.routeChangeComplete, so capture-side and filter-side strings can't drift during nav.

Addressed (P3):

  • className="spin""spin-animate" (project's stylesheet uses the latter).
  • Run EXPLAIN is disabled while loading (was only loading={...}).
  • Run EXPLAIN is hidden for non-SELECT/WITH events (parse-failure would have been the only outcome).
  • Memoized collapseWhitespace(hydratedSql) so multi-KB SQL isn't re-scanned twice per row render.
  • QueryStatsErrorBoundary accepts a resetKey and clears hasError on reopen.
  • __resetQueryStatsForTests / __resetInstrumentationWarnForTests are no-ops under NODE_ENV === 'production'.
  • Added tests for event-id/queryId separation, duplicate-queryId isolation across calls, and params shallow-clone (19 passing).

Skipped — with reasons:

  • Math.randomcrypto.randomUUID: deliberate. crypto.randomUUID() requires a secure context (HTTPS or localhost) — self-hosted HyperDX accessed over plain HTTP on a LAN IP would throw. The other two repo call sites are inside flows that only run in cloud/dev. The instrumentation has to be safe on every deployment.
  • RTL tests for the drawer: agree this is a gap, but it's a larger separate change and I'd rather not stuff it into this PR.
  • Splitting the 400-line file into QueryRow.tsx + format.ts: also separate refactor; happy to do as a follow-up.
  • Three-layer error suppression + warnedOnce rate-limit: the layered guards are intentional (each layer protects a different concern); a rate-limit on the warn dedup is a reasonable follow-up but not blocking.
  • Semantic color tokens (--color-border etc.): would prefer to do that across the codebase in one pass rather than only here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-3 Standard — full human review required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant