Skip to content

test(e2e): stress native watcher with safeTime canary#1369

Draft
stormslowly wants to merge 55 commits into
web-infra-dev:mainfrom
stormslowly:e2e/native-watcher-safetime-stress
Draft

test(e2e): stress native watcher with safeTime canary#1369
stormslowly wants to merge 55 commits into
web-infra-dev:mainfrom
stormslowly:e2e/native-watcher-safetime-stress

Conversation

@stormslowly

Copy link
Copy Markdown

Why

Stress-test the rspack native file watcher carrying the ported watchpack safeTime / mtime-accuracy mechanism (web-infra-dev/rspack#14267) under rstest's e2e watch matrix.

Before that PR the native watcher returned empty fileTimeInfoEntries / contextTimeInfoEntries; it now produces the full { safeTime, timestamp } tables. This run validates the new path holds up under the macОS/Windows watch stress matrix.

What

Builds on #1261's harness (enables experiments.nativeWatcher, aggregateTimeout: 100, macOS fresh-fixture symlink, the STRESS_* CI matrix). The only delta in this PR is repointing the @rspack/core override to the rspack canary that includes the safeTime change:

@rspack-canary/core@2.0.7-canary-a310bbfa-20260603233709

Draft — stress validation only, do not merge.

…acOS

On macOS the FSEvents subsystem keeps a per-path ring buffer that can
replay history events to watcher streams attached later. When
prepareFixtures repeatedly cleans and re-copies into the same target
directory, file watchers (the rstest config watcher, and the rspack
NativeWatcher introduced in web-infra-dev#518) can observe a Delete+Create batch on
the same path, yielding spurious change events. With nativeWatcher
enabled this manifests as a flaky "restarting Rstest as
rstest.config.mts changed" that erases the CLI output the e2e asserts on.

Per-call unique directories produce a path that has no FSEvents history,
so only Create events for the freshly copied fixture are delivered. A
symlink at the hard-coded fixturesTargetPath keeps the call sites in
test code unchanged. Linux/Windows paths fall through to the original
behaviour.
Mirrors the one-line change from web-infra-dev#518 so the e2e matrix on this branch
exercises the nativeWatcher code path while validating the fresh fixture
directory strategy. Used to confirm the prepareFixtures change in the
prior commit eliminates the flaky "restarting Rstest as
rstest.config.mts changed" failure under nativeWatcher.
…tcher

Under chokidar the 10 ms value was a safe watchpack event-merging window.
With rspack experiments.nativeWatcher enabled in the previous commit, the
same 10 ms becomes the gap between the FSEvent arriving and rspack's
stat/read of the changed file. On macOS that gap is shorter than vnode
cache invalidation latency, so rspack reads stale content, produces
identical chunk hashes, and rstest then logs "No test files need re-run".

Local sweep (5 runs per value, vanilla notify-rs latency=0.0, fresh
fixture dir per call):

  aggregateTimeout    Pass    "No test files need re-run" lines
  10                  0/5     36
  100                 5/5     0
  200                 5/5     0
  500                 5/5     0
  1000                5/5     0

100 ms is enough; no rspack or notify-rs changes are needed.
100 ms was enough on local M-series macs (5/5 PASS) but the
GitHub Actions macos-14 runner still triggered mode B
("No test files need re-run" -> stale 'Hello, INDEX!') in every
e2e (macos-14, node 20/24) run on the prior commit.

The Actions ARM64 runner appears to have a noticeably slower
vnode cache invalidation than local hardware, so we need a bigger
margin. 500 ms matches the upper half of the local sweep band that
was 5/5 PASS (200 / 500 / 1000 all passed) and is 2.5x the
watchpack default of 200 ms. Hot reload latency stays well under
human perception.
Temporary investigation commit. CI macos-14 still hits mode B with
fixture aggregateTimeout=500 even though the local sweep with the
same value was 5/5 PASS. Need to confirm what value the child rstest
process ACTUALLY receives (fixture cwd lookup, rsbuild config merge
order, nativeWatcher flag) before deciding the next move.

Will be reverted after one CI run.
The GitHub Actions macos-14 arm64 runner is roughly 6x slower than the
local M-series Mac on this watch flow (each fs.update -> rerun cycle is
2-3 s in CI vs ~300 ms locally). aggregateTimeout=500 was still hitting
mode B 100% of the time on CI, indicating that the vnode cache
invalidation on the Actions runner has not finished by the time rspack
performs stat/read.

Going to 3000 ms gives the runner ~6x more wall-clock budget than the
500 value that already worked locally with full headroom. Last attempt
in the timeout-only direction; if this still fails the root cause is
not the aggregate window and we will revisit.
…xtures

Temporary investigation. CI macos-14 e2e fails with mode B at every
aggregateTimeout value (100/500/3000), but identical local runs pass.
Suspecting that the cp() into the fresh sibling directory does not
include rstest.config.mts under some Actions runner condition, so the
child rstest falls back to the entry.ts default (5 ms). Print the
freshly copied config content so the CI log shows the actual file
seen by the child process.

Will be reverted.
Cannot reproduce the macos-14 + node 24 watch e2e failure locally
on the same Node version (24.15.0), same fixture, same binding,
same concurrency. Last unknown is the GitHub Actions macos-14
runner's actual hardware/OS state. Open a tmate session on
failure, limited to that one matrix slot, so we can inspect:

  - runner CPU/memory pressure
  - the fresh-dir contents after fs.update
  - whether rspack input fs cache holds the old content
  - whether vnode invalidation lag is observable

Will be reverted after the investigation.
…lure

Pin @rspack/core to a canary build of rspack worktree-rstest-native-watcher
(commit 27bc1e20) so the binding writes RCA trace lines into
/tmp/rca-binding-trace.log on the runner:
  - NativeWatcher::new aggregateTimeout/poll/symlink
  - JsEventHandler::on_event_handle aggregate changedFiles/deletedFiles
  - JsEventHandlerUndelayed::on_change every undelayed path

Replace the prior tmate step with a `cat /tmp/rca-binding-trace.log`
step on failure so the dump shows up directly in the macos-14 job log
without needing an interactive SSH.

Will be reverted after the investigation; not for merge.
The previous bump chain (10 -> 100 -> 500 -> 3000) was a workaround
for the rspack_watcher mtime-gate race that has now been fixed in the
canary (efffb233). With the race gone, 3000ms was unnecessary, but
falling back to the original 10ms was still too aggressive: rspack
fires stat/read inside the macOS vnode-cache settling window and reads
stale content. 100ms is the smallest value that survives macOS-14
node20+node24 CI runs.
…e overrides

rstest used to default rspack's `watchOptions.aggregateTimeout` to 5ms
(see Watching.ts comment). That was safe under the chokidar watcher
where the aggregate is just a watchpack-level event-merging window,
but is too aggressive under the rspack native watcher on macOS: the
runtime stats the changed file ~5ms after the FSEvent arrives, well
inside the vnode-cache settling window, and reads stale content. The
resulting "No test files need re-run" is the failure mode addressed by
this PR.

Set the default to 100ms in the rstest plugin so every watch project
inherits a sane value without each fixture wiring its own override.
Drop the per-fixture `tools.rspack.watchOptions.aggregateTimeout: 10`
that was added long ago (chokidar era). Individual cases that need a
different value can still set it via `tools.rspack.watchOptions` and
the user value wins through rsbuild's regular merge order.
…ixtures

Removes the temporary investigation-only steps now that the underlying
rspack_watcher mtime-gate race is fixed upstream and the rstest fix is
covered by regular CI:

  - .github/workflows/test.yml: "Debug — show resolved @rspack/* versions"
    and "Dump rspack binding RCA trace" steps
  - e2e/scripts/index.ts: __RCA_FIXTURE__ stdout dump in prepareFixtures
…fix)

Switch from canary efffb233 (stateless arrival-vs-mtime) to canary
96980782 (watchpack-style incremental baseline). The new approach
keeps the per-file mtime baseline intact across rspack rebuild
cycles, which restores the original "filter stale FSEvent replays"
property while still avoiding the baseline-snapshot-after-user-write
race that previously caused mode B on macos-14 + node 24.
Picks up the final fix for the FSEvents mtime baseline race + the
Option 2 perf follow-up (record_initial_file_mtimes now only stats
files added in this watch cycle, not the full registered set).

Tracks web-infra-dev/rspack#14077.
Picks up the latest fix/fsevents-mtime-race tip (now includes
main-merge 9b248702). Local prepare/build currently panics with
"attempted to get vtable for an unregistered impl" in
rspack_cacheable::dyn — unrelated to the watcher fix and looks
like a regression introduced by the main-merge in this canary.
Pushing to surface the panic in CI as a public artifact for a
separate bug report.
…re-dir-per-call

# Conflicts:
#	pnpm-lock.yaml
…re-dir-per-call

# Conflicts:
#	package.json
#	pnpm-lock.yaml
Stress matrix should now verify the 2% residual flake fix on macOS:
late-arriving events that previously sat in files_data forever should
now be drained by the post-handler re-flush.
Stress matrix should now verify whether the safeTime backfill mechanism
defeats Race 3 (file registration timing) by replaying events that
landed between rspack's done hook and the process.nextTick rewatch.
Verifies whether the dir-event rescan defeats FSEvents partial-delivery
(where only the parent-dir Change arrives but not the file Change).
Also broadens trace dump filter to capture fixture-root-level raw events
so we can verify or disprove the hypothesis post-stress.
…ll registered)

Verifies the watchpack-equivalent extensions: dir-event rescan now walks
actual dir contents to catch unregistered files, and safeTime backfill
covers all registered files (not just newly added). Targets the residual
dynamic-import flake.
…sive walk)

The 17ee60b7 canary added recursive dir walking + safeTime backfill over
all registered files, which regressed stress matrix from 79/80 to 74/80
because the recursive walk picked up dist/.rstest-temp/* outputs and
the extended backfill double-fired on every rebuild (safe_time ==
start_time). Reverting to the proven a85437e7 version to re-confirm
the 79/80 baseline before submitting watcher PR.
Verifies that filtering ignored paths (dist/.rstest-temp/*) at the
trigger's on_event entry eliminates the [root]-only noise js_event_handle
batches that were causing waitForStdout('Duration') to capture stray
build-output rebuilds.
Control experiment: run the macos-14 stress matrix against watchpack
(the JS-based webpack/rspack default) instead of rspack's nativeWatcher
to compare pass rates. This will tell us whether the residual flake is
a watcher-implementation issue (watchpack should pass 100%) or a deeper
race in rspack-core / rstest test spec.

Revert this before any production change.
Need full visibility into what's producing [root]-only js_event_handle
flushes that bypass the ignored filter. Removing the \.rstest-temp
exclusion from the dump filter so we can see raw dispatched events for
build-output paths and confirm whether the ignored filter is actually
suppressing them.
The trace dump filter change (previous commit) is meant to capture what
nativeWatcher dispatches when the watchpack baseline was 100%. Reverting
the watchpack switch so we get nativeWatcher trace output to compare.
Last run hit 'Binary file matches' because grep auto-detected non-text
bytes (likely from path display in some events) and refused to print
matches. Force text mode with -a so the dump is actually usable for
diagnosis.
Events are reported via the symlink (fixtures-test-0-module), not the
realpath (fixtures-test-0-module-r1234-N-xxxx) thanks to the realpath
translation in the watcher. Drop the -r[0-9]+ suffix requirement from
the dump filter.
Critical fix: previously the ignored glob "**/dist/.rstest-temp" only
matched the directory itself, so every write to dist/.rstest-temp/foo.mjs
escaped the filter and produced a noisy [root]-only js_event_handle that
triggered a fake rebuild during step 2 of the watch e2e tests.

Now the watcher walks the path and every ancestor against the ignored
patterns — files inside an ignored directory match via the ancestor
check. Expectation: near-100% stress pass rate (matching watchpack).
Repoint the @rspack/core override to the rspack canary that ports watchpack's
safeTime / mtime-accuracy mechanism into the native watcher
(web-infra-dev/rspack#14267), so the web-infra-dev#1261 native-watcher stress matrix
exercises the new fileTimeInfoEntries / contextTimeInfoEntries path.
@stormslowly

Copy link
Copy Markdown
Author

Stress result — @rspack-canary/core@2.0.7-canary-a310bbfa-20260603233709

The native watcher carrying the ported watchpack safeTime / mtime-accuracy mechanism (web-infra-dev/rspack#14267) holds up under the stress matrix:

  • All stress-restart jobs green across node 22 + node 24 (after flaky reruns).
  • node 24 surfaced 3 flaky failures in the macOS FSEvents watch matrix — (24,12,on), (24,17,on), (24,19,off)all passed on re-run, i.e. flaky tail, not a regression.
  • The only remaining red checks (lint, test-gate) are pre-existing on the perf(core): enable rspack experiments.nativeWatcher #1261 base branch and unrelated to this change (this PR only repoints the @rspack/core override).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant