feat(js): adopt deno_url/web/crypto, enforce navigation deadline#204

Open
marcbachmann wants to merge 3 commits into h4ckf0r0day:main from marcbachmann:deno-web

@marcbachmann
Contributor

This branch replaces several broken / fingerprintable hand-rolled web API polyfills with their Deno counterparts, and adds hard deadline enforcement so a single page-supplied script can no longer hold the tokio executor past --timeout.

Three independent commits, each landable on its own.

Why

Obscura is intentionally lean — 30 MB memory, 70 MB binary, scoped to scraping and agent automation. That's a feature.

But several crates/obscura-js/js/bootstrap.js globals aren't lean — they're broken or fingerprintable:

| API | Problem |
| --- | --- |
| URL | Regex-based; new URL('/foo', window.location) threw TypeError: base.match is not a function for any non-string base. Only handled http(s). |
| Blob/File/FormData | Blob joined parts as strings: new Blob([uint8Array]).text() returned "[object Object]". fetch file uploads silently corrupted. |
| TextEncoder/TextDecoder | JS-only, no ICU. Wrong on non-UTF-8 encodings, surrogate halves, fatal mode, BOM. |
| AbortController/AbortSignal | Stubs. abort() set a flag but never fired the 'abort' event. AbortSignal.timeout() didn't time out. |
| structuredClone | JSON.parse(JSON.stringify(v)). Silently corrupted Date, Map, Set, RegExp, Uint8Array, circular refs. |
| ReadableStream/WritableStream | Stubs. Streaming fetch().body, pipeTo all failed. |
| Event/EventTarget | globalThis.EventTarget = Node collapsed two unrelated prototype chains. |
| atob/btoa | Depended on the broken TextEncoder; wrong on multi-byte. |
| performance | Date.now()-based, no sub-millisecond resolution. |
| crypto.getRandomValues | Backed by Math.random(): detectable bias, no real entropy. |
| crypto.randomUUID | Template-string stub, not CSPRNG-backed. |

Each broken implementation is also a fingerprinting tell. Any bot-detection service that probes new Blob([new Uint8Array([1,2,3])]).text() or checks structuredClone(new Date()) instanceof Date flags us as non-Chrome. deno_web / deno_url / deno_crypto are V8/ICU-native and match Chrome's behavior byte-for-byte — same direction as the existing stealth features.

What changes

1. feat(js): adopt deno_url for WHATWG-compliant URL implementation

Drops the regex-based URL polyfill. deno_url parses with the Rust url crate, which implements the WHATWG URL Standard that Chrome also follows, so it handles non-http(s) schemes and accepts non-string base arguments per spec.

2. feat(js): adopt deno_web + deno_crypto, drop remaining broken polyfills

All-or-nothing: Blob, FormData, and Response import each other internally, as do EventTarget and AbortSignal, so they cannot be cherry-picked. deno_crypto is folded into this commit because it's standalone (~150 KB) and shares the same V8/extension wiring path.

Includes regression tests (mod tests in runtime.rs) pinning the behavioral contract that the polyfills broke — Blob.text() round-trips bytes, TextEncoder handles surrogate pairs, structuredClone preserves Date/Uint8Array, AbortSignal actually fires 'abort', crypto.getRandomValues produces real entropy.

3. fix(browser,js): enforce navigation deadline through synchronous V8 execution

tokio::time::timeout only fires at .await points. Any synchronous V8 work — script evaluation, module top-level code — holds the tokio executor thread for its entire duration, so pages with heavy synchronous scripts could run arbitrarily past --timeout with no way to stop them.

Adds navigation_deadline: Option<Instant> to Page and threads it through every execution phase:

  • Script execution — each call receives the remaining budget as a hard ceiling. A watchdog thread fires terminate_execution() in a 10 ms loop once the budget is exhausted so that scripts with try-catch error-recovery handlers (which absorb a single termination) are still forced to stop.
  • Network fetches — each parallel script fetch is wrapped in tokio::time::timeout(remaining_budget, ...). A slow CDN response can no longer exhaust the deadline by itself.
  • ES modules: terminate_execution is catchable by JavaScript try-catch. A threshold guard skips any module whose remaining budget is too small for it to be stopped reliably, rather than starting work that cannot be terminated.
  • Load-events / event-loop drain — capped at the remaining budget, checking the deadline on every iteration.
  • Error surface — new PageError::NavigationTimedOut variant. The CLI's outer tokio::time::timeout could only fire once V8 yielded; the deadline is now enforced at the point of each V8 call.
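
The "remaining budget" that every phase above receives can be derived from the single deadline field. A minimal sketch, assuming a stripped-down stand-in for `Page` (only `navigation_deadline` comes from this PR; the method names `remaining_budget` and `deadline_passed` are hypothetical):

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real `Page`; only the deadline field is modeled.
struct Page {
    navigation_deadline: Option<Instant>,
}

impl Page {
    /// Time left before the deadline, measured at `now`.
    /// `None` means no deadline was set; `Some(Duration::ZERO)` means it has passed.
    fn remaining_budget(&self, now: Instant) -> Option<Duration> {
        self.navigation_deadline
            .map(|deadline| deadline.saturating_duration_since(now))
    }

    /// True once a deadline exists and has been reached.
    fn deadline_passed(&self, now: Instant) -> bool {
        matches!(self.remaining_budget(now), Some(left) if left.is_zero())
    }
}

fn main() {
    let start = Instant::now();
    let page = Page {
        navigation_deadline: Some(start + Duration::from_secs(30)),
    };
    // Twelve seconds into the navigation, each phase would be handed what is left.
    let later = start + Duration::from_secs(12);
    println!("remaining: {:?}", page.remaining_budget(later));
    println!("passed: {}", page.deadline_passed(later));
}
```

Passing `now` in explicitly keeps the helper deterministic in tests; the saturating subtraction avoids a panic when a phase starts after the deadline.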

Also switches eval_module_with_timeout from run_event_loop to with_event_loop_promise: the former waits for all pending work in the runtime to drain (blocking indefinitely on a page with a live setInterval), while the latter resolves as soon as the module's top-level evaluation completes.

Out of scope (intentional)

  • fetch and XMLHttpRequest stay hand-rolled. They route through op_fetch_url, which enforces SSRF policy, blocked-URL patterns, the cookie jar, and the CDP Fetch domain's request interception. Replacing with deno_fetch would bypass the security model.
  • setTimeout / setInterval now come from deno_web. The hand-rolled timers in bootstrap.js are removed.
  • The existing URL.createObjectURL override is unchanged in shape — it still routes through the embedder-side BlobUrlStore.

Cost

  • Snapshot grows from 1.15 MB to 2.27 MB across the first two commits (+320 KB for deno_url, +800 KB for deno_web and deno_crypto). Binary still well under the 70 MB target.
  • ~30 LoC of Rust glue in crates/obscura-js/src/deno_extensions.rs (BlobUrlStore + TimersPermission).
  • Version pin matches the current deno_core 0.350.

Verification

  • cargo test --features stealth --workspace: 253 passed, 0 failed.
  • cargo build --release --features stealth: clean (only the pre-existing RpcResponse visibility warning in obscura-mcp, unrelated).
  • Regression tests for each replaced polyfill in crates/obscura-js/src/runtime.rs::tests lock the contract in place.

Replaces the regex-based URL polyfill in bootstrap.js with deno_url
0.207, plus deno_webidl 0.207 and deno_console 0.207 as required peer
deps. deno_console is an inert library dep (deno_url customInspect
imports createFilteredInspectProxy); it does not replace the existing
op_console_msg-backed globalThis.console.

URL, URLSearchParams, and URLPattern now come from the same Rust url
crate already used by op_fetch_url SSRF validation in ops.rs, so
JS-side and Rust-side parsing agree byte-for-byte. The polyfill at
bootstrap.js:1549 silently mishandled non-http schemes, percent-
encoding, IDN, and IPv6 hosts.

Extension wiring lives in crates/obscura-js/src/deno_extensions.rs and
is shared between the snapshot build and the runtime, so adding more
deno extensions later only touches one file.

Snapshot size: 1.15 MB to 1.47 MB.

Implements the plan in ISSUE_deno_web.md. Removes ~250 LoC of hand-rolled
polyfills in favour of deno_web 0.238 + deno_crypto 0.221 (V8/ICU-native,
Chrome-equivalent). Replaced: Blob, File, FileReader, TextEncoder,
TextDecoder, Event, CustomEvent, MessageEvent, ErrorEvent, EventTarget,
AbortController, AbortSignal, structuredClone, performance, atob, btoa,
crypto (getRandomValues + randomUUID + subtle), ReadableStream,
WritableStream, TransformStream, MessageChannel, MessagePort, DOMException,
CompressionStream, plus the Streams' controllers/readers/writers and
queuing strategies.

Why this matters for a scraping browser:

- Stealth: crypto.getRandomValues was Math.random()-backed. A real CSPRNG
  removes a glaring fingerprinting tell and lets pages do OAuth PKCE,
  WebAuthn, JWT verification properly. Blob, structuredClone, TextEncoder
  etc. now behave byte-for-byte like Chrome.
- Correctness: new Blob([uint8Array]).text() returned '[object Object]';
  structuredClone(new Date()) returned a string; AbortController.abort()
  set a flag but never fired the event. All fixed.

Out of scope (intentional):

- fetch and XMLHttpRequest stay hand-rolled - they route through
  op_fetch_url which enforces SSRF policy, blocked-URL patterns, cookie
  jar, and the CDP Fetch domain interception. deno_fetch would bypass
  all of that.
- setTimeout / setInterval stay hand-rolled (microtask fast-fake) for
  the README's 51-85 ms page-load target. deno_web's 02_timers.js is
  loaded transitively for AbortSignal.timeout / performance / FileReader,
  but its global setTimeout/setInterval are deliberately NOT exposed.
- FormData stays a small JS stub (FormData lives in deno_fetch).
- EventTarget no longer aliases Node; DOM nodes do not extend the
  native EventTarget. AbortSignal and friends now correctly satisfy
  'instanceof EventTarget' for the first time.

Three DOM-side compatibility patches:

- bootstrap.js Node.dispatchEvent uses Object.defineProperty to assign
  target/currentTarget because deno_web's Event has getter-only accessors.
- performance.timeOrigin is left to deno_web's StartTime (also a getter);
  the old polyfill assigned it directly per-runtime.
- CustomEvent.initCustomEvent re-attached on the prototype as a small
  polyfill (deno_web ships only the modern API; some legacy bundles still
  call createEvent + initCustomEvent - see issue h4ckf0r0day#41).

Tests (7 new, in runtime.rs#mod tests):

- test_blob_preserves_binary_data
- test_text_encoder_handles_surrogate_pair
- test_structured_clone_preserves_date_and_typed_array
- test_abort_controller_fires_abort_event
- test_crypto_get_random_values_has_entropy
- test_crypto_random_uuid_is_v4_format
- test_btoa_handles_non_ascii_via_textencoder

Snapshot size: 1.47 MB -> 2.27 MB (+800 KB), driven by deno_web's 18 ESM
modules. Binary still well under the README's 70 MB target.

…xecution

`tokio::time::timeout` only fires at `.await` points. Any synchronous V8
work — script evaluation, module top-level code — holds the tokio executor
thread for its entire duration, making the outer async timeout invisible to
scripts that never yield. A page with heavy synchronous scripts could run
arbitrarily past `--timeout` with no way to interrupt it.

Fix: add `navigation_deadline: Option<Instant>` to `Page` and thread it
through every execution phase.

**Script phase** — each `execute_script_with_timeout` call receives the
remaining budget as its hard ceiling. When the budget expires a watchdog
thread fires `terminate_execution()` in a tight loop (every 10 ms) so that
scripts with `try-catch` error-recovery handlers are still eventually
terminated rather than absorbing a single termination call and continuing.
Scripts are also skipped entirely once the deadline has passed, cutting the
iteration short rather than starting work we know will be cancelled.
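
The repeated-termination idea can be sketched with the standard library alone. This is an illustration under assumptions, not the code in this commit: the `terminate` closure stands in for V8's `terminate_execution()`, and `spawn_watchdog` and `run_stubborn_script` are hypothetical names. The fake script ignores the first two termination requests, which is the situation a single call could not handle.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Waits for `budget`, then calls `terminate` every `interval` until `done` is set.
fn spawn_watchdog<F>(
    budget: Duration,
    interval: Duration,
    done: Arc<AtomicBool>,
    terminate: F,
) -> thread::JoinHandle<()>
where
    F: Fn() + Send + 'static,
{
    thread::spawn(move || {
        let deadline = Instant::now() + budget;
        // Sleep in short slices so the thread exits promptly if the script finishes early.
        while Instant::now() < deadline {
            if done.load(Ordering::Acquire) {
                return;
            }
            thread::sleep(interval.min(deadline.saturating_duration_since(Instant::now())));
        }
        // Budget exhausted: keep re-firing until the script has actually stopped.
        while !done.load(Ordering::Acquire) {
            terminate();
            thread::sleep(interval);
        }
    })
}

/// Runs a fake "script" that only stops after three termination requests
/// and returns how many times the watchdog fired.
fn run_stubborn_script() -> usize {
    let done = Arc::new(AtomicBool::new(false));
    let fired = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&fired);
    let watchdog = spawn_watchdog(
        Duration::from_millis(20),
        Duration::from_millis(10),
        Arc::clone(&done),
        move || {
            counter.fetch_add(1, Ordering::SeqCst);
        },
    );
    while fired.load(Ordering::SeqCst) < 3 {
        thread::sleep(Duration::from_millis(1));
    }
    done.store(true, Ordering::Release);
    watchdog.join().unwrap();
    fired.load(Ordering::SeqCst)
}

fn main() {
    println!("terminate fired {} times", run_stubborn_script());
}
```

The `done` flag matters as much as the loop: without it the watchdog would keep firing into whatever runs next on the same isolate.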

**Network fetch phase** — each parallel script fetch is wrapped in
`tokio::time::timeout(remaining_budget, ...)` so a slow CDN response cannot
by itself exhaust the navigation deadline; fetch failures are treated as
absent scripts rather than errors.
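
The "timeout or failure means absent script" rule can be shown without an async runtime. The sketch below swaps `tokio::time::timeout` for a helper thread plus `mpsc::Receiver::recv_timeout`, purely so it runs with the standard library; `fetch_with_budget` and the closures passed to it are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `fetch` on a helper thread and waits at most `remaining_budget` for it.
/// A timeout and a fetch error both yield `None`: the script is treated as absent.
fn fetch_with_budget<F>(remaining_budget: Duration, fetch: F) -> Option<String>
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the wait timed out; ignore that error.
        let _ = tx.send(fetch());
    });
    match rx.recv_timeout(remaining_budget) {
        Ok(Ok(body)) => Some(body),
        Ok(Err(_)) | Err(_) => None,
    }
}

fn main() {
    let fast = fetch_with_budget(Duration::from_millis(200), || {
        Ok("console.log(1)".to_string())
    });
    let slow = fetch_with_budget(Duration::from_millis(20), || {
        thread::sleep(Duration::from_millis(300)); // simulated slow CDN
        Ok("never used".to_string())
    });
    println!("fast = {:?}, slow = {:?}", fast, slow);
}
```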

**ES module phase** — V8's `terminate_execution` is catchable by JavaScript
`try-catch`, and heavy modules with error-recovery paths run *longer* when
disturbed than when left to complete naturally. A threshold guard skips any
module when the remaining budget is below 15 s; the module would outlast the
deadline regardless, so it is better to skip it cleanly than to start work
that cannot be reliably stopped.
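
The threshold guard reduces to one comparison. A minimal sketch, assuming the 15 s figure above; `ModuleDecision` and `decide_module` are hypothetical names, and treating "no deadline" as "always run" is an assumption rather than something this commit states.

```rust
use std::time::Duration;

/// Threshold taken from the description above: modules are skipped below 15 s.
const MODULE_MIN_BUDGET: Duration = Duration::from_secs(15);

#[derive(Debug, PartialEq)]
enum ModuleDecision {
    /// Evaluate the module, bounded by `budget` when a deadline exists.
    Run { budget: Option<Duration> },
    /// Not enough time left to stop the module reliably; do not start it.
    Skip,
}

fn decide_module(remaining: Option<Duration>) -> ModuleDecision {
    match remaining {
        Some(left) if left < MODULE_MIN_BUDGET => ModuleDecision::Skip,
        // Assumption: without a deadline the module runs unbounded.
        other => ModuleDecision::Run { budget: other },
    }
}

fn main() {
    for secs in [40, 15, 14, 0] {
        let decision = decide_module(Some(Duration::from_secs(secs)));
        println!("{secs:>2} s left: {decision:?}");
    }
    println!("no deadline: {:?}", decide_module(None));
}
```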

**Load-events / event-loop drain** — the DOMContentLoaded + load dispatch is
capped at the remaining budget (min 50 ms to allow basic event handling).
The idle event-loop drain is capped at min(500 ms, remaining) and now checks
the deadline on *every* iteration, not only in the timeout branch.
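
The two caps and the per-iteration check might look like the sketch below. It reads "min 50 ms" as a floor on the load-event budget, which is an interpretation of the sentence above; the function names and the `tick` closure (returning `true` while work is still pending) are hypothetical stand-ins for polling the real event loop.

```rust
use std::time::{Duration, Instant};

const LOAD_EVENT_FLOOR: Duration = Duration::from_millis(50);
const IDLE_DRAIN_CAP: Duration = Duration::from_millis(500);

/// Budget for DOMContentLoaded + load: the remaining time, but never below 50 ms.
fn load_event_budget(remaining: Duration) -> Duration {
    remaining.max(LOAD_EVENT_FLOOR)
}

/// Budget for the idle drain: min(500 ms, remaining).
fn idle_drain_budget(remaining: Duration) -> Duration {
    remaining.min(IDLE_DRAIN_CAP)
}

/// Calls `tick` until it reports no pending work, the drain budget is spent,
/// or the deadline passes. Returns the number of ticks performed.
fn drain<F: FnMut() -> bool>(deadline: Instant, mut tick: F) -> usize {
    let started = Instant::now();
    let budget = idle_drain_budget(deadline.saturating_duration_since(started));
    let mut iterations = 0;
    loop {
        let now = Instant::now();
        // Checked before every tick, not only when a poll times out.
        if now >= deadline || now.duration_since(started) >= budget {
            break;
        }
        iterations += 1;
        if !tick() {
            break;
        }
    }
    iterations
}

fn main() {
    let mut pending = 3;
    let ticks = drain(Instant::now() + Duration::from_secs(1), || {
        pending -= 1;
        pending > 0
    });
    println!("drained after {ticks} ticks");
}
```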

**Error surface** — a new `PageError::NavigationTimedOut` variant is
returned when `execute_scripts` exits because the deadline was reached,
letting the CLI distinguish a timeout from a genuine navigation failure and
produce an accurate "Timed out after Ns" message rather than silently
returning a partially-rendered page.
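
How a caller could tell the two outcomes apart is sketched below. Only the variant name `NavigationTimedOut` and the "Timed out after Ns" wording come from the text above; the `after` field, the `NavigationFailed` placeholder variant, and the `report` helper are assumptions made for illustration.

```rust
use std::fmt;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PageError {
    /// Variant named in the description; the `after` field is an assumption.
    NavigationTimedOut { after: Duration },
    /// Hypothetical placeholder for every other navigation failure.
    NavigationFailed(String),
}

impl fmt::Display for PageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PageError::NavigationTimedOut { after } => {
                write!(f, "Timed out after {}s", after.as_secs())
            }
            PageError::NavigationFailed(reason) => write!(f, "Navigation failed: {reason}"),
        }
    }
}

impl std::error::Error for PageError {}

/// What a CLI might print for a finished navigation.
fn report(result: Result<String, PageError>) -> String {
    match result {
        Ok(html) => format!("ok ({} bytes)", html.len()),
        Err(err) => err.to_string(),
    }
}

fn main() {
    let timed_out = Err(PageError::NavigationTimedOut {
        after: Duration::from_secs(30),
    });
    println!("{}", report(timed_out));
    println!("{}", report(Err(PageError::NavigationFailed("DNS error".into()))));
    println!("{}", report(Ok("<html></html>".into())));
}
```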

Also switches `eval_module_with_timeout` from `run_event_loop` to
`with_event_loop_promise`: the former waits for *all* pending work in the
runtime to drain (blocking forever on a page with a live `setInterval`),
while the latter resolves as soon as the module's top-level evaluation
completes.