Skip to content

perf(cell): use DoublyLinkedList for pendingMessageIds with list node on metadata#27415

Draft
anthony-murphy wants to merge 4 commits into
microsoft:mainfrom
anthony-murphy:prep-cell
Draft

perf(cell): use DoublyLinkedList for pendingMessageIds with list node on metadata#27415
anthony-murphy wants to merge 4 commits into
microsoft:mainfrom
anthony-murphy:prep-cell

Conversation

@anthony-murphy
Copy link
Copy Markdown
Contributor

Description

Replaces SharedCell's pendingMessageIds: number[] with a DoublyLinkedList<number> from @fluidframework/core-utils/internal, and extends ICellLocalOpMetadata to carry the inserted list node.

  • Submit: push() returns a ListNodeRange; the metadata records the inserted node as pendingNode.
  • ACK: pendingMessageIds.first?.data head check, then shift() — both O(1) instead of Array.shift()'s O(n) re-index.
  • Rollback: pop() returns the trailing node; compare node.data to the rolled-back message id.

Why

Two wins, both independent of squash:

  1. Array.shift() is O(n) — every ACK in SharedCell re-indexes the entire pending tail today. DoublyLinkedList.shift() is O(1).
  2. Carrying the node reference on localOpMetadata enables any future per-op operation (rollback, arbitrary-position drop, ...) to be O(1) via list.remove(node) instead of an indexOf + splice scan.

Notes

Pure structural prep — no reSubmitSquashed, no squash logic, no tests touched, no api-report changes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (104 lines, 2 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Comment thread packages/dds/cell/src/interfaces.ts Outdated
/**
* The node in the pending message list corresponding to this operation (op).
* Holding a direct reference to the node enables O(1) removal from arbitrary
* positions in the pending list, which is required for future squash support.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: ListNode<number> leaks onto the exported ICellLocalOpMetadata here, and the new import type { ListNode } from "@fluidframework/core-utils/internal" at the top of this file couples the cell package's interfaces module to a linked-list node type that is purely a runtime implementation detail.

Containment check: ICellLocalOpMetadata is not re-exported from packages/dds/cell/src/index.ts and the only literal construction is updated in cell.ts:306-317, so this is not a public-API break today. The cost is encapsulation and maintainability — embedding ListNode<number> on a package-shaped interface makes the metadata shape harder to mock or re-implement out of tree, and your own PR self-summary flagged this same risk. PR #12273's reviewer (scarlettjlee) historically pushed back on widening this exact interface with implementation details.

Suggested fix: declare a private ICellPendingLocalOpMetadata extends ICellLocalOpMetadata inside cell.ts and use that as the concrete metadata type at the three creation/consumption sites — interfaces.ts then doesn't need the core-utils/internal import. If pendingNode must remain on the exported type for a reason not visible in the diff, mark it optional and tag it @internal. The O(1) ACK and the future O(1) arbitrary-removal capability are preserved either way.

Comment thread packages/dds/cell/src/cell.ts Outdated
): ICellLocalOpMetadata {
const pendingMessageId = ++this.messageId;
this.pendingMessageIds.push(pendingMessageId);
const { first: pendingNode } = this.pendingMessageIds.push(pendingMessageId);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: DoublyLinkedList.push(...items) returns a ListNodeRange spanning all newly appended nodes (packages/common/core-utils/src/list.ts:196-205). For a single-item push, range.first === range.last, so this is correct today — but push appends at the tail, so last reads correctly as "the node I just appended." If a future caller ever batches multiple ids in one push, first would silently grab the head of the inserted range instead of the most recently appended node.

Suggested change
const { first: pendingNode } = this.pendingMessageIds.push(pendingMessageId);
const { last: pendingNode } = this.pendingMessageIds.push(pendingMessageId);

…tests

Address deep-review findings on PR microsoft#27415:

- Stop leaking `ListNode<number>` onto `ICellLocalOpMetadata`. Move
  `pendingNode` to a private `ICellPendingLocalOpMetadata` subtype declared
  inside cell.ts. The public/internal interfaces module no longer imports
  from `@fluidframework/core-utils/internal`, decoupling it from a runtime
  implementation detail.
- Use `last` instead of `first` when destructuring the single-node range
  returned by `DoublyLinkedList.push(...)`. They are equivalent today
  (single-item push), but `first` would be incorrect if a future change
  batched multiple pending ids into one push call.
- Add two regression tests in cell.spec.ts:
  - drain-many-ACKs: queue several local sets and ACK them incrementally
    via processSomeMessages(1); asserts no assert-tag fires (0x471) and
    final value converges across clients in order.
  - rollback-with-multiple-pending-ops: enqueue three pending sets under
    TurnBased flush mode and run containerRuntime.rollback(), confirming
    the LIFO pop against the expected pending id does not throw.
const { last: pendingNode } = this.pendingMessageIds.push(pendingMessageId);
const localMetadata: ICellPendingLocalOpMetadata<T> = {
pendingMessageId,
pendingNode,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: PR #12273 wired applyStashedOp through createLocalOpMetadata, so stashed ops push onto pendingMessageIds and emit ICellPendingLocalOpMetadata carrying a pendingNode. The new Pending op bookkeeping describe covers direct cell.set(...) submits and LIFO rollback but doesn't exercise stash → rehydrate → ACK.

This is narrower than it might look: applyStashedOp isn't an independent producer — it delegates to set()/delete()submitCellMessage()createLocalOpMetadata(), the same path the new tests already cover, so the "different list instance" failure mode is ruled out by construction. The residual risk is a future regression in stash-rehydrate sequencing through processCore's ICellPendingLocalOpMetadata cast surfacing only in production.

One applyStashedOp round-trip test in the new Pending op bookkeeping describe (drive a stashed set/delete op, then ACK, assert no 0x471 and ordering preserved) closes that gap cheaply. If you prefer to defer to the squash follow-up that first reads pendingNode, note that in the PR description so the gap is intentional.

Is there an existing stash → rehydrate → ACK regression elsewhere in the DDS test corpus that exercises applyStashedOp end-to-end through the pending-id queue? If yes, this is fully addressed by reference.

cell.get(),
undefined,
"cell should be empty after rolling back all pending sets",
);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: MockContainerRuntime.rollback is declared optional (public rollback?(): void in packages/runtime/test-runtime-utils/src/mocks.ts:449-454), so containerRuntime.rollback?.() short-circuits to undefined if the method is ever removed/renamed. assert.doesNotThrow then succeeds vacuously, and the test's whole purpose — exercising the new pop()?.data rollback path — is silently lost; the subsequent cell.get() === undefined check would fail for the wrong reason.

Guard the invocation so the test fails loudly if the mock loses the method.

Suggested change
);
assert(
typeof containerRuntime.rollback === "function",
"MockContainerRuntime.rollback must exist for this test",
);
assert.doesNotThrow(
() => containerRuntime.rollback!(),
"rollback should pop pending ids in LIFO order",
);

attribution?: AttributionKey;
}

/**
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: ICellPendingLocalOpMetadata is documented as "Kept private to this module," and its only producer is createLocalOpMetadata on SharedCell<T> (cell.ts:319-329), which threads a concrete T. The <T = any> default has no caller and forces the // eslint-disable-next-line @typescript-eslint/no-explicit-any directly above it. The pattern on SharedCell<T = any> / ISharedCell<T = any> / ICellLocalOpMetadata<T = any> exists because those are public-shaped types with external callers; this new private extension has none, and wes-carlson explicitly pushed back on any in this file during PR #10776.

Drop the = any default and the eslint suppression:

interface ICellPendingLocalOpMetadata<T> extends ICellLocalOpMetadata<T> {
	pendingNode: ListNode<number>;
}

@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Deep Review

Reviewed commit e131c9f on 2026-06-02.

Readiness: 8/10 — ALMOST READY

The number[]DoublyLinkedList<number> swap is correct and preserves the FIFO-ACK / LIFO-rollback invariant from PR #12273 (ACK head check at cell.ts:281-289, LIFO rollback at cell.ts:370-375). Two small items remain for the author — a new inline thread on the messageId/messageIdObserved watermark, plus the PR description still says "no tests touched" while cell.spec.ts adds 66 lines of new tests.

Path to Ready

  • Resolve inline threads
  • Update the PR description: drop "no tests touched", add a short bullet describing the two new Pending op bookkeeping tests in cell.spec.ts (drain-via-ACKs and LIFO-rollback), and note that pendingNode is intentionally unread until the squash follow-up. Suggested replacement below.

Context for Reviewers

  • Data-structure swap around the invariant scarlettjlee enforced in PR Local Rollback for Cell #12273: every push paired with exactly one shift (on ACK) or pop (on rollback). The diff preserves it 1:1.
  • First structural change to pendingMessageIds since Local Rollback for Cell #12273 introduced it in 2022. cell.ts is low-churn; deserves close reading rather than pattern-matching to a trivial perf refactor. Historical authorities: scarlettjlee / clarenceli-msft (Local Rollback for Cell #12273, rollback contract); wes-carlson (cell.ts.applyStashedOp implementation #10776, applyStashedOp metadata contract).
  • applyStashedOp flows through set()/delete()submitCellMessagecreateLocalOpMetadata, so stashed ops also acquire a pendingNode. Future consumers of pendingNode (squash) must account for stashed-op resubmission metadata, not only the direct set()/delete() path.
  • pendingNode is dead-on-read in this PR — neither the ACK nor rollback path consults cellOpMetadata.pendingNode; both still operate on the list head/tail directly. The field is prep for the squash follow-up where O(1) arbitrary-position removal becomes necessary.
  • The .last destructure at cell.ts:323-326 depends on DoublyLinkedList.push(...items) returning a range whose .last is the final element of items (packages/common/core-utils/src/list.ts:38-47, 87-112, 196-205). Today first === last (single-id push); the choice is forward-compatibility only. A core-utils owner is the right reviewer for whether that contract should be formally documented in list.ts.
For human reviewer
  • Needs human judgment — whether pendingNode carries its weight in this PR or should land together with the squash PR that first reads it. scarlettjlee and clarenceli-msft are the historical authorities (PR Local Rollback for Cell #12273).
  • Needs human judgment — whether DoublyLinkedList.push(...items) should formally document that the returned range's .last is the final element of items. A core-utils owner is the right reviewer.
  • Cannot be assessed by the pipeline — whether a DDS-wide perf-regression harness exists or is planned where a burst-drain test for pendingMessageIds would be in-convention. If yes, characterizing the O(n)→O(1) ACK claim with a stress test may be worth doing here rather than deferring.
  • Cannot be assessed by the pipeline — whether the attribution path (per PR Add SharedCell Attribution #13560) is preserved across the new shift-after-head-check ordering on the local-ACK branch.

Suggested description

## Description

Replaces `SharedCell`'s `pendingMessageIds: number[]` with a `DoublyLinkedList<number>` from `@fluidframework/core-utils/internal`, and extends the (module-private) local-op metadata to carry the inserted list node.

- Submit: `push()` returns a `ListNodeRange`; the metadata records the inserted node as `pendingNode`.
- ACK: `pendingMessageIds.first?.data` head check, then `shift()` — both O(1) instead of `Array.shift()`'s O(n) re-index.
- Rollback: `pop()` returns the trailing node; compare `node.data` to the rolled-back message id.

## Why

Two wins, both independent of squash:

1. `Array.shift()` is O(n) — every ACK in `SharedCell` re-indexes the entire pending tail today. `DoublyLinkedList.shift()` is O(1).
2. Carrying the node reference on `localOpMetadata` enables any future per-op operation (rollback, arbitrary-position drop, …) to be O(1) via `list.remove(node)` instead of an `indexOf` + `splice` scan.

## Notes

Structural prep — no `reSubmitSquashed`, no squash logic, no api-report changes. `pendingNode` is intentionally unread on the ACK/rollback paths in this PR; it is plumbed for the squash follow-up that first consumes it.

New tests in `cell.spec.ts` under `describe("Pending op bookkeeping")`:
- `drains many pending sets via incremental ACKs without assert and preserves ordering` — exercises the per-ACK `shift()` path.
- `rolls back multiple pending sets in LIFO order against the expected pending id` — exercises the rewritten `pop()` / `node.data` comparison.
Review history (3 prior reviews)
  • 382f2d7 2026-05-28 · 8/10 — prior inline threads resolved in d12ef64; four polish items + stale PR description remained
  • d12ef64 2026-05-27 · 9/10 — both prior inline threads resolved; only stale PR description remained
  • e4f82d7 2026-05-27 · 7/10 — encapsulation + .last threads opened; rollback/drain test gap flagged

@@ -264,9 +281,7 @@ export class SharedCell<T = any>
0x00c /* "messageId is incorrect from from the local client's ACK" */,
);
assert(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: The messageId !== messageIdObserved watermark predicate is a sound proxy for "any pending writes?" today because the lifecycle is strictly push-on-submit / shift-on-ACK / pop-on-rollback — messageId advances exactly when pendingMessageIds grows, and the head moves in lockstep with ACKs.

This PR's stated motivation (per the description) is to enable future O(1) list.remove(node) for arbitrary-position drop in the squash follow-up. As soon as squash/arbitrary-removal lands and a mid-queue entry is dropped, messageId (the last-submitted id) keeps advancing while messageIdObserved (the last-ACKed id) lags, leaving the predicate true even when no actual pending op remains — which would suppress remote-op application indefinitely.

The cross-cutting invariant — that this guard must move to !this.pendingMessageIds.empty when arbitrary-position removal lands — is easy to miss from inside a future squash PR that's focused on reSubmitSquashed plumbing. Two options:

  1. Migrate the guard to !this.pendingMessageIds.empty in this PR. Behavioral no-op today, and you're already touching every other site that reads the queue.
  2. Add a one-line // TODO: at this predicate referencing the squash/arbitrary-removal follow-up, so the next maintainer can't silently miss the dependency.

Either is fine; the goal is just to make the coupling visible at the predicate site.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant