
fix(memory): atomic cross-process MySQL checkpoint index (resolves #301) #304

Open
fede-kamel wants to merge 1 commit into fix/mysql-pool-idle-in-transaction from fix/mysql-cross-process-index

Conversation

@fede-kamel
Contributor

Summary

Resolves #301. Makes the MySQL checkpoint index updates atomic across processes, closing the cross-process race that #300's per-instance lock could not cover.

Stacked on #303 (base branch = fix/mysql-pool-idle-in-transaction). The new integration test's DROP TABLE teardown needs #303's rollback-on-release fix to avoid the idle-in-transaction MDL hang. Retarget to main once #303 merges.

The bug

StorageBackendAdapter keeps the {thread}:_checkpoints index as a blob and updates it with a load-modify-save guarded only by a per-instance asyncio.Lock (#300). That serializes writers within one process, but two processes / adapter instances over a shared store hold independent locks, so the read-modify-write interleaves and drops index entries. The checkpoint data keys are never lost (distinct keys, no RMW) — only the index under-reports, so list_checkpoints / time-travel miss checkpoints.
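The interleaving can be illustrated without a database. The sketch below is a minimal in-memory model, not the adapter's code: `BlobIndexAdapter`, `count_after_saves`, and the dict standing in for the shared store are hypothetical names, and `asyncio.sleep(0)` stands in for the round trip between the load and the save. Two adapter instances play the role of two processes, since each holds its own lock.

```python
import asyncio


class BlobIndexAdapter:
    """Hypothetical stand-in for the lock+blob index path (not the real adapter)."""

    def __init__(self, store: dict):
        self._store = store          # shared "database"
        self._lock = asyncio.Lock()  # per-instance lock, as added in #300

    async def index_add(self, thread_id: str, checkpoint_id: str) -> None:
        key = f"{thread_id}:_checkpoints"
        async with self._lock:
            index = list(self._store.get(key, []))  # load
            await asyncio.sleep(0)                  # yield, as a DB round trip would
            index.insert(0, checkpoint_id)          # modify (newest first)
            self._store[key] = index                # save, overwriting the blob


async def count_after_saves(instances: int, saves: int = 20) -> int:
    """Run `saves` concurrent index_add calls spread over `instances` adapters."""
    store: dict = {}
    adapters = [BlobIndexAdapter(store) for _ in range(instances)]
    await asyncio.gather(
        *(adapters[i % instances].index_add("t1", f"cp-{i}") for i in range(saves))
    )
    return len(store["t1:_checkpoints"])


if __name__ == "__main__":
    print("one instance :", asyncio.run(count_after_saves(1)))
    print("two instances:", asyncio.run(count_after_saves(2)))
```

With one instance every save goes through the same lock and all 20 entries survive. With two instances, writers under different locks load the same blob and the later save overwrites the earlier one, so the index ends up short.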

Reproduced against MySQL 9.6 (20 concurrent saves to one thread):

one-instance (shared lock)     = 20, 20, 20   ✅  (#300 in-process fix works)
two-instances (separate locks) = 11, 10, 10   ❌  (cross-process loss)

The fix

Give the MySQL backend atomic index primitives and have the adapter delegate to them when present (other backends keep the lock+blob fallback unchanged):

  • index_add: a single INSERT … ON DUPLICATE KEY UPDATE using JSON_ARRAY_INSERT(data, '$.checkpoints[0]', …). The insert at index 0 is serialized by InnoDB's row lock, so concurrent cross-process writers can't clobber each other. Keeps the adapter's {"checkpoints": [...]} shape, newest-first.
  • index_remove: SELECT … FOR UPDATE + rewrite, so the index RMW is row-locked across processes.
  • list_checkpoints: de-duplicates by checkpoint_id at read time (the atomic insert doesn't de-dup a re-saved id).
  • _async_backend_op: capability detection requires a real coroutine method, so MagicMock test doubles aren't mistaken for the atomic-index backend.
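As a rough illustration of the index_add item above, the statement might be assembled as follows. This is a sketch under assumptions rather than the code in this PR: the table name `kv_store`, the columns `k` and `data`, and the helper `build_index_add` are hypothetical, `k` is assumed to be the primary key, and the `%s` placeholders assume a driver using the format paramstyle.

```python
import json

# Hypothetical schema: kv_store(k PRIMARY KEY, data JSON). The PR does not show
# the real table or column names.
INDEX_ADD_SQL = (
    "INSERT INTO kv_store (k, data) "
    "VALUES (%s, JSON_OBJECT('checkpoints', JSON_ARRAY(CAST(%s AS JSON)))) "
    "ON DUPLICATE KEY UPDATE "
    "data = JSON_ARRAY_INSERT(data, '$.checkpoints[0]', CAST(%s AS JSON))"
)


def build_index_add(thread_id: str, entry: dict) -> tuple[str, tuple]:
    """Return the upsert statement and its parameters for one index entry."""
    payload = json.dumps(entry)
    # The payload appears twice: once for the first insert of the row, once for
    # the in-place update when the row already exists.
    return INDEX_ADD_SQL, (f"{thread_id}:_checkpoints", payload, payload)
```

Because the whole read-modify-write happens inside one statement, the server applies it under the row lock and no client-side lock is involved.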

After the fix: two-instances is 20/20.

Validation

  • New integration test test_mysql_adapter_cross_process_index_no_loss (two adapter instances, one table): all 20 entries retained.
  • MySQL integration suite: 12 passed against MySQL 9.6.
  • Full unit suite green; coverage 93.3% (≥90% gate). mysql.py 97%, adapters.py 99%. Added unit tests for index_add/index_remove (atomic SQL + FOR UPDATE + missing-row no-op) and adapter delegation incl. the MagicMock-safety guard.
  • ruff, ruff format --check, hatch run typecheck all clean.

Resolves #301. The StorageBackendAdapter maintained the
{thread}:_checkpoints index blob with a load-modify-save guarded only by
a per-instance asyncio.Lock (added in #300). That serializes concurrent
writers within one process, but two processes / adapter instances over a
shared store hold independent locks, so the read-modify-write still
interleaves and drops index entries. Reproduced against MySQL 9.6:

  one-instance (shared lock)     = 20/20   (in-process fix works)
  two-instances (separate locks) = 10/20   (cross-process loss)

Give the MySQL backend atomic index primitives and have the adapter use
them when present (falling back to the lock+blob path for other backends):

- index_add: single INSERT ... ON DUPLICATE KEY UPDATE with
  JSON_ARRAY_INSERT at $.checkpoints[0]; the append is serialized by
  InnoDB's row lock, so concurrent cross-process writers can't clobber.
- index_remove: SELECT ... FOR UPDATE + rewrite so the index RMW is
  row-locked across processes.
- list_checkpoints de-duplicates by checkpoint_id at read time, since the
  atomic append does not de-dup a re-saved id.
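A minimal sketch of the read-time de-duplication, assuming each index entry is a dict with a `checkpoint_id` key and that the newest occurrence (the first one in a newest-first list) is the one to keep. The function name `dedupe_newest_first` is hypothetical.

```python
from typing import Any


def dedupe_newest_first(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first (newest) entry per checkpoint_id, preserving order."""
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for entry in entries:
        checkpoint_id = entry["checkpoint_id"]
        if checkpoint_id in seen:
            continue  # an older copy of a re-saved checkpoint
        seen.add(checkpoint_id)
        unique.append(entry)
    return unique
```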

Capability detection requires a real coroutine method (_async_backend_op),
so MagicMock test doubles don't get mistaken for the atomic-index backend.
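One way such a check could look, as a sketch only: `has_async_op` and `AtomicBackend` are hypothetical names, and the real `_async_backend_op` may be implemented differently. The point is that a plain `hasattr` test would succeed on a `MagicMock`, which fabricates any attribute on access, whereas `inspect.iscoroutinefunction` only accepts a real `async def` method.

```python
import inspect
from unittest.mock import MagicMock


def has_async_op(backend: object, name: str) -> bool:
    """True only if backend.<name> is a real `async def` method."""
    return inspect.iscoroutinefunction(getattr(backend, name, None))


class AtomicBackend:
    """Hypothetical backend exposing an atomic index primitive."""

    async def index_add(self, key: str, entry: dict) -> None:
        ...


if __name__ == "__main__":
    print(has_async_op(AtomicBackend(), "index_add"))
    print(has_async_op(MagicMock(), "index_add"))
```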

With the fix, two-instances is 20/20. Stacked on the #303 pool fix (the
integration test's DROP TABLE teardown needs the rollback-on-release fix
to avoid the idle-in-transaction MDL hang).

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
oracle-contributor-agreement (bot) added the "OCA Verified" label (all contributors have signed the Oracle Contributor Agreement) on Jun 4, 2026.