fix(oss): parallelize entity boost searches in Memory.search #5377

Merged. kartik-mem0 merged 5 commits into main from fix/parallelize-entity-boost-searches on Jun 5, 2026.

@kartik-mem0 kartik-mem0 commented Jun 4, 2026

Linked Issue

Closes #5214

Description

Memory.search() entity boost computation processed up to 8 extracted entities sequentially — each entity's embed + entity-store search had to complete before the next started. With remote embedding providers (50-200ms RTT per call), this created 16 serial round-trips that dominated search latency, pushing entity-rich queries into multi-second territory or request timeouts.

This PR fixes entity boost performance across both Python and TypeScript SDKs with a batch-first, parallel-search architecture, plus several correctness hardening fixes found during review.

1. Batch embedding (biggest win)

Instead of calling embed() per entity (N API round-trips), all entity texts are now embedded in a single embed_batch() / embedBatch() call. Providers with native batching (e.g. OpenAI) send one HTTP request for all 8 entities. Providers without native batching fall back to sequential embed() via EmbeddingBase.embed_batch() — no regression.

At scale (1000 concurrent users × 8 entities): reduces embed API calls from 8000 → 1000 (8x).
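
The call-count difference can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins written for this example, not the SDK's actual code; they only assume, as described above, that a base-class embed_batch() falls back to per-text embed() calls and that a provider with native batching overrides it.

```python
from typing import List, Sequence


class EmbeddingBase:
    """Hypothetical stand-in for an embedder base class (illustration only)."""

    def embed(self, text: str) -> List[float]:
        raise NotImplementedError

    def embed_batch(self, texts: Sequence[str]) -> List[List[float]]:
        # Default fallback: one embed() call per text, in input order.
        return [self.embed(text) for text in texts]


class CountingEmbedder(EmbeddingBase):
    """Toy provider without native batching; counts simulated API calls."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> List[float]:
        self.calls += 1
        return [float(len(text))]  # fake one-dimensional "embedding"


class NativeBatchEmbedder(CountingEmbedder):
    """Toy provider with native batching: one simulated request per batch."""

    def embed_batch(self, texts: Sequence[str]) -> List[List[float]]:
        self.calls += 1
        return [[float(len(text))] for text in texts]


entities = ["alice", "bob", "paris"]
fallback = CountingEmbedder()
native = NativeBatchEmbedder()
fallback_vectors = fallback.embed_batch(entities)  # 3 simulated calls
native_vectors = native.embed_batch(entities)      # 1 simulated call
```

Both embedders return the same vectors in the same order; only the number of simulated round-trips differs.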

2. Parallel entity store searches

After batch embedding, entity store searches are parallelized:

| SDK | Mechanism | Concurrency cap |
|---|---|---|
| Python (sync) | ThreadPoolExecutor(max_workers=4) | 4 |
| Python (async) | asyncio.gather + Semaphore(4) | 4 |
| TypeScript | Promise.allSettled | unbounded (JS single-threaded) |
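
As a rough sketch of the two Python mechanisms, the snippet below runs a hypothetical blocking search function through a capped thread pool and through asyncio.gather guarded by a semaphore. The function search_entity_store and the integer "vectors" are placeholders invented for this example; only the cap of 4 and the choice of primitives come from this PR.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4  # cap described in this PR


def search_entity_store(vector):
    """Hypothetical blocking entity-store search; returns a fake hit list."""
    return [("memory-%d" % vector, vector / 10)]


def search_all_sync(vectors):
    # pool.map keeps results in input order even if calls finish out of order.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        return list(pool.map(search_entity_store, vectors))


async def search_all_async(vectors):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(vector):
        async with semaphore:  # at most MAX_CONCURRENCY searches in flight
            # Run the blocking call in a worker thread (Python 3.9+).
            return await asyncio.to_thread(search_entity_store, vector)

    # gather also returns results in the order the awaitables were passed.
    return await asyncio.gather(*(bounded(v) for v in vectors))


sync_results = search_all_sync([1, 2, 3])
async_results = asyncio.run(search_all_async([1, 2, 3]))
```

Both helpers preserve input order, so result i can be matched back to entity i regardless of completion order.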

3. Correctness hardening

  • LMStudio embedBatch sort: Added .sort((a, b) => a.index - b.index) to match OpenAI/Azure — prevents silent embedding-entity misalignment when the server returns results out of insertion order.
  • Batch length guard: All 3 code paths validate that embed_batch / embedBatch returns the same number of vectors as input texts. On mismatch: logs a warning and skips entity boost gracefully (no crash, no undefined access).
  • entity_store race fix (Python sync): Resolves self.entity_store once on the main thread before submitting to ThreadPoolExecutor — prevents concurrent lazy-init race when _entity_store is None on first search call.
  • TS entity search filter scoping: Strips effectiveFilters to only user_id/agent_id/run_id before querying the entity store, matching Python's search_filters behavior. Previously passed full effectiveFilters which could contain processed metadata operators, causing zero entity boost results on metadata-filtered searches.
  • Consistent failure logging: Both Python and TypeScript now log individual entity boost failures via logger.warning / console.warn.
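
Two of the guards above (sorting batch results by index and checking the returned vector count) can be sketched in a few lines. The helper names and the response shape, a list of dicts with index and embedding keys, are assumptions made for this illustration and are not taken from the PR's code.

```python
import logging

logger = logging.getLogger("entity_boost_sketch")


def vectors_in_input_order(response_items):
    """Sort batch response items by 'index' so vector i matches input text i."""
    ordered = sorted(response_items, key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


def pair_entities_with_vectors(entities, vectors):
    """Return (entity, vector) pairs, or [] when the counts do not match."""
    if len(vectors) != len(entities):
        logger.warning(
            "embed_batch returned %d vectors for %d entities; skipping entity boost",
            len(vectors),
            len(entities),
        )
        return []
    return list(zip(entities, vectors))


# A server that answers out of insertion order.
shuffled = [
    {"index": 1, "embedding": [0.2]},
    {"index": 0, "embedding": [0.1]},
]
vectors = vectors_in_input_order(shuffled)
pairs = pair_entities_with_vectors(["alice", "bob"], vectors)
skipped = pair_entities_with_vectors(["alice", "bob", "carol"], vectors)
```

Without the sort, "alice" would be paired with the vector computed for "bob"; without the length check, the third entity would have no vector to search with.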

Combined latency improvement

| Entities | Old (sequential) | New (batch + parallel) | Speedup |
|---|---|---|---|
| 1 | ~200ms | ~200ms | 1x |
| 4 | ~800ms | ~200ms | 4x |
| 8 (max) | ~1600ms | ~300ms | 5x |

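The table assumes roughly 100ms per call. At 200ms RTT, the sequential worst case (8 entities, 16 round-trips) reaches ~3.2s, while the batch + parallel path needs one embed call plus two waves of four searches under the Python cap: about three round-trips, so ~300ms at 100ms RTT and ~600ms at 200ms RTT.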
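
The latency figures can be reproduced with a simple back-of-the-envelope model. This is only a sketch under stated assumptions: every embed or search call costs one fixed round-trip, searches run in waves of at most 4 (the Python cap), and local compute is ignored. The function names are made up for this example.

```python
import math


def sequential_ms(entities, rtt_ms):
    # Old path: one embed call plus one search call per entity, back to back.
    return entities * 2 * rtt_ms


def batched_parallel_ms(entities, rtt_ms, cap=4):
    # New path: one batched embed call, then searches in waves of `cap`.
    waves = math.ceil(entities / cap)
    return rtt_ms + waves * rtt_ms


# (entities, old latency, new latency) at an assumed 100 ms per call.
rows = [(n, sequential_ms(n, 100), batched_parallel_ms(n, 100)) for n in (1, 4, 8)]
```

At 100 ms per call, rows matches the three table rows: (1, 200, 200), (4, 800, 200), and (8, 1600, 300).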

Key Design Decisions

  • embed_batch() first, parallel search second: Separates the two I/O phases cleanly. Batch embedding is strictly better than parallelizing individual embed calls — fewer API calls, less provider rate-limit pressure, simpler thread/coroutine management.
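  • Concurrency cap of 4 (Python): Bounds the number of simultaneous entity-store searches issued by a single query. Embedding is already a single batched call, so the cap applies to the search phase only. Hardcoded rather than configurable to keep the API surface clean.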
  • return_exceptions=True / Promise.allSettled: One entity failure doesn't abort others. Failed entities are logged at WARNING and skipped (best-effort).
  • Scoring is unchanged: max() aggregation over per-entity boosts is order-independent, so concurrent completion produces identical scores to the old sequential order.
  • No opt-in flag: Parallelism is always-on. The semaphore/pool cap provides the safety valve without requiring users to discover and set a config flag.
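
The best-effort failure handling and the order-independence of max() aggregation can be sketched together. Everything below is illustrative: search_entity, FAKE_HITS, and the boost values are invented for this example, and only the use of asyncio.gather with return_exceptions=True, a Semaphore(4), and max() aggregation reflects the decisions listed above.

```python
import asyncio
import logging

logger = logging.getLogger("entity_boost_sketch")

# Hypothetical per-entity boosts keyed by memory id.
FAKE_HITS = {
    "alice": {"m1": 0.4, "m2": 0.1},
    "bob": {"m1": 0.7},
}


async def search_entity(entity):
    """Hypothetical entity search; the entity named 'broken' always fails."""
    if entity == "broken":
        raise RuntimeError("simulated entity-store failure")
    return FAKE_HITS.get(entity, {})


async def compute_boosts(entities):
    semaphore = asyncio.Semaphore(4)

    async def bounded(entity):
        async with semaphore:
            return await search_entity(entity)

    results = await asyncio.gather(
        *(bounded(e) for e in entities), return_exceptions=True
    )
    boosts = {}
    for entity, result in zip(entities, results):
        if isinstance(result, BaseException):
            # One failed entity is logged and skipped; the rest still count.
            logger.warning("entity boost failed for %r: %s", entity, result)
            continue
        for memory_id, boost in result.items():
            # max() is commutative, so completion order cannot change the score.
            boosts[memory_id] = max(boosts.get(memory_id, 0.0), boost)
    return boosts


forward = asyncio.run(compute_boosts(["alice", "broken", "bob"]))
backward = asyncio.run(compute_boosts(["bob", "broken", "alice"]))
```

Both orderings yield the same boosts, and the failing entity contributes nothing rather than aborting the search.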

Supersedes

This PR covers the scope of all three community PRs with a unified fix:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactor (no functional changes)
  • Documentation update

Breaking Changes

N/A — no public API changes. Internal implementation detail only.

Test Coverage

  • I added/updated unit tests
  • I added/updated integration tests
  • I tested manually (describe below)
  • No tests needed (explain why)

Python (8 new tests in TestEntityBoostParallelism)

  • test_sync_boosts_preserve_scoring — scoring math matches reference with batch embed + ThreadPoolExecutor
  • test_sync_embed_batch_called_once — verifies embed_batch is called exactly once with all entity texts
  • test_async_boosts_preserve_scoring — scoring math matches reference with batch embed + asyncio.gather
  • test_async_embed_batch_called_once — verifies embed_batch is called exactly once (async path)
  • test_sync_one_entity_failure_does_not_abort_others — partial failure resilience (sync)
  • test_async_one_entity_failure_does_not_abort_others — partial failure resilience (async)
  • test_sync_searches_run_concurrently — timing + peak concurrency proves overlap (sync)
  • test_async_searches_run_concurrently — timing + peak concurrency proves overlap (async)
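
For readers unfamiliar with the "timing + peak concurrency" approach, here is one way such a check could look. It is a standalone sketch, not the PR's test code: ConcurrencyProbe is a hypothetical helper that records how many calls overlap while each call sleeps briefly to simulate I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Hypothetical fake search that tracks how many calls run at once."""

    def __init__(self, delay_s=0.05):
        self._lock = threading.Lock()
        self._delay_s = delay_s
        self.active = 0
        self.peak = 0

    def __call__(self, vector):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(self._delay_s)  # simulated network round-trip
        with self._lock:
            self.active -= 1
        return vector


probe = ConcurrencyProbe()
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(probe, range(8)))
elapsed_s = time.perf_counter() - start
```

A peak above 1 shows the calls overlapped, a peak of at most 4 shows the cap held, and elapsed_s well below the 0.4 s a sequential run would need gives the timing evidence. Timing bounds should stay generous to avoid flakes on slow CI runners.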

TypeScript (4 new tests in memory.entity-boost.test.ts)

  • should use Promise.allSettled for concurrent entity searches
  • should preserve scoring math with parallel execution
  • should survive one entity search failure without losing other boosts — also verifies console.warn is called
  • should call entity searches concurrently, not sequentially

Verification

  • Python: 42/42 passed (pytest tests/memory/test_main.py)
  • TypeScript: 4/4 passed (jest memory.entity-boost.test.ts), Prettier passes
  • Ruff + isort pass

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation if needed

Entity boost computation processed up to 8 entities sequentially,
making each search wait for the previous embed+search round-trip.
With remote providers this dominated search latency (measured 4-8x
overhead). Parallelize across both Python and TypeScript SDKs:

- Python async: asyncio.gather with Semaphore(4)
- Python sync: concurrent.futures.ThreadPoolExecutor(max_workers=4)
- TypeScript: Promise.allSettled

Scoring is unchanged — max() aggregation is order-independent.
Individual entity failures are logged and skipped (best-effort).

The TS SDK silently skipped rejected entity boost searches while the
Python SDK logged a warning for each. Add console.warn on rejection
and update the test to verify the warning fires (using a query that
actually produces extractable entities).

…ed calls

Batch all entity embeddings into a single API call via embed_batch(),
then parallelize only the entity store searches. Reduces N embed
round-trips to 1 (providers with native batching like OpenAI send a
single HTTP request), cutting API calls by up to 8x at scale while
keeping the search parallelism from the prior commit.

… length mismatch

- LMStudio embedBatch: add .sort((a,b) => a.index - b.index) to match
  OpenAI/Azure pattern — prevents silent embedding-entity misalignment
  when the server returns results out of insertion order.
- Python (sync + async): validate embed_batch returns same count as
  input texts; log warning and skip boost if mismatched.
- TypeScript: same length guard before Promise.allSettled — prevents
  undefined vector passed to entityStore.search on short response.
- Remove dead searchResponses variable from TS test.
- Relax timing bound in concurrency test (350ms → 500ms) to avoid
  flakes from embedBatch overhead on slow CI runners.

- Python sync: resolve self.entity_store once on the main thread before
  submitting to ThreadPoolExecutor — prevents concurrent lazy-init race
  when _entity_store is None on first search call.
- TypeScript: strip effectiveFilters to only user_id/agent_id/run_id
  before querying the entity store, matching Python's search_filters
  behavior. Previously passed full effectiveFilters which could contain
  processed metadata operators that entity store records don't have,
  causing zero entity boost results with metadata-filtered searches.
@markymark2001

great

@kartik-mem0 kartik-mem0 merged commit d817aa9 into main Jun 5, 2026
14 checks passed
@kartik-mem0 kartik-mem0 deleted the fix/parallelize-entity-boost-searches branch June 5, 2026 15:33
Development

Successfully merging this pull request may close these issues.

Parallelize or batch entity boost searches in AsyncMemory.search