fix(oss): parallelize entity boost searches in Memory.search #5377
Merged
Entity boost computation processed up to 8 entities sequentially, making each search wait for the previous embed+search round-trip. With remote providers this dominated search latency (measured 4-8x overhead).

Parallelize across both Python and TypeScript SDKs:
- Python async: asyncio.gather with Semaphore(4)
- Python sync: concurrent.futures.ThreadPoolExecutor(max_workers=4)
- TypeScript: Promise.allSettled

Scoring is unchanged: max() aggregation is order-independent. Individual entity failures are logged and skipped (best-effort).
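As a rough illustration of the async path named in this commit message, the sketch below bounds concurrency with a semaphore, collects failures instead of raising, and merges per-entity boosts with max(). The names `compute_entity_boosts` and `search_entity` are hypothetical and are not taken from the SDK; only `asyncio.gather`, `Semaphore(4)`, and the log-and-skip behavior come from the text above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

MAX_CONCURRENCY = 4  # mirrors the Semaphore(4) named in the commit message


async def compute_entity_boosts(entities, search_entity):
    """Run one search per entity concurrently and merge boosts with max().

    `search_entity` is a hypothetical coroutine: entity -> {memory_id: boost}.
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(entity):
        # At most MAX_CONCURRENCY searches are in flight at any time.
        async with semaphore:
            return await search_entity(entity)

    # return_exceptions=True keeps one failure from cancelling the rest.
    results = await asyncio.gather(
        *(bounded(e) for e in entities), return_exceptions=True
    )

    boosts = {}
    for entity, result in zip(entities, results):
        if isinstance(result, Exception):
            logger.warning("Entity boost search failed for %r: %s", entity, result)
            continue
        for memory_id, boost in result.items():
            # max() is order-independent, so completion order does not matter.
            boosts[memory_id] = max(boosts.get(memory_id, 0.0), boost)
    return boosts
```

Because the merge step only ever keeps the largest boost per memory, the result is the same whichever search finishes first.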
The TS SDK silently skipped rejected entity boost searches while the Python SDK logged a warning for each. Add console.warn on rejection and update the test to verify the warning fires (using a query that actually produces extractable entities).
This was referenced Jun 5, 2026
…ed calls

Batch all entity embeddings into a single API call via embed_batch(), then parallelize only the entity store searches. Reduces N embed round-trips to 1 (providers with native batching like OpenAI send a single HTTP request), cutting API calls by up to 8x at scale while keeping the search parallelism from the prior commit.
… length mismatch

- LMStudio embedBatch: add .sort((a,b) => a.index - b.index) to match the OpenAI/Azure pattern. This prevents silent embedding-entity misalignment when the server returns results out of insertion order.
- Python (sync + async): validate that embed_batch returns the same count as the input texts; log a warning and skip the boost if mismatched.
- TypeScript: same length guard before Promise.allSettled. This prevents an undefined vector being passed to entityStore.search on a short response.
- Remove dead searchResponses variable from TS test.
- Relax timing bound in concurrency test (350ms → 500ms) to avoid flakes from embedBatch overhead on slow CI runners.
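The two guards in this commit can be sketched in Python as follows. This is an illustration only: the actual LMStudio fix is in TypeScript, the helper names `order_embeddings` and `pair_entities_with_vectors` are hypothetical, and the response shape (a list of items with `index` and `embedding` fields) is an assumption based on OpenAI-compatible embedding responses.

```python
import logging

logger = logging.getLogger(__name__)


def order_embeddings(items):
    """Return vectors sorted by each response item's `index` field.

    Assumes items look like [{"index": 1, "embedding": [...]}, ...].
    """
    return [item["embedding"] for item in sorted(items, key=lambda it: it["index"])]


def pair_entities_with_vectors(entities, vectors):
    """Pair each entity with its vector, or return [] when the counts differ."""
    if len(vectors) != len(entities):
        # Skip the boost rather than pair entities with the wrong vectors.
        logger.warning(
            "embed_batch returned %d vectors for %d entities; skipping entity boost",
            len(vectors),
            len(entities),
        )
        return []
    return list(zip(entities, vectors))
```

Sorting restores input order before pairing, and the length check makes a short response degrade to "no boost" instead of a misaligned or missing vector.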
- Python sync: resolve self.entity_store once on the main thread before submitting to ThreadPoolExecutor. This prevents a concurrent lazy-init race when _entity_store is None on the first search call.
- TypeScript: strip effectiveFilters to only user_id/agent_id/run_id before querying the entity store, matching Python's search_filters behavior. Previously the full effectiveFilters was passed, which could contain processed metadata operators that entity store records don't have, causing zero entity boost results with metadata-filtered searches.
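The lazy-init fix in the first item can be pictured with a minimal sketch. `MemorySketch`, `store_factory`, and `search_entities` are hypothetical stand-ins rather than the SDK's classes; the point is only that the lazily created store is resolved once on the calling thread before any worker thread can touch it.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)


class MemorySketch:
    """Hypothetical stand-in for a class with a lazily created entity store."""

    def __init__(self, store_factory):
        self._store_factory = store_factory
        self._entity_store = None

    @property
    def entity_store(self):
        # Unsynchronized lazy init: unsafe if first touched from several threads.
        if self._entity_store is None:
            self._entity_store = self._store_factory()
        return self._entity_store

    def search_entities(self, vectors):
        store = self.entity_store  # resolve once, before any worker starts
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(store.search, v) for v in vectors]
            results = []
            for future in futures:
                try:
                    results.append(future.result())
                except Exception as exc:  # best-effort: log and skip
                    logger.warning("Entity search failed: %s", exc)
            return results
```

If the workers each read `self.entity_store` instead, several of them could see `None` at the same time on the first call and each build a store.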
great
whysosaket
approved these changes
Jun 5, 2026
Linked Issue
Closes #5214
Description
Memory.search() entity boost computation processed up to 8 extracted entities sequentially: each entity's embed + entity-store search had to complete before the next started. With remote embedding providers (50-200ms RTT per call), this created 16 serial round-trips that dominated search latency, pushing entity-rich queries into multi-second territory or request timeouts.

This PR fixes entity boost performance across both Python and TypeScript SDKs with a batch-first, parallel-search architecture, plus several correctness hardening fixes found during review.
1. Batch embedding (biggest win)
Instead of calling embed() per entity (N API round-trips), all entity texts are now embedded in a single embed_batch() / embedBatch() call. Providers with native batching (e.g. OpenAI) send one HTTP request for all 8 entities. Providers without native batching fall back to sequential embed() via EmbeddingBase.embed_batch(), so there is no regression.

At scale (1000 concurrent users × 8 entities): reduces embed API calls from 8000 → 1000 (8x).
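The fallback behavior described above can be sketched as a base class whose default embed_batch() loops over embed(), with a subclass that overrides it to make a single request. All class names here (`EmbedderSketch`, `SequentialEmbedder`, `NativeBatchEmbedder`) are hypothetical, and the `requests` counter merely simulates API round-trips; the real EmbeddingBase may differ in its details.

```python
class EmbedderSketch:
    """Hypothetical base class: embed_batch() defaults to sequential embed()."""

    def embed(self, text):
        raise NotImplementedError

    def embed_batch(self, texts):
        # Fallback for providers without native batching: one call per text,
        # results returned in input order.
        return [self.embed(t) for t in texts]


class SequentialEmbedder(EmbedderSketch):
    """Provider without native batching; inherits the fallback."""

    def __init__(self):
        self.requests = 0  # simulated API round-trips

    def embed(self, text):
        self.requests += 1
        return [float(len(text))]  # toy one-dimensional "embedding"


class NativeBatchEmbedder(SequentialEmbedder):
    """Provider with native batching; one round-trip for the whole batch."""

    def embed_batch(self, texts):
        self.requests += 1
        return [[float(len(t))] for t in texts]
```

Callers use embed_batch() in both cases, so the entity boost code does not need to know which kind of provider it is talking to.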
2. Parallel entity store searches
After batch embedding, entity store searches are parallelized:
- Python sync: ThreadPoolExecutor(max_workers=4)
- Python async: asyncio.gather + Semaphore(4)
- TypeScript: Promise.allSettled

3. Correctness hardening
- LMStudio embedBatch: add .sort((a, b) => a.index - b.index) to match OpenAI/Azure. This prevents silent embedding-entity misalignment when the server returns results out of insertion order.
- Python and TypeScript: validate that embed_batch/embedBatch returns the same number of vectors as input texts. On mismatch: logs a warning and skips entity boost gracefully (no crash, no undefined access).
- Python sync: resolve self.entity_store once on the main thread before submitting to ThreadPoolExecutor. This prevents a concurrent lazy-init race when _entity_store is None on the first search call.
- TypeScript: strip effectiveFilters to only user_id/agent_id/run_id before querying the entity store, matching Python's search_filters behavior. Previously the full effectiveFilters was passed, which could contain processed metadata operators, causing zero entity boost results on metadata-filtered searches.
- Failed entity searches are logged via logger.warning/console.warn.

Combined latency improvement
With real providers at 100-200ms RTT, worst case (8 entities) goes from ~3.2s → ~300ms.
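A back-of-envelope model helps show where figures of this size come from. It is a simplification and not the PR's measurement: it assumes every embed and every search costs exactly one RTT, and the function names are hypothetical.

```python
import math


def sequential_latency_ms(n_entities, rtt_ms):
    """Old path: one embed plus one search per entity, all serial."""
    return n_entities * 2 * rtt_ms


def batched_parallel_latency_ms(n_entities, rtt_ms, max_workers=4):
    """New path: one batched embed, then searches in waves of `max_workers`.

    Pass max_workers=None to model an uncapped fan-out.
    """
    if max_workers is None:
        waves = 1
    else:
        waves = math.ceil(n_entities / max_workers)
    return rtt_ms + waves * rtt_ms
```

Under these assumptions, 8 entities at 200ms RTT cost 3200ms on the old path, which matches the ~3.2s figure. The new path costs three RTTs with a cap of 4 workers (300ms at 100ms RTT) and two RTTs when the fan-out is uncapped.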
Key Design Decisions
- embed_batch() first, parallel search second: Separates the two I/O phases cleanly. Batch embedding is strictly better than parallelizing individual embed calls: fewer API calls, less provider rate-limit pressure, simpler thread/coroutine management.
- return_exceptions=True / Promise.allSettled: One entity failure doesn't abort others. Failed entities are logged at WARNING and skipped (best-effort).
- max() aggregation over per-entity boosts is order-independent, so concurrent completion produces identical scores to the old sequential order.

Supersedes
This PR covers the scope of all three community PRs with a unified fix:
Type of Change
Breaking Changes
N/A — no public API changes. Internal implementation detail only.
Test Coverage
Python (8 new tests in TestEntityBoostParallelism)
- test_sync_boosts_preserve_scoring: scoring math matches reference with batch embed + ThreadPoolExecutor
- test_sync_embed_batch_called_once: verifies embed_batch is called exactly once with all entity texts
- test_async_boosts_preserve_scoring: scoring math matches reference with batch embed + asyncio.gather
- test_async_embed_batch_called_once: verifies embed_batch is called exactly once (async path)
- test_sync_one_entity_failure_does_not_abort_others: partial failure resilience (sync)
- test_async_one_entity_failure_does_not_abort_others: partial failure resilience (async)
- test_sync_searches_run_concurrently: timing + peak concurrency proves overlap (sync)
- test_async_searches_run_concurrently: timing + peak concurrency proves overlap (async)

TypeScript (4 new tests in memory.entity-boost.test.ts)
- should use Promise.allSettled for concurrent entity searches
- should preserve scoring math with parallel execution
- should survive one entity search failure without losing other boosts (also verifies console.warn is called)
- should call entity searches concurrently, not sequentially

Verification
- Python: pytest tests/memory/test_main.py
- TypeScript: jest memory.entity-boost.test.ts, Prettier passes

Checklist