
feat(oss): opt-in parallel entity-boost in Memory.search() — 2-4x recall speedup #5046

Closed
DmitryPogodaev wants to merge 1 commit into mem0ai:main from DmitryPogodaev:feat/parallel-entity-boost

@DmitryPogodaev

feat(oss): opt-in parallel entity-boost in Memory.search() — 2-4x recall speedup with remote embedders

Problem

Memory.search() runs the entity-boost computation as a sequential for...of loop that awaits each embed call before starting the next:

for (const entity of deduped) {
  const entityEmbedding = await this.embedder.embed(entity.text);  // SEQUENTIAL
  const matches = await entityStore.search(entityEmbedding, 500, ...);
  // accumulate boosts...
}

With remote embedders this becomes the dominant latency. We measured a single
recall for an entity-rich query (9 embed calls: 8 entities plus the query)
taking ~6.5 seconds, with ~95% of that time spent waiting on serial embed
round trips.

The block has up to 8 iterations (.slice(0, 8)) plus the initial query embed,
so worst case is 9 sequential embedder.embed() calls per search.

This regressed real production latency for us when v3.0.0 added the multi-signal
hybrid retrieval (entity boost). Before v3.0.0 a single Memory.search() did
exactly one embed call.

Fix

Add an opt-in config flag parallelEntityBoost (default: false, preserves
upstream behavior). When true, the entity-boost loop runs via
Promise.all(deduped.map(...)).
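
A minimal sketch of the gated loop, for illustration only and not the actual patch. The `Embedder`, `EntityMatch` and `EntityStore` interfaces and the `computeEntityBoosts` helper are hypothetical stand-ins for whatever Memory.search() uses internally, and using the match score directly as the boost is an assumption.

```typescript
// Hypothetical stand-ins for the embedder and entity store used by Memory.search().
interface Embedder {
  embed(text: string): Promise<number[]>;
}
interface EntityMatch {
  memId: string;
  score: number;
}
interface EntityStore {
  search(vector: number[], limit: number): Promise<EntityMatch[]>;
}

async function computeEntityBoosts(
  entities: { text: string }[],
  embedder: Embedder,
  entityStore: EntityStore,
  parallelEntityBoost: boolean,
): Promise<Record<string, number>> {
  const entityBoosts: Record<string, number> = {};

  // One entity: embed, search, then fold the matches into the shared map.
  const processEntity = async (entity: { text: string }): Promise<void> => {
    const entityEmbedding = await embedder.embed(entity.text);
    const matches = await entityStore.search(entityEmbedding, 500);
    for (const { memId, score } of matches) {
      // Read, max and assignment happen with no await in between.
      const prev = entityBoosts[memId] ?? 0;
      entityBoosts[memId] = Math.max(prev, score);
    }
  };

  if (parallelEntityBoost) {
    // Every embed call is started before any of them is awaited.
    await Promise.all(entities.map(processEntity));
  } else {
    // Default path: each entity waits for the previous one to finish.
    for (const entity of entities) {
      await processEntity(entity);
    }
  }
  return entityBoosts;
}
```

Both branches share `processEntity`, so the flag only changes scheduling, not the boost values.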

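Safety: each iteration writes to entityBoosts[memId] = Math.max(prev, boost).
Math.max is commutative and associative, so the final value does not depend on
the order in which entities finish. The writes are also safe under JS's
single-threaded event loop: the read of prev, the Math.max and the assignment
run in one synchronous step with no await in between, so interleaved Promise
resolutions cannot race.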
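
The order-independence claim can be checked in isolation. The sketch below is illustrative: `BoostUpdate` and `applyBoosts` are hypothetical names, the sample values are made up, and it assumes boosts are non-negative so that a missing entry can default to 0.

```typescript
// Hypothetical update record: one boost contribution for one memory id.
type BoostUpdate = { memId: string; boost: number };

// Fold updates into a map using the same max-merge rule as the loop.
function applyBoosts(updates: BoostUpdate[]): Record<string, number> {
  const entityBoosts: Record<string, number> = {};
  for (const { memId, boost } of updates) {
    const prev = entityBoosts[memId] ?? 0; // assumes boosts are >= 0
    entityBoosts[memId] = Math.max(prev, boost);
  }
  return entityBoosts;
}

// Made-up sample data: two memories, each boosted by two entities.
const updates: BoostUpdate[] = [
  { memId: "a", boost: 0.2 },
  { memId: "b", boost: 0.7 },
  { memId: "a", boost: 0.9 },
  { memId: "b", boost: 0.1 },
];

// Applying the same updates in opposite arrival orders gives the same values.
const forward = applyBoosts(updates);
const reversed = applyBoosts([...updates].reverse());
console.log(forward, reversed);
```

A merge rule such as summing would also be order-independent, but one that overwrites (last write wins) would not be; the guarantee comes from max, not from Promise.all.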

Default kept at false to:

  • preserve back-compat for users with rate-limited embedders (e.g. tight
    OpenAI plans, single-slot Ollama)
  • avoid a surprise behavior change in patch-level upgrades

Users with parallel-friendly embedders (managed services, multi-slot ollama,
batched embedder backends) opt in via:

const memory = new Memory({
  ...,
  parallelEntityBoost: true,
});

Measurements

Reproduced on production setup: ollama embed:latest (qwen3-embedding 4B Q4) on
RTX 5090 with OLLAMA_NUM_PARALLEL=2, accessed via WireGuard tunnel
(~218ms RTT), through a multi-threaded HTTP proxy.

Same prompts, same Qdrant collection (~15k memories), same gateway process —
only difference is parallelEntityBoost flag flipped:

| prompt (chars) | embed_calls | sequential (ms) | parallel (ms) | speedup |
|---|---|---|---|---|
| 320 (entity-rich) | 7 | 5852 | 2089 | 2.80x |
| 475 (entity-rich) | 9 | 6561 | 1595 | 4.11x |
| 821 (long+entity) | 9 | 6595 | 2557 | 2.58x |
| 633 (entity-rich, agent dev) | 9 | 6669 | 2540 | 2.63x |

Sequential embed_ms (sum of all per-call durations) ≈ wallclock total_ms
in baseline (each call blocks the next). Parallel embed_ms (sum) is
2-3x larger than wallclock — direct evidence the calls overlap.
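
The sum-versus-wallclock comparison can be expressed as a small helper. This is a sketch under assumptions: `CallSpan` and `overlapFactor` are hypothetical names, the wallclock here covers only the window spanned by the embed calls, and the sample spans are made-up numbers, not the measurements above.

```typescript
// Hypothetical per-call timing record: start and end in ms on a shared clock.
interface CallSpan {
  startMs: number;
  endMs: number;
}

// Ratio of summed per-call durations to the wallclock window they span.
// ~1 means the calls ran back to back; >1 means they overlapped.
function overlapFactor(spans: CallSpan[]): number {
  if (spans.length === 0) return 0;
  const sumMs = spans.reduce((acc, s) => acc + (s.endMs - s.startMs), 0);
  const firstStart = Math.min(...spans.map((s) => s.startMs));
  const lastEnd = Math.max(...spans.map((s) => s.endMs));
  const wallMs = lastEnd - firstStart;
  return wallMs > 0 ? sumMs / wallMs : 0;
}

// Made-up spans: three 700 ms calls, first back to back, then overlapping.
const sequentialSpans: CallSpan[] = [
  { startMs: 0, endMs: 700 },
  { startMs: 700, endMs: 1400 },
  { startMs: 1400, endMs: 2100 },
];
const overlappedSpans: CallSpan[] = [
  { startMs: 0, endMs: 700 },
  { startMs: 0, endMs: 700 },
  { startMs: 50, endMs: 750 },
];

console.log(overlapFactor(sequentialSpans), overlapFactor(overlappedSpans));
```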

For prompts that extract no entities (≤1 embed call, common for short
conversational queries), behavior is identical — no regression possible
since the patched branch is only entered when deduped.length > 0.

Files changed

  • mem0-ts/src/oss/src/types/index.ts — add parallelEntityBoost to
    MemoryConfigSchema
  • mem0-ts/src/oss/src/config/manager.ts — propagate the flag through
    ConfigManager.mergeConfig (default false)
  • mem0-ts/src/oss/src/memory/index.ts — gate the entity-boost loop on
    this.config.parallelEntityBoost

Backward compatibility

  • Default is false → identical to current behavior
  • Flag is optional in schema → existing configs validate unchanged
  • No public API surface change for users who don't opt in
  • No dependency changes
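
How an optional flag stays backward compatible can be sketched as follows. `MemoryConfigSketch` and `resolveParallelEntityBoost` are hypothetical, trimmed-down names for illustration; they are not the real MemoryConfigSchema or ConfigManager.mergeConfig.

```typescript
// Hypothetical, trimmed-down config shape; the real config has many more fields.
interface MemoryConfigSketch {
  parallelEntityBoost?: boolean;
}

// Resolve the optional flag to a concrete boolean, defaulting to false so that
// configs written before the flag existed keep the sequential behavior.
function resolveParallelEntityBoost(userConfig: MemoryConfigSketch = {}): boolean {
  return userConfig.parallelEntityBoost ?? false;
}

console.log(resolveParallelEntityBoost({}));
console.log(resolveParallelEntityBoost({ parallelEntityBoost: true }));
```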

… speedup)

Adds Memory config flag `parallelEntityBoost` (default false). When true,
the entity-boost embed+search loop in Memory.search() runs concurrently via
Promise.all instead of sequentially. With remote embedders this turns
N+1 sequential RTTs into ~1 RTT.

Measured on production setup (ollama embed:latest, query with 9 embed calls):
- sequential: 6595ms
- parallel:   2557ms (2.58x speedup)

Safety: per-iteration writes go to entityBoosts[memId] = Math.max(prev, boost)
which is order-independent under interleaved single-threaded JS writes.

Default kept at false to preserve back-compat for users with rate-limited
or single-slot embedder backends.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Dmitry Pogodaev seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@markymark2001

please merge this 🙏 🥺

@kartik-mem0
Contributor

Closing — superseded by #5377, which covers both Python and TypeScript SDKs with a unified fix (always-on parallelism with concurrency cap, no opt-in flag needed). Thank you for the contribution!
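
For readers unfamiliar with the idea of parallelism with a concurrency cap, a generic sketch follows. It is not taken from #5377, whose implementation is not shown in this thread; `mapWithConcurrency` is a hypothetical helper name and the limit of 2 in the usage is arbitrary.

```typescript
// Run fn over items with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed index until none remain.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await fn(items[i] as T, i);
    }
  };

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage with a stand-in async function instead of a real embedder.
mapWithConcurrency(["alice", "bob", "carol"], 2, async (text) => text.length).then(
  (lengths) => console.log(lengths),
);
```

A cap like this keeps single-slot or rate-limited backends from being flooded while still overlapping calls where the backend allows it.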

@kartik-mem0 kartik-mem0 closed this Jun 5, 2026
