perf(common): replace BitSet with a fixed word array in the ported bloom filter #19140

Merged
voonhous merged 2 commits into apache:master from yihua:bloom-filter-word-bits on Jul 3, 2026

Conversation

@yihua yihua commented Jul 2, 2026

Contributor

Describe the issue this Pull Request addresses

Fixes #19139 (part of #14367).

InternalBloomFilter, the port of Hadoop's BloomFilter whose serialized bytes must stay compatible in both directions, kept its bit vector in a java.util.BitSet and converted it to and from bytes one bit at a time. Every add paid per-probe BitSet bookkeeping (expandTo/ensureCapacity/wordsInUse), and every serialization or deserialization looped bit-by-bit over vectorSize (millions of iterations per filter). These paths run for every record key added during HFile base file writing in metadata table compaction, every parquet footer read by the bloom index, and every metadata table bloom filter partition record. CPU profiling of metadata table compaction attributed about 10% of executor CPU to BitSet.ensureCapacity under BloomFilter.add.

Summary and Changelog

  • InternalBloomFilter: store the bit vector in a fixed-size long[] word array. Bit i lives at words[i >> 6] under mask 1L << (i & 63); the serialized layout (bit i at byte i >> 3 under mask 1 << (i & 7)) is the little-endian byte view of the array, so write/readFields translate by byte position alone and the serialized bytes are unchanged.
  • add/membershipTest become one mask operation per probe with no growth or bookkeeping logic; and/or/xor/not become word-wise loops; write/readFields pack and unpack one byte per iteration instead of one bit.
  • Bits at positions at or beyond vectorSize are kept zero (clearUnusedBits), preserving the previous reader's behavior of ignoring unused trailing bits in the last serialized byte.
  • SimpleBloomFilter and HoodieDynamicBoundedBloomFilter compose InternalBloomFilter and inherit the change without modification.
  • New TestInternalBloomFilter:
    • differential tests against a java.util.BitSet oracle replicating the previous layout (serialized bits, membership on present and absent keys, and/or/xor/not) across word and byte boundary sizes (63/64/65/127/128/1000/43133 bits, up to 30 hash functions);
    • write/readFields round trips at boundary sizes;
    • unused-trailing-bits semantics (mutated trailing bits are ignored and normalized on re-serialization);
    • golden serializeToString fixtures for SIMPLE and DYNAMIC_V0 filters captured from the previous implementation, asserted byte-identical and re-deserializable through BloomFilterFactory.fromString.
  • New InternalBloomFilterBenchmark: manual microbenchmark for adds, membership tests, and serde. The class name does not match the surefire patterns so it never runs in CI; run it explicitly with mvn test -pl hudi-common -Dtest=InternalBloomFilterBenchmark -Dsurefire.failIfNoSpecifiedTests=false.
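
The layout described in the bullets above can be illustrated with a minimal sketch. This is not the PR's code: the class name `WordBitVectorSketch` and every method name except `clearUnusedBits` are hypothetical, and the sketch assumes the serialized byte array has exactly `ceil(vectorSize / 8)` bytes. It only shows why the byte view of the word array reproduces the "bit i at byte i >> 3 under mask 1 << (i & 7)" layout, and why unused trailing bits need clearing after a read or a bitwise NOT.

```java
/** Hypothetical illustration of the word-array layout; not the actual InternalBloomFilter. */
public class WordBitVectorSketch {
    private final int vectorSize; // number of logical bits
    private final long[] words;   // bit i lives at words[i >> 6] under mask 1L << (i & 63)

    public WordBitVectorSketch(int vectorSize) {
        this.vectorSize = vectorSize;
        this.words = new long[(vectorSize + 63) >>> 6];
    }

    /** One mask operation per probe, with no growth logic. */
    public void set(int i) {
        words[i >> 6] |= 1L << (i & 63);
    }

    public boolean get(int i) {
        return (words[i >> 6] & (1L << (i & 63))) != 0;
    }

    /** Word-wise NOT; bits at or beyond vectorSize are cleared afterwards. */
    public void not() {
        for (int w = 0; w < words.length; w++) {
            words[w] = ~words[w];
        }
        clearUnusedBits();
    }

    /** Packs one byte per iteration: byte b holds bits [8b, 8b + 8) of word b >> 3. */
    public byte[] toBytes() {
        byte[] out = new byte[(vectorSize + 7) >>> 3];
        for (int b = 0; b < out.length; b++) {
            out[b] = (byte) (words[b >> 3] >>> ((b & 7) << 3));
        }
        return out;
    }

    /** Unpacks one byte per iteration; assumes bytes.length == ceil(vectorSize / 8). */
    public static WordBitVectorSketch fromBytes(int vectorSize, byte[] bytes) {
        WordBitVectorSketch v = new WordBitVectorSketch(vectorSize);
        for (int b = 0; b < bytes.length; b++) {
            v.words[b >> 3] |= (bytes[b] & 0xFFL) << ((b & 7) << 3);
        }
        v.clearUnusedBits(); // ignore unused trailing bits in the last serialized byte
        return v;
    }

    private void clearUnusedBits() {
        int tail = vectorSize & 63;
        if (tail != 0) {
            words[words.length - 1] &= (1L << tail) - 1;
        }
    }
}
```

With a 65-bit vector, setting bits 0, 9, and 64 in this sketch yields bytes 0, 1, and 8 equal to 0x01, 0x02, and 0x01, which is the same placement a bit-by-bit writer would produce.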

Impact

Serialized bloom filter bytes are unchanged in both directions: golden fixtures captured before the change still pass, and every false-positive count in the benchmark below is identical before and after, which indicates identical membership semantics.

InternalBloomFilterBenchmark results on JDK 11 / Apple Silicon, medians of 3 measured rounds after warmup, this branch vs its base commit on master:

| Scenario | Metric | BitSet (master) | long[] (this PR) | Change |
| --- | --- | --- | --- | --- |
| SIMPLE, 1M entries, fpp 1e-3 | 1M adds | 159 ms | 139 ms | 1.14x |
| | serialize / deserialize | 59 / 66 ms | 2 / 2 ms | ~30x |
| SIMPLE, 10M entries, fpp 1e-9 | 10M adds | 8439 ms | 6151 ms | 1.37x |
| | serialize / deserialize | 594 / 708 ms | 55 / 62 ms | ~11x |
| DYNAMIC_V0, 60K entries, fpp 1e-9, max 100K | 10M adds | 4130 ms | 4056 ms | neutral |
| | serialize / deserialize | 6 / 8 ms | under 1 ms | >6x |

Adds improve most where the bit vector is large and the probe count is high (fpp 1e-9 means about 30 probes per key). The third scenario mirrors the current metadata table HFile writer defaults; there the bounded dynamic filter saturates (99.6% false positives on absent keys in the benchmark), so adds are dominated by hashing into tiny cache-resident rows. Right-sizing that filter is the follow-up tracked in #19139.
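
As a rough check on the "about 30 probes per key" figure, the textbook bloom filter sizing formulas can be evaluated directly. This is a sketch under the assumption that the filter follows the standard optimum (k = log2(1/p) probes and m = -n ln p / (ln 2)^2 bits); the class and method names are hypothetical, and the exact rounding used by the real sizing code is not confirmed by this PR.

```java
/** Hypothetical helper evaluating the textbook bloom filter sizing formulas. */
public class ProbeCountSketch {
    /** Optimal probe count k = log2(1 / fpp), rounded up. */
    static int optimalProbes(double fpp) {
        return (int) Math.ceil(-Math.log(fpp) / Math.log(2));
    }

    /** Optimal bit count m = -n * ln(fpp) / (ln 2)^2, rounded up. */
    static long optimalBits(long entries, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(entries * -Math.log(fpp) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        System.out.println(optimalProbes(1e-3)); // prints 10
        System.out.println(optimalProbes(1e-9)); // prints 30
        System.out.println(optimalBits(10_000_000L, 1e-9) + " bits");
    }
}
```

Under these formulas the fpp 1e-9 scenarios do roughly three times as many probes per key as the fpp 1e-3 scenario, which is consistent with adds benefiting more there.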

Risk Level

low

Byte-level equivalence is pinned by the oracle differential tests and the pre-change golden fixtures. TestInternalBloomFilter, TestInternalDynamicBloomFilter, and the HFile suites that write and read blooms through real files (TestHoodieHFileReaderWriter in hudi-hadoop-common plus the hudi-io and hudi-common HFile tests) all pass.
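
The idea behind an oracle differential test can be sketched as follows. This is an illustration only, not the code in TestInternalBloomFilter: the class and method names are hypothetical, the oracle is a plain bit-by-bit writer over a java.util.BitSet using the layout stated in the summary, and random bit positions stand in for hash probes.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

/** Hypothetical differential check: BitSet oracle versus word-array packing. */
public class OracleDifferentialSketch {
    /** Oracle: one iteration per bit, bit i at byte i >> 3 under mask 1 << (i & 7). */
    static byte[] serializeBitByBit(BitSet bits, int vectorSize) {
        byte[] out = new byte[(vectorSize + 7) >>> 3];
        for (int i = 0; i < vectorSize; i++) {
            if (bits.get(i)) {
                out[i >> 3] |= (byte) (1 << (i & 7));
            }
        }
        return out;
    }

    /** Candidate: one iteration per byte, reading the little-endian byte view of the words. */
    static byte[] serializeWords(long[] words, int vectorSize) {
        byte[] out = new byte[(vectorSize + 7) >>> 3];
        for (int b = 0; b < out.length; b++) {
            out[b] = (byte) (words[b >> 3] >>> ((b & 7) << 3));
        }
        return out;
    }

    /** Sets the same random bits in both representations and compares the serialized bytes. */
    static boolean matches(int vectorSize, long seed) {
        Random rnd = new Random(seed);
        BitSet oracle = new BitSet(vectorSize);
        long[] words = new long[(vectorSize + 63) >>> 6];
        for (int n = 0; n <= vectorSize / 2; n++) {
            int i = rnd.nextInt(vectorSize);
            oracle.set(i);
            words[i >> 6] |= 1L << (i & 63);
        }
        return Arrays.equals(serializeBitByBit(oracle, vectorSize),
                             serializeWords(words, vectorSize));
    }

    public static void main(String[] args) {
        // Word and byte boundary sizes listed in the PR description.
        int[] sizes = {63, 64, 65, 127, 128, 1000, 43133};
        for (int size : sizes) {
            if (!matches(size, 42L)) {
                throw new AssertionError("layout mismatch at vectorSize=" + size);
            }
        }
        System.out.println("all sizes match");
    }
}
```

Sizes just below, at, and just above a multiple of 64 are the interesting cases, because they exercise a partially filled last word and a partially filled last byte.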

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-agent hudi-agent left a comment

Contributor

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR swaps the BitSet backing InternalBloomFilter for a fixed long[] word array to cut per-probe and per-serialization overhead on the bloom filter hot paths, while keeping the serialized byte layout unchanged. I traced the byte↔word mapping in write/readFields, the bounds of add/membershipTest, and the unused-bit clearing across the boundary vector sizes, and the format-compatibility invariant holds. The differential tests against a BitSet oracle and the golden-string checks give good coverage. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Jul 2, 2026

hudi-bot commented Jul 2, 2026

Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@hudi-agent hudi-agent left a comment

Contributor

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR swaps the BitSet in InternalBloomFilter for a fixed-size long[] word array to eliminate per-probe BitSet bookkeeping and bit-by-bit serialization on the bloom filter hot paths (HFile writes, parquet footer reads, MDT bloom partition). I traced the serialized byte layout (it's the little-endian word view, so write/readFields stay byte-for-byte identical), the h[i] ∈ [0, vectorSize) hash bound that keeps add/membershipTest in-bounds and preserves the unused-bits-zero invariant, the clearUnusedBits handling on the readFields/not paths, and the word-wise and/or/xor ops — the serde format looks preserved and the oracle/golden-string tests guard it well. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@voonhous voonhous merged commit 3697001 into apache:master Jul 3, 2026
72 of 74 checks passed

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eliminate java.util.BitSet overhead in bloom filter add and serde hot paths

5 participants