perf(common): replace BitSet with a fixed word array in the ported bloom filter by yihua · Pull Request #19140 · apache/hudi

yihua · 2026-07-02T06:45:03Z

Describe the issue this Pull Request addresses

Fixes #19139 (part of #14367).

InternalBloomFilter, the port of Hadoop's BloomFilter whose serialized bytes must stay compatible in both directions, kept its bit vector in a java.util.BitSet and converted it to and from bytes one bit at a time. Every add paid per-probe BitSet bookkeeping (expandTo/ensureCapacity/wordsInUse), and every serialization or deserialization looped bit-by-bit over vectorSize (millions of iterations per filter). These paths run for every record key added during HFile base file writing in metadata table compaction, every parquet footer read by the bloom index, and every metadata table bloom filter partition record. CPU profiling of metadata table compaction attributed about 10% of executor CPU to BitSet.ensureCapacity under BloomFilter.add.

Summary and Changelog

InternalBloomFilter: store the bit vector in a fixed-size long[] word array. Bit i lives at words[i >> 6] under mask 1L << (i & 63); the serialized layout (bit i at byte i >> 3 under mask 1 << (i & 7)) is the little-endian byte view of the array, so write/readFields translate by byte position alone and the serialized bytes are unchanged.
add/membershipTest become one mask operation per probe with no growth or bookkeeping logic; and/or/xor/not become word-wise loops; write/readFields pack and unpack one byte per iteration instead of one bit.
Bits at positions at or beyond vectorSize are kept zero (clearUnusedBits), preserving the previous reader's behavior of ignoring unused trailing bits in the last serialized byte.
SimpleBloomFilter and HoodieDynamicBoundedBloomFilter compose InternalBloomFilter and inherit the change without modification.
New TestInternalBloomFilter:
- differential tests against a java.util.BitSet oracle replicating the previous layout (serialized bits, membership on present and absent keys, and/or/xor/not) across word and byte boundary sizes (63/64/65/127/128/1000/43133 bits, up to 30 hash functions);
- write/readFields round trips at boundary sizes;
- unused-trailing-bits semantics (mutated trailing bits are ignored and normalized on re-serialization);
- golden serializeToString fixtures for SIMPLE and DYNAMIC_V0 filters captured from the previous implementation, asserted byte-identical and re-deserializable through BloomFilterFactory.fromString.
New InternalBloomFilterBenchmark: manual microbenchmark for adds, membership tests, and serde. The class name does not match the surefire patterns so it never runs in CI; run it explicitly with mvn test -pl hudi-common -Dtest=InternalBloomFilterBenchmark -Dsurefire.failIfNoSpecifiedTests=false.
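
The layout described in the changelog can be sketched roughly as follows. This is an illustrative sketch, not the code in the PR: the class name `WordBitVectorSketch` and the `toBytes`/`fromBytes` helpers are hypothetical, and it covers only the bit-vector bytes, not any header fields that `write`/`readFields` may emit around them. It shows why byte `b` of the serialized form is just an 8-bit slice of word `b >> 3`, and why a word-wise `not` has to be followed by clearing the padding bits.

```java
import java.util.BitSet;

// Hypothetical sketch of the layout described above; not Hudi's InternalBloomFilter.
final class WordBitVectorSketch {
  private final int vectorSize; // number of addressable bits
  private final long[] words;   // bit i lives at words[i >> 6], mask 1L << (i & 63)

  WordBitVectorSketch(int vectorSize) {
    this.vectorSize = vectorSize;
    this.words = new long[(vectorSize + 63) >>> 6];
  }

  // Callers are assumed to pass 0 <= i < vectorSize.
  void set(int i) {
    words[i >> 6] |= 1L << (i & 63);
  }

  boolean get(int i) {
    return (words[i >> 6] & (1L << (i & 63))) != 0;
  }

  // Word-wise NOT also flips the padding bits, so they are cleared afterwards.
  void not() {
    for (int w = 0; w < words.length; w++) {
      words[w] = ~words[w];
    }
    clearUnusedBits();
  }

  // Keeps bits at positions >= vectorSize at zero.
  void clearUnusedBits() {
    int tail = vectorSize & 63;
    if (tail != 0) {
      words[words.length - 1] &= (1L << tail) - 1;
    }
  }

  // Serialized layout: bit i at byte i >> 3 under mask 1 << (i & 7),
  // i.e. the little-endian byte view of the word array.
  byte[] toBytes() {
    byte[] out = new byte[(vectorSize + 7) >>> 3];
    for (int b = 0; b < out.length; b++) {
      out[b] = (byte) (words[b >> 3] >>> ((b & 7) << 3));
    }
    return out;
  }

  static WordBitVectorSketch fromBytes(int vectorSize, byte[] in) {
    int expected = (vectorSize + 7) >>> 3;
    if (in.length != expected) {
      throw new IllegalArgumentException("expected " + expected + " bytes, got " + in.length);
    }
    WordBitVectorSketch v = new WordBitVectorSketch(vectorSize);
    for (int b = 0; b < in.length; b++) {
      v.words[b >> 3] |= (in[b] & 0xFFL) << ((b & 7) << 3);
    }
    v.clearUnusedBits(); // ignore unused trailing bits in the last byte
    return v;
  }

  public static void main(String[] args) {
    // Differential check against a java.util.BitSet oracle at a word-boundary size.
    int size = 65;
    WordBitVectorSketch v = new WordBitVectorSketch(size);
    BitSet oracle = new BitSet(size);
    for (int i : new int[] {0, 7, 8, 63, 64}) {
      v.set(i);
      oracle.set(i);
    }
    byte[] bytes = v.toBytes();
    for (int i = 0; i < size; i++) {
      boolean inBytes = (bytes[i >> 3] & (1 << (i & 7))) != 0;
      if (inBytes != oracle.get(i)) {
        throw new AssertionError("mismatch at bit " + i);
      }
    }
    System.out.println("byte layout agrees with the BitSet oracle for " + size + " bits");
  }
}
```

The sketch packs one byte per loop iteration, which is the granularity the changelog describes for `write`/`readFields`; the differential loop in `main` mirrors, in miniature, the oracle-style tests listed above.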

Impact

Serialized bloom filter bytes are unchanged in both directions: golden fixtures captured before the change pass unchanged, and every false-positive count in the benchmark below is identical before and after (identical membership semantics).

InternalBloomFilterBenchmark results on JDK 11 / Apple Silicon, medians of 3 measured rounds after warmup, this branch vs its base commit on master:

| Scenario | Metric | BitSet (master) | long[] (this PR) | Change |
| --- | --- | --- | --- | --- |
| SIMPLE, 1M entries, fpp 1e-3 | 1M adds | 159 ms | 139 ms | 1.14x |
| | serialize / deserialize | 59 / 66 ms | 2 / 2 ms | ~30x |
| SIMPLE, 10M entries, fpp 1e-9 | 10M adds | 8439 ms | 6151 ms | 1.37x |
| | serialize / deserialize | 594 / 708 ms | 55 / 62 ms | ~11x |
| DYNAMIC_V0, 60K entries, fpp 1e-9, max 100K | 10M adds | 4130 ms | 4056 ms | neutral |
| | serialize / deserialize | 6 / 8 ms | under 1 ms | >6x |

Adds improve most where the bit vector is large and the probe count is high (fpp 1e-9 means about 30 probes per key). The third scenario mirrors the current metadata table HFile writer defaults; there the bounded dynamic filter saturates (99.6% false positives on absent keys in the benchmark), so adds are dominated by hashing into tiny cache-resident rows. Right-sizing that filter is the follow-up tracked in #19139.
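
As a rough check on the "about 30 probes per key" figure, the textbook Bloom filter sizing formulas can be evaluated directly. The sketch below uses the standard optimal-parameter formulas (bits m = -n ln p / (ln 2)^2, hashes k = -log2 p); it is an assumption that these approximate the sizing Hudi applies, and the class and method names are hypothetical.

```java
// Hypothetical sketch using the textbook optimal Bloom filter formulas;
// not taken from Hudi's own sizing code.
final class BloomSizingSketch {
  // Optimal bit count for n entries at false-positive probability p.
  static long optimalBits(long entries, double fpp) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(entries * -Math.log(fpp) / (ln2 * ln2));
  }

  // Optimal number of hash probes per key: -log2(p), rounded to the nearest integer.
  static int optimalHashes(double fpp) {
    return (int) Math.round(-Math.log(fpp) / Math.log(2));
  }

  public static void main(String[] args) {
    double[] fpps = {1e-3, 1e-9};
    for (double fpp : fpps) {
      System.out.println("fpp=" + fpp
          + " probes=" + optimalHashes(fpp)
          + " bits per 1M entries=" + optimalBits(1_000_000L, fpp));
    }
  }
}
```

Under these formulas fpp 1e-9 gives 30 probes and fpp 1e-3 gives 10, and a 1M-entry filter at fpp 1e-3 needs roughly 14 million bits, which is consistent with a bit-by-bit serialization loop running millions of iterations per filter.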

Risk Level

low

Byte-level equivalence is pinned by the oracle differential tests and the pre-change golden fixtures. TestInternalBloomFilter, TestInternalDynamicBloomFilter, and the HFile suites that write and read blooms through real files (TestHoodieHFileReaderWriter in hudi-hadoop-common plus the hudi-io and hudi-common HFile tests) all pass.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR swaps the BitSet backing InternalBloomFilter for a fixed long[] word array to cut per-probe and per-serialization overhead on the bloom filter hot paths, while keeping the serialized byte layout unchanged. I traced the byte↔word mapping in write/readFields, the bounds of add/membershipTest, and the unused-bit clearing across the boundary vector sizes, and the format-compatibility invariant holds. The differential tests against a BitSet oracle and the golden-string checks give good coverage. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

hudi-bot · 2026-07-02T09:30:09Z

CI report:

4c0594d Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR swaps the BitSet in InternalBloomFilter for a fixed-size long[] word array to eliminate per-probe BitSet bookkeeping and bit-by-bit serialization on the bloom filter hot paths (HFile writes, parquet footer reads, MDT bloom partition). I traced the serialized byte layout (it's the little-endian word view, so write/readFields stay byte-for-byte identical), the h[i] ∈ [0, vectorSize) hash bound that keeps add/membershipTest in-bounds and preserves the unused-bits-zero invariant, the clearUnusedBits handling on the readFields/not paths, and the word-wise and/or/xor ops — the serde format looks preserved and the oracle/golden-string tests guard it well. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

perf(common): replace BitSet with a fixed word array in the ported bl…

2eb0c60

…oom filter

hudi-agent reviewed Jul 2, 2026

Add manual microbenchmark for bloom filter hot paths

4c0594d

wombatu-kun approved these changes Jul 2, 2026

github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) Jul 2, 2026

hudi-agent reviewed Jul 2, 2026

voonhous merged commit 3697001 into apache:master Jul 3, 2026
72 of 74 checks passed

voonhous merged 2 commits into apache:master from yihua:bloom-filter-word-bits
