Skip to content

[MINOR] Add MDT-partition-shape parameterized HBase-read compat test#19080

Draft
shangxinli wants to merge 2 commits into
apache:masterfrom
shangxinli:shangx/hudi-cross-version-hfile-compat
Draft

[MINOR] Add MDT-partition-shape parameterized HBase-read compat test#19080
shangxinli wants to merge 2 commits into
apache:masterfrom
shangxinli:shangx/hudi-cross-version-hfile-compat

Conversation

@shangxinli

@shangxinli shangxinli commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

Stacked on top of #19071 and #19083, which already cover the generic HBase 2.4.x reader → native multi-block HFile path (point lookup, full scan, seekBefore, byte-parity, NONE + GZIP). Please land those first.

What is still uncovered after #19071 / #19083: those tests exercise one generic multi-block file shape. The Metadata Table writes files with five distinct key/value sizing profiles (small key + medium value, record-key-shaped, long composite key, small key + 64 KB value, etc.). A future change to HFileWriterImpl that only perturbs the trailer index sizing for, say, the RECORD_INDEX or BLOOM_FILTERS shape would still pass the generic tests — the data-block / meta-block index in the trailer is laid out per-file based on actual key and value sizes.

Summary and Changelog

Test-only. Adds one parameterized test, TestHudiHFileMdtHbaseReadCompatibility, that — for each of the five MDT-realistic shapes — writes ~5,000 records (≥ 5 data blocks at the default 64 KB block size) with the native writer, reopens with HBase 2.4.x HFile.createReader, and asserts:

  • the trailer parses (no CorruptHFileException),
  • getEntries() matches the count written,
  • a full forward scan returns every key in the order written.

Shapes covered:

Shape Key Value
FILES 32 B 96 B
COLUMN_STATS 32 B 256 B
RECORD_INDEX 40 B 24 B
SECONDARY_INDEX 120 B 24 B
BLOOM_FILTERS 32 B 64 KB

No production code changes. New test file: hudi-io/src/test/java/org/apache/hudi/io/hfile/TestHudiHFileMdtHbaseReadCompatibility.java.

Impact

None on runtime behavior. CI runs ~2 seconds longer in hudi-io.

Risk Level

none — test-only.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable (this PR is the test)
  • CI passed

Xinli Shang added 2 commits June 26, 2026 15:53
The existing TestHFileCompatibility#testHbaseReaderSucceedsWhenKeyValueVersionIsSetTo1
covers the single-data-block happy path with 5 records, which does not
stress the load-on-open data structures (data-block index, meta-block
index) that HFileWriterImpl's trailer points at.

Add a parameterized cross-version compat test that writes 5,000 records
across 5 key/value shapes representative of the Metadata Table (MDT)
partitions (FILES, COLUMN_STATS, RECORD_INDEX, SECONDARY_INDEX,
BLOOM_FILTERS), reopens each file with HBase 2.4.x HFile.createReader,
and asserts:

- trailer parses (no CorruptHFileException),
- getEntries() matches the count written,
- a full forward scan returns every key in order.

A future trailer-layout drift in HFileWriterImpl (field reorder, missed
field, width change) would fail at least one assertion across these
shapes, before the change reaches MDT files in production.

No production code changes; test-only.

Signed-off-by: Xinli Shang <shangxinli@apache.org>
Three issues found by second-pass review on the previous commit:

1. BLOOM_FILTERS shape had keyLen=24 with a prefix of "BLOOM_FILTERS::"
   (15 bytes), leaving only 9 bytes for the 10-digit idx after the prefix.
   The earlier truncation path silently dropped the last digit, producing
   10 identical keys per group of consecutive indices. Bump to keyLen=32
   and harden the generator: it now throws IllegalArgumentException when
   keyLen is too small for prefix + idx instead of truncating.

2. assertNotNull(scanner.seekTo()) was a no-op guard; seekTo() returns a
   primitive (boolean in HBase 2.4.13), autoboxed to a non-null wrapper
   in all cases. Replace with assertTrue(scanner.seekTo(), ...).

3. The redundant assertDoesNotThrow(() -> HFile.createReader(...))
   block opened a Reader that was never closed and immediately reopened
   one in the try-with-resources below. Remove it; the try-with-resources
   already proves trailer parse success.

Signed-off-by: Xinli Shang <shangxinli@apache.org>
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 26, 2026
@shangxinli

Copy link
Copy Markdown
Contributor Author

FYI — cherry-picked internally at Uber to gate our 0.14 → 1.2 cutover ahead of the OSS review cycle: uber-code/data-hoodie_oss#273. Any feedback here will be back-ported.

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shangxinli Thanks for putting this up! I enhanced the HFile writer and tests in #19071 and #19083. Could you check if that have already covered your changes?

@shangxinli shangxinli changed the title [MINOR] Add multi-block HBase-reader compat test for native HFile writer [MINOR] Add MDT-partition-shape parameterized HBase-read compat test Jun 29, 2026
@shangxinli

Copy link
Copy Markdown
Contributor Author

Thanks @yihua#19071 and #19083 cover the generic HBase-reader → native multi-block path (point lookup, seekBefore, full scan, byte-parity, NONE+GZIP), which is the bulk of what this PR was originally guarding.

The piece still not covered there: those tests exercise a single generic multi-block file. MDT writes files with five distinct key/value sizing profiles (FILES, COLUMN_STATS, RECORD_INDEX, SECONDARY_INDEX, BLOOM_FILTERS) and the trailer index layout is per-file based on actual key/value sizes — a future writer change that only perturbs one shape's sizing would slip past the generic tests.

I've rescoped this PR to just that parameterized MDT-shape sweep, stacked on #19083. Happy to fold it into #19083 directly if you'd rather, just say the word and I'll close this and send a patch there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants