fix(streamer): write non-partitioned sample-writes table for record-size estimation by yihua · Pull Request #19115 · apache/hudi

yihua · 2026-06-30T04:30:01Z

Describe the issue this Pull Request addresses

When hoodie.streamer.sample.writes.enabled is on, the first batch's sample write estimates the average record size by writing a sample to the auxiliary .hoodie/.aux/.sample_writes table. The sample bulk insert runs with parallelism 1, but it routes each record to a file by the record's partition path. When the incoming batch spans many source partitions, the write fans out into one tiny file per partition, which slows the sample write and inflates the per-record size estimate through per-file metadata overhead.

Summary and Changelog

Rewrite each sampled record with an empty partition path before the sample bulk insert, so the auxiliary sample-writes table is effectively non-partitioned and all sampled records land in a single file. The sample write is faster and the record-size estimate is more accurate (one parquet footer/dictionary amortized over all records instead of one per source partition).

SparkSampleWritesUtils.doSampleWrites: map each record to newInstance(new HoodieKey(recordKey, "")) before the bulk insert.
TestSparkSampleWritesUtils: assert the sample-writes folder has no source-partition subdirectories, and add a test with 20 source partitions. The empty-table estimate adjusts from ~779 to ~337 to reflect the single-file layout.
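
To illustrate the changelog item above, here is a minimal sketch of the flattening step. `Key` and `Rec` are hypothetical stand-ins written for this example; they are not Hudi's `HoodieKey` or `HoodieRecord`, and the sketch only assumes that a writer routing by partition path needs at least one file per distinct path.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Stand-in types for illustration only; these are NOT Hudi's HoodieKey or HoodieRecord.
final class SampleFlattenSketch {

  /** Minimal stand-in for a key: record key plus partition path. */
  static final class Key {
    final String recordKey;
    final String partitionPath;

    Key(String recordKey, String partitionPath) {
      this.recordKey = recordKey;
      this.partitionPath = partitionPath;
    }
  }

  /** Minimal stand-in for a record: a key and an opaque payload. */
  static final class Rec {
    final Key key;
    final String payload;

    Rec(Key key, String payload) {
      this.key = key;
      this.payload = payload;
    }

    /** Same payload under a different key, mirroring the idea of newInstance(key). */
    Rec newInstance(Key newKey) {
      return new Rec(newKey, payload);
    }
  }

  /** Rewrites every record to the empty partition path; record key and payload are kept. */
  static List<Rec> flatten(List<Rec> records) {
    return records.stream()
        .map(r -> r.newInstance(new Key(r.key.recordKey, "")))
        .collect(Collectors.toList());
  }

  /** A writer that routes by partition path needs at least one file per distinct path. */
  static long distinctPartitionPaths(List<Rec> records) {
    return records.stream().map(r -> r.key.partitionPath).distinct().count();
  }

  public static void main(String[] args) {
    List<Rec> batch = Arrays.asList(
        new Rec(new Key("k1", "2026/06/28"), "a"),
        new Rec(new Key("k2", "2026/06/29"), "b"),
        new Rec(new Key("k3", "2026/06/30"), "c"));
    System.out.println("before: " + distinctPartitionPaths(batch)); // prints "before: 3"
    System.out.println("after: " + distinctPartitionPaths(flatten(batch))); // prints "after: 1"
  }
}
```

Only the partition path changes; the record key and payload pass through untouched, which is the property the review later confirms for the real `newInstance(HoodieKey)` call.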

Impact

No public API or config change. Faster and more accurate first-batch record-size estimation for partitioned sources that use Hudi Streamer sample writes.

Risk Level

low

The change only affects the throwaway sample-writes table used for size estimation. The estimate is derived from write-stat byte and record counts (partition-agnostic), so flattening the layout does not change correctness. Covered by unit tests including a 20-partition case.
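
The arithmetic behind this can be sketched as follows. The numbers are invented for illustration (they are not the ~779 and ~337 figures from the tests), and the model assumes a fixed per-file overhead, which is a simplification of real parquet footers and dictionaries.

```java
// Illustrative model only: estimate = total bytes written / total records written.
final class SampleEstimateSketch {

  /** Average bytes per record from aggregate write stats. */
  static long averageBytesPerRecord(long totalBytesWritten, long totalRecordsWritten) {
    if (totalRecordsWritten <= 0) {
      throw new IllegalArgumentException("totalRecordsWritten must be positive");
    }
    return totalBytesWritten / totalRecordsWritten;
  }

  /** Total bytes for a layout: data bytes plus an assumed fixed overhead per file. */
  static long totalBytes(long records, long dataBytesPerRecord, long files, long overheadPerFile) {
    return records * dataBytesPerRecord + files * overheadPerFile;
  }

  public static void main(String[] args) {
    long records = 1_000;          // sampled records (hypothetical)
    long dataBytesPerRecord = 300; // payload bytes per record (hypothetical)
    long overheadPerFile = 20_000; // footer and dictionary bytes per file (hypothetical)

    long fannedOut = averageBytesPerRecord(
        totalBytes(records, dataBytesPerRecord, 20, overheadPerFile), records);
    long singleFile = averageBytesPerRecord(
        totalBytes(records, dataBytesPerRecord, 1, overheadPerFile), records);

    System.out.println(fannedOut);  // prints 700
    System.out.println(singleFile); // prints 320
  }
}
```

The formula itself never looks at partition paths; the layout only matters through the number of files, and therefore the amount of per-file overhead, that ends up in the byte total.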

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…ize estimation

The first-batch sample write now rewrites each record with an empty partition path so the auxiliary sample-writes table is non-partitioned. Previously, when the incoming batch spanned many source partitions, the bulk insert (run with parallelism 1) emitted at least one tiny file per partition path, which slowed the sample write and inflated the per-record size estimate through per-file metadata overhead. Writing a single non-partitioned file yields a faster sample write and a more accurate estimate.

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR rewrites each sampled record with an empty partition path so the auxiliary sample-writes table is written non-partitioned, which makes the first-batch record-size estimate faster and avoids per-source-partition file fan-out. The mechanics look correct: newInstance(HoodieKey) preserves the payload, operation, and ordering value while only swapping the key. One point on estimation semantics is worth double-checking; see the inline comments. There are also a couple of minor suggestions: prefer the existing SAMPLE_WRITES_FOLDER_PATH constant over a hardcoded path string, and simplify the assertion message slightly. After that, this should be ready for a Hudi committer or PMC member to take it from here.

hudi-agent · 2026-06-30T04:45:32Z

+        // partitions fans out into many tiny files even though parallelism is 1, slowing
+        // the sample write and inflating per-file metadata in the size estimate.
+        List<HoodieRecord> samples = records.coalesce(1).take(size).stream()
+            .map(r -> r.newInstance(new HoodieKey(r.getRecordKey(), "")))


🤖 Collapsing the sample to a single non-partitioned file maximizes parquet footer/dictionary amortization, which is why the estimate drops so much (~779 → ~337). But the real target table is partitioned, so its per-record on-disk size sits somewhere between these depending on how many records each partition actually accumulates per commit. Since this estimate drives copyOnWriteRecordSizeEstimate (file bin-packing), could the single-file layout under-estimate and lead to larger-than-target files for sources whose partitions stay sparse? Curious whether you considered keeping a bounded number of partitions rather than fully flattening to one.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}

Impact is bounded to the first commit: getWriteConfigWithRecordSizeEstimate early-returns once the timeline is non-empty (SparkSampleWritesUtils.java:69), and from the next commit on AverageRecordSizeEstimator.averageBytesPerRecord recomputes from real partitioned commit stats, using this config only as the empty-timeline fallback (AverageRecordSizeEstimator.java:70-105). That estimator also subtracts a per-file metadata estimate and skips below-threshold commits (AverageRecordSizeEstimator.java:86,90,93), so the target semantic is data bytes per record, not footer overhead - which is what the single-file sample now measures. A sparse-partition under-estimate self-corrects after the first commit, and a bounded-partition layout would partly reintroduce the footer overhead this removes.
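
The fallback behaviour described in this reply can be sketched roughly as below. This is not the code of AverageRecordSizeEstimator; the class, field names, and the exact threshold rule are assumptions made for the example, based only on the description above (use real commit stats when available, subtract a per-file metadata estimate, skip commits that are too small, otherwise fall back to the configured estimate).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; names and threshold semantics are assumptions, not Hudi's API.
final class RecordSizeFallbackSketch {

  /** Aggregate stats of one completed commit (stand-in type). */
  static final class CommitStats {
    final long totalBytes;
    final long totalRecords;
    final long fileCount;

    CommitStats(long totalBytes, long totalRecords, long fileCount) {
      this.totalBytes = totalBytes;
      this.totalRecords = totalRecords;
      this.fileCount = fileCount;
    }
  }

  /**
   * Returns data bytes per record from the newest usable commit,
   * or the configured fallback when the timeline has no usable commit.
   */
  static long estimate(List<CommitStats> commitsNewestFirst,
                       long fallbackEstimate,
                       long perFileMetadataBytes,
                       long minDataBytes) {
    for (CommitStats c : commitsNewestFirst) {
      // Remove the assumed per-file metadata so only data bytes are averaged.
      long dataBytes = c.totalBytes - c.fileCount * perFileMetadataBytes;
      if (c.totalRecords > 0 && dataBytes > minDataBytes) {
        return dataBytes / c.totalRecords;
      }
      // Otherwise the commit is too small to trust; try the next older one.
    }
    return fallbackEstimate;
  }

  public static void main(String[] args) {
    long fallback = 337; // first-commit estimate from the sample write

    // Empty timeline: the sample-write estimate is the only input.
    System.out.println(estimate(Collections.emptyList(), fallback, 10_000, 50_000)); // prints 337

    // Later: a tiny commit is skipped, then a real commit drives the estimate.
    List<CommitStats> timeline = Arrays.asList(
        new CommitStats(30_000, 50, 2),
        new CommitStats(1_000_000, 2_000, 10));
    System.out.println(estimate(timeline, fallback, 10_000, 50_000)); // prints 450
  }
}
```

Under these assumptions the sample-write estimate only matters while the timeline is empty, which is why a first-commit under-estimate would be corrected by the next commit.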

hudi-agent · 2026-06-30T04:45:32Z

+   * subdirectories named after the source partition paths.
+   */
+  private void assertSampleWritesNonPartitioned() throws IOException {
+    Path sampleWritesPath = new Path(basePath(), ".hoodie/.aux/.sample_writes");


🤖 nit: could you use SAMPLE_WRITES_FOLDER_PATH here instead of the hardcoded string? The constant is already imported in SparkSampleWritesUtils; if the path changes, this assertion would silently walk the wrong directory.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-06-30T04:45:32Z

+      }
+      assertTrue(partitionDirs.isEmpty(),
+          "Sample-writes run at " + run.getPath() + " should have no source partition subdirectories, but found: "
+              + Arrays.toString(partitionDirs.toArray()));


🤖 nit: partitionDirs.toString() gives the same [a, b, ...] output without the unnecessary toArray() roundtrip — could you simplify?

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}
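
A quick check of the equivalence claimed in this nit, assuming partitionDirs is a standard java.util.List (the element type in the actual test is not shown here, so String is used as a stand-in):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListToStringSketch {

  /** The form used in the assertion message under review. */
  static String viaArray(List<String> dirs) {
    return Arrays.toString(dirs.toArray());
  }

  /** The simplified form suggested in the review. */
  static String direct(List<String> dirs) {
    return dirs.toString();
  }

  public static void main(String[] args) {
    List<String> dirs = new ArrayList<>(Arrays.asList("2026/06/29", "2026/06/30"));
    System.out.println(viaArray(dirs)); // prints [2026/06/29, 2026/06/30]
    System.out.println(direct(dirs));   // prints [2026/06/29, 2026/06/30]
  }
}
```

Both forms call toString() on each element and join with ", " inside square brackets, so the message text does not change.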

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR rewrites each sampled record with an empty partition path so the auxiliary sample-writes table is non-partitioned, making the first-batch record-size estimate faster and less skewed by per-file metadata overhead. I traced the record newInstance semantics, the record-based bulk-insert partitioning (which routes by the record key's partition path, confirming the flattening works), and the estimate math; the change looks functionally sound, and I found no correctness issues. The inline comments carry one minor maintainability suggestion about a hardcoded path in the test. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

hudi-agent · 2026-06-30T06:52:34Z

+   * sample write was not flattened into a single non-partitioned file.
+   */
+  private void assertSampleWritesNonPartitioned() throws IOException {
+    Path sampleWritesPath = new Path(basePath(), ".hoodie/.aux/.sample_writes");


🤖 nit: could you build this from the SAMPLE_WRITES_FOLDER_PATH constant instead of hardcoding .hoodie/.aux/.sample_writes? Keeps the test from silently breaking if that path ever changes.

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}
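
As a rough sketch of the shape such a check could take, the example below resolves the folder from a single constant and lists unexpected subdirectories. It uses java.nio.file on the local filesystem rather than Hadoop's Path and FileSystem, the constant is a local stand-in carrying the value quoted in the diff, and treating every non-dot directory under a run folder as a partition directory is an assumption of this example, as is the run folder name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class NonPartitionedCheckSketch {

  // Local stand-in; the real test would reference the existing constant instead.
  static final String SAMPLE_WRITES_FOLDER_PATH = ".hoodie/.aux/.sample_writes";

  /** Names of non-dot subdirectories found under each sample-write run folder. */
  static List<String> partitionDirs(Path basePath) {
    Path sampleWrites = basePath.resolve(SAMPLE_WRITES_FOLDER_PATH);
    List<String> found = new ArrayList<>();
    try (Stream<Path> runs = Files.list(sampleWrites)) {
      List<Path> runDirs = runs.filter(Files::isDirectory).collect(Collectors.toList());
      for (Path run : runDirs) {
        try (Stream<Path> children = Files.list(run)) {
          children.filter(Files::isDirectory)
              .map(p -> p.getFileName().toString())
              .filter(name -> !name.startsWith(".")) // skip metadata folders such as .hoodie
              .forEach(found::add);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return found;
  }

  /** Builds a throwaway layout in a temp directory and runs the check on it. */
  static List<String> demo(boolean addPartitionDir) {
    try {
      Path base = Files.createTempDirectory("sample-writes-sketch");
      Path run = base.resolve(SAMPLE_WRITES_FOLDER_PATH).resolve("run-1"); // hypothetical run name
      Files.createDirectories(run.resolve(".hoodie"));
      Files.createFile(run.resolve("data.parquet"));
      if (addPartitionDir) {
        Files.createDirectories(run.resolve("2026"));
      }
      return partitionDirs(base);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo(false)); // prints []
    System.out.println(demo(true));  // prints [2026]
  }
}
```

Because the folder location comes from one constant, a future change to that path would move the check along with it instead of leaving it to walk a stale directory.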

hudi-bot · 2026-06-30T08:09:45Z

CI report:

4ec295b Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

lokeshj1703

@yihua Thanks for working on this! Apart from minor comments from agent, the changes look good to me.

hudi-agent reviewed Jun 30, 2026

github-actions bot added the size:S label (PR with lines of changes in (10, 100]) on Jun 30, 2026

yihua added 2 commits June 29, 2026 21:52

Simplify code comments and drop pre-fix references in tests

05bd495

Make sample-writes test methods package-private

4ec295b

hudi-agent reviewed Jun 30, 2026

lokeshj1703 approved these changes Jul 1, 2026

yihua wants to merge 3 commits into apache:master from yihua:fix-sample-write-non-partitioned

5 participants
