feat(writer): Switch default to native log format for table version >… #19118

Draft
cshuo wants to merge 13 commits into apache:master from cshuo:native_log_as_default

cshuo commented Jun 30, 2026

Collaborator

…= 10

Describe the issue this Pull Request addresses

Table version 10 should use the native v2 log format by default for Parquet MOR writes, but the writer selection logic was still tied to the LSM-tree storage layout. That made native log writing depend on storage layout instead of the effective write version and base file format.

This PR switches the write-path decision to the effective table write version, adds the missing version-10 write-config support, and hardens native log read/delete handling so that the Spark, Flink, and Hive realtime read paths work consistently with native data and delete logs.

Summary and Changelog

  • Adds HoodieTableVersion.TEN as a supported write version in HoodieWriteConfig.
  • Introduces CommonClientUtils.shouldWriteNativeLogFormat, which enables native v2 log writes for Parquet tables when the effective write version is >= 10.
  • Replaces LSM-layout checks with the new native-log decision helper across Spark, Flink, append, compaction, CDC, and log-compaction write paths.
  • Updates native log reader/block handling so native delete logs can be consumed by both file-group-reader and legacy scanner paths, including delete record materialization, partition-path handling, ordering fields, and schema parsing from native log metadata.
  • Writes native log footer metadata for delete files and avoids collecting file-format metadata when column-stats metadata indexing is disabled.
  • Adds coverage for Spark SQL native-log writes, Flink native-log writes, Hive realtime reads over native logs, common native log format reads, and native-log decision logic.
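
The decision rule described above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the real helper takes a HoodieWriteConfig, whereas the `BaseFileFormat` enum, the `NativeLogDecision` class, and the two-argument signature here are hypothetical stand-ins so the snippet compiles on its own.

```java
// Hypothetical stand-in for the table's base file format; not Hudi's actual type.
enum BaseFileFormat { PARQUET, ORC, HFILE }

// Hypothetical holder class; the PR places the real helper in CommonClientUtils.
final class NativeLogDecision {
    // Table version from which native v2 log writes are the default, per the description above.
    static final int NATIVE_LOG_MIN_VERSION = 10;

    private NativeLogDecision() {
    }

    // Native log writes apply only to Parquet tables whose effective write version is >= 10.
    // Storage layout (for example LSM-tree) is deliberately not an input to the decision.
    static boolean shouldWriteNativeLogFormat(int effectiveWriteVersion, BaseFileFormat baseFileFormat) {
        return baseFileFormat == BaseFileFormat.PARQUET
            && effectiveWriteVersion >= NATIVE_LOG_MIN_VERSION;
    }
}
```

Under these assumptions, a version-10 Parquet writer selects the native format, while a version-9 Parquet writer or a version-10 ORC writer stays on the legacy inline log format.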

Impact

  • Functional impact: Parquet MOR writers targeting table version 10 now default to native v2 log files instead of legacy inline log files. Lower write versions and non-Parquet base file formats continue to avoid native log writes.
  • Maintainability: Centralizes native-log write eligibility in CommonClientUtils.shouldWriteNativeLogFormat, removing duplicated storage-layout checks from engine-specific writer paths.
  • Extensibility: Decouples native log format selection from LSM-tree layout, making future table-version and format-gated log behavior easier to evolve.

Risk Level

Medium. This changes the default log format for table-version-10 Parquet MOR writes and touches shared Spark, Flink, compaction, and common log reader code. The PR mitigates the risk with new Spark, Flink, Hive realtime, common log-format, and utility tests covering native data/delete log write and read behavior.

Documentation Update

Document that Parquet MOR writes targeting table version 10 use the native v2 log format by default, while lower write versions and non-Parquet base file formats continue using the legacy inline log format.
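
For the documentation, a writer configuration that would fall under the new default might look like the sketch below. The option keys are written from memory of existing Hudi configuration names and should be checked against the config reference before publishing; the `NativeLogWriterProps` class and its method are hypothetical and exist only to make the example self-contained.

```java
import java.util.Properties;

// Hypothetical helper; not part of Hudi or of this PR.
final class NativeLogWriterProps {
    private NativeLogWriterProps() {
    }

    // Builds writer options for a Parquet MOR table targeting table version 10.
    // Key names are assumptions to verify against the Hudi configuration reference.
    static Properties mergeOnReadVersionTen() {
        Properties props = new Properties();
        props.setProperty("hoodie.datasource.write.table.type", "MERGE_ON_READ");
        props.setProperty("hoodie.table.base.file.format", "PARQUET");
        props.setProperty("hoodie.write.table.version", "10");
        return props;
    }
}
```

With a lower write version or a non-Parquet base file format in these options, the writer would be expected to keep using the legacy inline log format.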

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

cshuo marked this pull request as draft on June 30, 2026 08:28
cshuo force-pushed the native_log_as_default branch 2 times, most recently from 47076a9 to 14d4c60, on July 1, 2026 01:41
cshuo changed the title from "[WIP] feat(writer): Switch default to native log format for table version >…" to "feat(writer): Switch default to native log format for table version >…" on Jul 1, 2026
github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label on Jul 1, 2026
cshuo force-pushed the native_log_as_default branch from 14d4c60 to e0e79f1 on July 1, 2026 08:32
*
* @param writeConfig the writer configuration.
*/
public static boolean shouldWriteNativeLogFormat(HoodieWriteConfig writeConfig) {

Contributor

Rename to shouldWriteNativeLogs.

return getBlockType().isDataOrDeleteBlock();
}

protected HoodieSchema getSchemaFromHeader() {

Contributor

We have HoodieAvroSchemaCache now; use it here instead.

public <T> List<BufferedRecord<T>> getRecordsToDelete(HoodieReaderContext<T> readerContext) {
return Arrays.stream(getRecordsToDelete())
.map(deleteRecord -> BufferedRecords.fromDeleteRecord(deleteRecord, recordContext))
.map(deleteRecord -> BufferedRecords.fromDeleteRecord(deleteRecord, readerContext.getRecordContext()))

Contributor

Why this change?

throw new HoodieNotSupportedException("Native delete log files do not support the legacy DeleteRecord[] API. "
+ "Use getRecordsToDelete(RecordContext) instead. Log file: " + logFile);
if (recordsToDelete == null) {
recordsToDelete = readRecordsToDelete();

Contributor

Why are entries from native delete logs read as delete records?

cshuo force-pushed the native_log_as_default branch from 367d6af to 52b0284 on July 2, 2026 02:29
github-actions bot added the size:XL (PR with lines of changes > 1000) label and removed the size:L (PR with lines of changes in (300, 1000]) label on Jul 2, 2026
hudi-bot commented Jul 2, 2026

Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels

size:XL PR with lines of changes > 1000

3 participants