fix: Handle map/array-nested leaf columns in column stats collection during MOR log-append#19126

Open
vinishjail97 wants to merge 2 commits into apache:master from vinishjail97:fix-colstats-map-array-value-navigation

@vinishjail97

Contributor

Describe the issue this Pull Request addresses

Closes #19125

Building the column-stats index on a Merge-on-Read (MOR) table at table version 9 (timeline layout V2) crashes during the inline log-append upsert when the column-stats column list contains a leaf nested inside a MAP or ARRAY (e.g. my_map.key_value.value, my_array.list.element):

Caused by: java.lang.IllegalStateException: Cannot get field from schema type: MAP
	at org.apache.hudi.common.schema.HoodieSchema.getField(HoodieSchema.java:1172)
	at org.apache.hudi.avro.AvroRecordContext.getFieldValueFromIndexedRecord(AvroRecordContext.java:90)
	at org.apache.hudi.common.model.HoodieAvroIndexedRecord.getColumnValueAsJava(HoodieAvroIndexedRecord.java:193)
	at org.apache.hudi.metadata.HoodieTableMetadataUtil.collectColumnRangeFieldValueV2(HoodieTableMetadataUtil.java:354)
	at org.apache.hudi.io.HoodieInlineLogAppendHandle.collectColumnStats(HoodieInlineLogAppendHandle.java:182)

Summary and Changelog

#17694 added column stats for primitives nested inside MAP/ARRAY by teaching the schema-side navigator (HoodieSchema.getNestedField) and the base-file (Parquet) path to resolve the synthetic accessors .key_value.key, .key_value.value and .list.element. Such leaves therefore pass isColumnTypeSupported (the resolved leaf is a scalar).

The column-stats V2 record path used by MOR inline log-append (HoodieTableMetadataUtil.collectColumnRangeFieldValueV2 -> HoodieRecord.getColumnValueAsJava -> AvroRecordContext.getFieldValueFromIndexedRecord) was never taught the same navigation. It splits the field path on "." and assumes every segment is a RECORD field, so when it reaches the MAP/ARRAY schema it calls HoodieSchema.getField(...) on it, which throws IllegalStateException: Cannot get field from schema type: MAP.
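The failure mode described above can be reproduced in miniature. The sketch below is not Hudi code: `SchemaNode`, `Type`, and `strictWalk` are hypothetical stand-ins that only model the idea of a `getField` that is legal on RECORD nodes and a walker that treats every path segment as a record field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StrictSchemaWalkSketch {

  enum Type { RECORD, MAP, ARRAY, INT }

  /** Hypothetical, heavily simplified stand-in for a schema node. */
  static final class SchemaNode {
    final Type type;
    final Map<String, SchemaNode> fields = new LinkedHashMap<>();

    SchemaNode(Type type) {
      this.type = type;
    }

    SchemaNode field(String name, SchemaNode child) {
      fields.put(name, child);
      return this;
    }

    /** Only RECORD nodes expose named fields; any other type is an illegal state. */
    SchemaNode getField(String name) {
      if (type != Type.RECORD) {
        throw new IllegalStateException("Cannot get field from schema type: " + type);
      }
      return fields.get(name);
    }
  }

  /** Splits the path on '.' and treats every segment as a RECORD field lookup. */
  static SchemaNode strictWalk(SchemaNode root, String path) {
    SchemaNode current = root;
    for (String segment : path.split("\\.")) {
      current = current.getField(segment);
    }
    return current;
  }

  public static void main(String[] args) {
    SchemaNode root = new SchemaNode(Type.RECORD)
        .field("id", new SchemaNode(Type.INT))
        .field("my_map", new SchemaNode(Type.MAP));

    System.out.println(strictWalk(root, "id").type); // prints "INT"
    try {
      // The second segment is looked up on the MAP node, which has no fields.
      strictWalk(root, "my_map.key_value.value");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints "Cannot get field from schema type: MAP"
    }
  }
}
```

The walker succeeds on the first segment (`my_map` is a field of the root record) and only fails on the synthetic `key_value` segment, which is the same point at which the stack trace above shows the real navigator failing.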

The V1 path (table version 6) does not hit this: HoodieAvroUtils.getNestedFieldVal returns null when a path segment is not a record. Only the V2 record navigator was left asymmetric.

This PR makes AvroRecordContext.getFieldValueFromIndexedRecord return null when a path segment cannot be resolved as a plain RECORD field, instead of throwing:

  • Return null when the current schema does not support fields (a MAP/ARRAY/other non-record intermediate) or when the intermediate value is null, mirroring HoodieAvroUtils.getNestedFieldVal.
  • Guard the intermediate downcast so a non-record value (Java Map/List) degrades to null rather than throwing ClassCastException.
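The two guards above can be sketched as follows. This is an illustration under stated assumptions, not the patched method: `SchemaNode`, `Rec`, and `resolve` are hypothetical stand-ins for the schema, the Avro record, and the navigator, and a missing field simply yields null here.

```java
import java.util.HashMap;
import java.util.Map;

final class NullTolerantWalkSketch {

  /** Hypothetical stand-in for a schema node; only record nodes carry named fields. */
  static final class SchemaNode {
    final boolean isRecord;
    final Map<String, SchemaNode> fields = new HashMap<>();

    SchemaNode(boolean isRecord) {
      this.isRecord = isRecord;
    }

    SchemaNode field(String name, SchemaNode child) {
      fields.put(name, child);
      return this;
    }

    boolean hasFields() {
      return isRecord;
    }
  }

  /** Hypothetical stand-in for a record value: field name -> value. */
  static final class Rec {
    final Map<String, Object> values = new HashMap<>();

    Rec set(String name, Object value) {
      values.put(name, value);
      return this;
    }
  }

  static Object resolve(Rec root, SchemaNode rootSchema, String path) {
    if (path == null || path.isEmpty()) {
      throw new IllegalArgumentException("Field name must not be empty");
    }
    Rec currentRecord = root;
    SchemaNode currentSchema = rootSchema;
    String[] segments = path.split("\\.");
    for (int i = 0; i < segments.length; i++) {
      // Guard 1: a null intermediate or a non-record schema (MAP/ARRAY/scalar) has no fields.
      if (currentRecord == null || currentSchema == null || !currentSchema.hasFields()) {
        return null;
      }
      Object value = currentRecord.values.get(segments[i]);
      if (i == segments.length - 1) {
        return value;
      }
      // Guard 2: only descend when the value really is a record; Map/List degrade to null.
      currentRecord = value instanceof Rec ? (Rec) value : null;
      currentSchema = currentSchema.fields.get(segments[i]);
    }
    return null; // only reached when the path has no segments, e.g. "."
  }

  public static void main(String[] args) {
    SchemaNode schema = new SchemaNode(true)
        .field("nested", new SchemaNode(true).field("id", new SchemaNode(false)))
        .field("my_map", new SchemaNode(false));
    Map<String, Integer> tags = new HashMap<>();
    tags.put("a", 1);
    Rec record = new Rec().set("nested", new Rec().set("id", 42)).set("my_map", tags);

    System.out.println(resolve(record, schema, "nested.id"));              // prints "42"
    System.out.println(resolve(record, schema, "my_map.key_value.value")); // prints "null"
  }
}
```

Valid record-nested paths still resolve to their leaf value, while a path that descends through a map (or through a null intermediate) comes back as null instead of raising an exception.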

A map/array leaf is multi-valued per record and has no single value to fold into a min/max, so returning null (no stats from the record path) is correct and safe; statistics for such leaves are still collected from the base-file (Parquet) footer path added in #17694.
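To illustrate why a null from the record path is harmless for a range statistic, here is a minimal sketch of a min/max fold that skips null inputs. `Range` and `accept` are hypothetical names chosen for this example; how Hudi's own column-range accumulation treats nulls is not shown in this PR and is not modeled here.

```java
final class MinMaxFoldSketch {

  /** Hypothetical accumulator: folds one comparable value per record into a min/max pair. */
  static final class Range<T extends Comparable<T>> {
    private T min;
    private T max;

    void accept(T value) {
      if (value == null) {
        return; // nothing to fold, e.g. a map/array leaf that resolved to null
      }
      if (min == null || value.compareTo(min) < 0) {
        min = value;
      }
      if (max == null || value.compareTo(max) > 0) {
        max = value;
      }
    }

    T min() {
      return min;
    }

    T max() {
      return max;
    }
  }

  public static void main(String[] args) {
    Range<Integer> range = new Range<>();
    range.accept(5);
    range.accept(null); // skipped
    range.accept(2);
    range.accept(9);
    System.out.println(range.min() + " .. " + range.max()); // prints "2 .. 9"
  }
}
```

A fold like this needs exactly one comparable value per record, which a multi-valued map or array leaf cannot supply; skipping the record leaves the range unchanged rather than corrupting it.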

Impact

MOR tables at table version 9 with column stats configured on map/array-nested leaves can now be upserted without crashing on log-append. No change for scalar or record-nested columns. Behavior now matches table version 6.

Risk Level

low. The change is scoped to path segments that cannot be resolved as record fields, which previously threw. Covered by new unit tests in TestAvroRecordContext and validated end-to-end via Apache XTable's ITHudiConversionSource (MOR source at table version 9 with column stats on map/array leaves): the upsert that previously crashed with Cannot get field from schema type: MAP now succeeds.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…tion

Collecting column stats for a primitive nested inside a MAP or ARRAY
(e.g. `my_map.key_value.value`, `my_array.list.element`) crashed the MOR
inline log-append path at table version 9:

    IllegalStateException: Cannot get field from schema type: MAP
      at HoodieSchema.getField
      at AvroRecordContext.getFieldValueFromIndexedRecord
      at HoodieAvroIndexedRecord.getColumnValueAsJava
      at HoodieTableMetadataUtil.collectColumnRangeFieldValueV2
      at HoodieInlineLogAppendHandle.collectColumnStats

apache#17694 taught the schema-side navigator (HoodieSchema.getNestedField) and
the base-file (Parquet) path to resolve the Parquet-style `.key_value.key`,
`.key_value.value` and `.list.element` synthetic accessors, so such leaves
pass isColumnTypeSupported. The record value-side navigator
(AvroRecordContext.getFieldValueFromIndexedRecord), used by the column-stats
V2 collection path (collectColumnRangeFieldValueV2 -> getColumnValueAsJava),
was never updated: it assumes every path segment is a RECORD field and calls
HoodieSchema.getField on the MAP/ARRAY schema, which throws.

Make the value navigator return null when a path segment cannot be resolved
as a plain RECORD field (a MAP/ARRAY intermediate or a null intermediate),
instead of throwing. This mirrors HoodieAvroUtils.getNestedFieldVal, which is
why the V1 path at table version 6 already tolerated these paths. A map/array
leaf is multi-valued per record and has no single value to fold into a
min/max, so returning null (no stats from the record path) is correct;
statistics for such leaves are still collected from the base-file (Parquet)
footer path added in apache#17694.

Also guard the intermediate downcast so a non-record value (Java Map/List)
degrades to null rather than throwing ClassCastException.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jul 1, 2026

@hudi-agent hudi-agent left a comment

Contributor

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR fixes a crash in the column-stats V2 record navigator (AvroRecordContext.getFieldValueFromIndexedRecord) when a col-stats leaf is nested inside a MAP/ARRAY during MOR inline log-append, by returning null (graceful degradation) instead of throwing on non-record intermediate schemas. The change correctly mirrors the V1 HoodieAvroUtils.getNestedFieldVal null-on-non-record behavior, also hardens against null intermediates and non-IndexedRecord values that previously would NPE/ClassCastException, and preserves the up-front empty-field-name rejection. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming nit on the test schema constant; the production-code changes are clean.

cc @yihua

assertThrows(IllegalArgumentException.class, () -> getFieldValueFromIndexedRecord(record, ""));
}

private static final Schema COMPLEX_SCHEMA = new Schema.Parser().parse(

Contributor

🤖 nit: COMPLEX_SCHEMA is a bit ambiguous — "complex" could mean Avro complex types, or just a complicated schema. Something like MAP_AND_ARRAY_SCHEMA would immediately signal what's being covered here and make the constant self-documenting when a future test author scans for it.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

Contributor Author

Renamed to MAP_AND_ARRAY_SCHEMA in 1a78da7.

@vinishjail97 vinishjail97 changed the title from "[HUDI] Fix column stats crash on map/array-nested leaves during MOR log-append (table version 9)" to "fix: Handle map/array-nested leaf columns in column stats collection during MOR log-append" on Jul 1, 2026
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vinishjail97 added a commit to vinishjail97/onetable that referenced this pull request Jul 1, 2026
… versions 6 and 9

Flip HudiTargetConfig.DEFAULT_TABLE_VERSION from NINE to SIX so the
default output stays readable by released Hudi readers; version 9
remains fully supported via xtable.hudi.target.table_version=9.

Parameterize the Hudi test suites so every run exercises both table
versions instead of only the default:
- ITHudiConversionTarget: partitioned x {SIX, NINE} via a MethodSource
  cross-product; the target client sets the version through
  HudiTargetConfig.HUDI_TABLE_VERSION.
- ITHudiConversionSource: source tables are created at {SIX, NINE} via
  the table-type/partition MethodSource cross-products, and the
  parameterized tests write through TestSparkHudiTable (Spark writer)
  instead of the Java client.
- ITConversionController: combinations targeting HUDI are emitted once
  per version; getTableSyncConfig gained an overload that applies the
  version to the Hudi target properties.
- TestHudiTargetConfig/TestHudiConversionTarget assert against
  DEFAULT_TABLE_VERSION instead of a hard-coded version.

Version 9 source coverage in ITHudiConversionSource depends on two
Hudi fixes validated against a locally patched 1.3.0-SNAPSHOT:
apache/hudi#19126 (column stats on map/array-nested leaves during MOR
log-append) and the savepoint backlog fix in the previous commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@hudi-agent hudi-agent left a comment

Contributor

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR teaches the V2 column-stats value navigator (AvroRecordContext.getFieldValueFromIndexedRecord) to gracefully return null when a field path descends through a MAP/ARRAY synthetic accessor or hits a null intermediate value, instead of throwing IllegalStateException during MOR inline log-append column-stats collection. I traced the map, array, deep-nested, and null-intermediate paths — the new currentRecord == null || !currentSchema.hasFields() guard and the null-safe IndexedRecord cast correctly cover the previously-throwing cases without breaking valid nested-record resolution, and the behavior now matches the V1 getNestedFieldVal contract. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@hudi-bot

hudi-bot commented Jul 2, 2026

Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels

size:S PR with lines of changes in (10, 100]

Development

Successfully merging this pull request may close these issues.

Column stats on MOR table version 9 crashes on map/array-nested leaf columns during log-append

3 participants