fix: Handle map/array-nested leaf columns in column stats collection during MOR log-append#19126
Conversation
…tion
Collecting column stats for a primitive nested inside a MAP or ARRAY
(e.g. `my_map.key_value.value`, `my_array.list.element`) crashed the MOR
inline log-append path at table version 9:
IllegalStateException: Cannot get field from schema type: MAP
at HoodieSchema.getField
at AvroRecordContext.getFieldValueFromIndexedRecord
at HoodieAvroIndexedRecord.getColumnValueAsJava
at HoodieTableMetadataUtil.collectColumnRangeFieldValueV2
at HoodieInlineLogAppendHandle.collectColumnStats
apache#17694 taught the schema-side navigator (HoodieSchema.getNestedField) and
the base-file (Parquet) path to resolve the Parquet-style `.key_value.key`,
`.key_value.value` and `.list.element` synthetic accessors, so such leaves
pass isColumnTypeSupported. The record value-side navigator
(AvroRecordContext.getFieldValueFromIndexedRecord), used by the column-stats
V2 collection path (collectColumnRangeFieldValueV2 -> getColumnValueAsJava),
was never updated: it assumes every path segment is a RECORD field and calls
HoodieSchema.getField on the MAP/ARRAY schema, which throws.
Make the value navigator return null when a path segment cannot be resolved
as a plain RECORD field (a MAP/ARRAY intermediate or a null intermediate),
instead of throwing. This mirrors HoodieAvroUtils.getNestedFieldVal, which is
why the V1 path at table version 6 already tolerated these paths. A map/array
leaf is multi-valued per record and has no single value to fold into a
min/max, so returning null (no stats from the record path) is correct;
statistics for such leaves are still collected from the base-file (Parquet)
footer path added in apache#17694.
Also guard the intermediate downcast so a non-record value (java Map/List)
degrades to null rather than throwing ClassCastException.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR fixes a crash in the column-stats V2 record navigator (AvroRecordContext.getFieldValueFromIndexedRecord) when a col-stats leaf is nested inside a MAP/ARRAY during MOR inline log-append, by returning null (graceful degradation) instead of throwing on non-record intermediate schemas. The change correctly mirrors the V1 HoodieAvroUtils.getNestedFieldVal null-on-non-record behavior, also hardens against null intermediates and non-IndexedRecord values that previously would NPE/ClassCastException, and preserves the up-front empty-field-name rejection. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming nit on the test schema constant; the production-code changes are clean.
cc @yihua
| assertThrows(IllegalArgumentException.class, () -> getFieldValueFromIndexedRecord(record, "")); | ||
| } | ||
|
|
||
| private static final Schema COMPLEX_SCHEMA = new Schema.Parser().parse( |
There was a problem hiding this comment.
🤖 nit: COMPLEX_SCHEMA is a bit ambiguous — "complex" could mean Avro complex types, or just a complicated schema. Something like MAP_AND_ARRAY_SCHEMA would immediately signal what's being covered here and make the constant self-documenting when a future test author scans for it.
There was a problem hiding this comment.
Renamed to MAP_AND_ARRAY_SCHEMA in 1a78da7.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… versions 6 and 9
Flip HudiTargetConfig.DEFAULT_TABLE_VERSION from NINE to SIX so the
default output stays readable by released Hudi readers; version 9
remains fully supported via xtable.hudi.target.table_version=9.
Parameterize the Hudi test suites so every run exercises both table
versions instead of only the default:
- ITHudiConversionTarget: partitioned x {SIX, NINE} via a MethodSource
cross-product; the target client sets the version through
HudiTargetConfig.HUDI_TABLE_VERSION.
- ITHudiConversionSource: source tables are created at {SIX, NINE} via
the table-type/partition MethodSource cross-products, and the
parameterized tests write through TestSparkHudiTable (Spark writer)
instead of the Java client.
- ITConversionController: combinations targeting HUDI are emitted once
per version; getTableSyncConfig gained an overload that applies the
version to the Hudi target properties.
- TestHudiTargetConfig/TestHudiConversionTarget assert against
DEFAULT_TABLE_VERSION instead of a hard-coded version.
Version 9 source coverage in ITHudiConversionSource depends on two
Hudi fixes validated against a locally patched 1.3.0-SNAPSHOT:
apache/hudi#19126 (column stats on map/array-nested leaves during MOR
log-append) and the savepoint backlog fix in the previous commit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR teaches the V2 column-stats value navigator (AvroRecordContext.getFieldValueFromIndexedRecord) to gracefully return null when a field path descends through a MAP/ARRAY synthetic accessor or hits a null intermediate value, instead of throwing IllegalStateException during MOR inline log-append column-stats collection. I traced the map, array, deep-nested, and null-intermediate paths — the new currentRecord == null || !currentSchema.hasFields() guard and the null-safe IndexedRecord cast correctly cover the previously-throwing cases without breaking valid nested-record resolution, and the behavior now matches the V1 getNestedFieldVal contract. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Describe the issue this Pull Request addresses
Closes #19125
Building the column-stats index on a table version 9 (timeline layout V2) Merge-on-Read table crashes during the inline log-append upsert when the col-stats column list contains a leaf nested inside a
MAPorARRAY(e.g.my_map.key_value.value,my_array.list.element):Summary and Changelog
#17694 added column stats for primitives nested inside
MAP/ARRAYby teaching the schema-side navigator (HoodieSchema.getNestedField) and the base-file (Parquet) path to resolve the synthetic accessors.key_value.key,.key_value.valueand.list.element. Such leaves therefore passisColumnTypeSupported(the resolved leaf is a scalar).The column-stats V2 record path used by MOR inline log-append (
HoodieTableMetadataUtil.collectColumnRangeFieldValueV2->HoodieRecord.getColumnValueAsJava->AvroRecordContext.getFieldValueFromIndexedRecord) was never taught the same navigation. It splits the field path on.and assumes every segment is aRECORDfield, so when it reaches theMAP/ARRAYschema it callsHoodieSchema.getField(...)on it, which throwsIllegalStateException: Cannot get field from schema type: MAP.The V1 path (table version 6) does not hit this:
HoodieAvroUtils.getNestedFieldValreturnsnullwhen a path segment is not a record. Only the V2 record navigator was left asymmetric.This PR makes
AvroRecordContext.getFieldValueFromIndexedRecordreturnnullwhen a path segment cannot be resolved as a plainRECORDfield, instead of throwing:nullwhen the current schema does not support fields (aMAP/ARRAY/other non-record intermediate) or when the intermediate value isnull, mirroringHoodieAvroUtils.getNestedFieldVal.Map/List) degrades tonullrather than throwingClassCastException.A map/array leaf is multi-valued per record and has no single value to fold into a min/max, so returning
null(no stats from the record path) is correct and safe; statistics for such leaves are still collected from the base-file (Parquet) footer path added in #17694.Impact
MOR tables at table version 9 with column stats configured on map/array-nested leaves can be upserted (no longer crash on log-append). No change for scalar or record-nested columns. Behavior now matches table version 6.
Risk Level
low. The change is scoped to path segments that cannot be resolved as record fields, which previously threw. Covered by new unit tests in
TestAvroRecordContextand validated end-to-end via Apache XTable'sITHudiConversionSource(MOR source at table version 9 with column stats on map/array leaves): the upsert that previously crashed withCannot get field from schema type: MAPnow succeeds.Documentation Update
none
Contributor's checklist