Describe the problem you faced
Reading a Hudi table that is partitioned on a nested column (a partition field whose name is a dotted path, e.g. nested_record.level) fails on the batch read path with:
org.apache.hudi.internal.schema.HoodieSchemaException: Illegal character in: nested_record.level
at org.apache.hudi.HoodieSchemaConversionUtils$.convertStructTypeToHoodieSchema(HoodieSchemaConversionUtils.scala:143)
at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues(HoodieFileGroupReaderBasedFileFormat.scala:269)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(...)
HoodieFileGroupReaderBasedFileFormat is the only batch reader, so the table cannot be read at all once it is partitioned on a nested field.
Root cause
When a partition field is "mandatory", HoodieFileGroupReaderBasedFileFormat#buildReaderWithPartitionValues adds it to the StructType it converts into a HoodieSchema (a top-level Avro field) so it can be read from the data file. For a nested partition column the field name is a dotted path (nested_record.level), which is not a valid Avro field name, so convertStructTypeToHoodieSchema throws before the read can start.
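To make the failure concrete, here is a minimal sketch of the naming rule from the Avro specification (a name starts with a letter or underscore and continues with letters, digits, or underscores). The class `AvroNameCheck` and its method are hypothetical and only illustrate the rule; they are not the validation code that Avro or Hudi actually run.

```java
import java.util.regex.Pattern;

// Illustrative only: mirrors the name rule in the Avro specification.
// AvroNameCheck is a hypothetical helper, not part of Avro or Hudi.
final class AvroNameCheck {
    // Avro names start with [A-Za-z_] and continue with [A-Za-z0-9_].
    private static final Pattern AVRO_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The root field alone is a legal Avro field name.
        System.out.println(isValidAvroName("nested_record"));
        // The dotted partition path is not, because '.' is outside the allowed set.
        System.out.println(isValidAvroName("nested_record.level"));
    }
}
```

Under this rule the root field nested_record passes while the dotted path nested_record.level does not, which matches the "Illegal character in" message in the stack trace.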
A nested partition column is never a flat top-level column in the data file: its value is materialized from the partition path, and its root field (nested_record) is already read via the normal data schema. So it should not be converted into a top-level Avro field at all. Hudi already takes the root-level field name for nested mandatory columns elsewhere (HoodieBaseRelation#appendMandatoryColumns via HoodieAvroUtils.getRootLevelFieldName); the file-group reader path does not.
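The root-level lookup described above can be sketched as follows. This is an illustration of the idea only: `RootFieldSketch.rootLevelFieldName` is a hypothetical helper written for this issue, not the body of HoodieAvroUtils.getRootLevelFieldName, and it assumes the nested path uses '.' as its separator.

```java
// Hypothetical helper: sketches the "take the root field" idea,
// not the actual implementation in HoodieAvroUtils.
final class RootFieldSketch {
    // Returns the segment before the first '.', or the whole name if there is no '.'.
    static String rootLevelFieldName(String fieldPath) {
        int dot = fieldPath.indexOf('.');
        return dot < 0 ? fieldPath : fieldPath.substring(0, dot);
    }

    public static void main(String[] args) {
        // Nested partition path resolves to its root field.
        System.out.println(rootLevelFieldName("nested_record.level"));
        // A flat column name is returned unchanged.
        System.out.println(rootLevelFieldName("level"));
    }
}
```

Applying the same reduction on the file-group reader path would yield nested_record, which is a valid top-level field name and is already part of the data schema.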
To Reproduce
- Write a COW table partitioned on a nested column (partition field path nested_record.level).
- Run spark.read.format("hudi").load(path).filter("nested_record.level = 'INFO'") (or any read that makes the partition column mandatory).
- The query fails with HoodieSchemaException: Illegal character in: nested_record.level.
This surfaces in Apache XTable's HUDI -> ICEBERG conversion when the source Hudi table is partitioned on a nested column.
Expected behavior
Reading a table partitioned on a nested column should succeed; the partition value should be materialized from the partition path like any other appended partition field.
Environment Description
- Hudi version: 1.x (master)
- Spark version: 3.4
- Running on Docker? : no