test(spark): cover the legacy parquet read path with file-group reader disabled#19133
test(spark): cover the legacy parquet read path with file-group reader disabled#19133yihua wants to merge 1 commit into
Conversation
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds TestLegacyParquetReadPath, a COW functional suite that exercises the previously-uncovered legacy Spark read path (BaseFileOnlyRelation, IncrementalRelationV1/V2, and the per-version legacy parquet file format) by constructing the legacy relations directly with the file-group reader disabled and comparing results row-by-row against the file-group-reader path plus independently computed expected values. I traced the expected counts for the snapshot, filter-pushdown, and incremental cases and they hold up, and the per-branch coverage (vectorized/non-vectorized, relation vs. file-format conversion, V1 vs. V2) looks well-targeted. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small maintainability nit below — the magic positional index in collectSorted; everything else looks clean and well-commented.
cc @yihua
| private def collectSorted(df: DataFrame): Seq[Row] = | ||
| df.select(comparedCols.map(col): _*).collect().toSeq.sortBy(_.getString(2).toInt) | ||
|
|
||
| private def assertSameRows(expected: DataFrame, actual: DataFrame): Unit = { |
There was a problem hiding this comment.
🤖 nit: the magic index 2 silently couples this sort to the current position of "id" in comparedCols — could you use _.getAs[String]("id").toInt instead so it stays correct if comparedCols is ever reordered?
| spark = null | ||
| } | ||
|
|
||
| private def makeRows(ids: Seq[Int], ts: Long, valueFn: Int => Long): Seq[(String, Long, Long, String)] = |
There was a problem hiding this comment.
The row shape is entirely scalar (id, ts, value, partition), so no case here exercises the legacy parquet format's vectorized nested-column read - historically the most fragile branch of these per-version copies (the nested-column vectorized-read bug fixed in HUDI-7190 / #10265). A regression re-introduced there would still ship silently, which is exactly the blind spot this suite is meant to close. Consider adding a struct (or array) column to the schema so the vectorized nested path is covered too; the existing cases only hit flat columns.
Describe the issue this Pull Request addresses
The legacy (pre-file-group-reader) Spark read path has zero automated coverage: the per-Spark-version
SparkXXLegacyHoodieParquetFileFormatcopies (~245 lines each across the six version modules) andBaseFileOnlyRelation,IncrementalRelationV1,IncrementalRelationV2,IncrementalRelationUtilinhudi-spark-commonall report 0% on Codecov. These classes still run in production:DefaultSource#resolveBaseFileOnlyRelationusesBaseFileOnlyRelationfor metadata-table reads, andHoodieStreamSourceV1/V2pickIncrementalRelationV1/V2whenhoodie.file.group.reader.enabledis false. A regression there currently ships silently.Note on routing: the batch
DefaultSource.createRelationpath no longer consultshoodie.file.group.reader.enabledfor normal snapshot/incremental reads; the legacy relations are reachable via the streaming sources and metadata-table reads. The tests therefore construct the legacy relations directly with the flag disabled in their options, matching the streaming-source invocation.Summary and Changelog
Add
TestLegacyParquetReadPath(hudi-spark functional suite, COW, <=30 rows per table) asserting sorted row-level equality between the legacy path and the file-group-reader path on the same table, plus independently computed expected values so both paths cannot be equally wrong:BaseFileOnlyRelation.buildScanand itstoHadoopFsRelationconversion (asserting the relation executes the legacy Hudi parquet file format).spark.sql.parquet.enableVectorizedReader=false.hoodie.datasource.read.extract.partition.values.from.path.IncrementalRelationV1(instant-time start) andIncrementalRelationV2(completion-time start,OPEN_CLOSED), compared against the file-group-reader incremental result.Since the suite lives in the shared
hudi-sparkmodule functional package, each CI matrix leg exercises its own Spark version's legacy file format copy.Impact
Test-only; no production code change. Closes the coverage blind spot on the legacy read path (six version-module file format copies plus the legacy relations in
hudi-spark-common).Risk Level
low
Documentation Update
None.
Contributor's checklist