test(spark): cover the legacy parquet read path with file-group reader disabled by yihua · Pull Request #19133 · apache/hudi

yihua · 2026-07-02T02:16:08Z

Describe the issue this Pull Request addresses

The legacy (pre-file-group-reader) Spark read path has zero automated coverage: the per-Spark-version SparkXXLegacyHoodieParquetFileFormat copies (~245 lines each across the six version modules) and BaseFileOnlyRelation, IncrementalRelationV1, IncrementalRelationV2, IncrementalRelationUtil in hudi-spark-common all report 0% on Codecov. These classes still run in production: DefaultSource#resolveBaseFileOnlyRelation uses BaseFileOnlyRelation for metadata-table reads, and HoodieStreamSourceV1/V2 pick IncrementalRelationV1/V2 when hoodie.file.group.reader.enabled is false. A regression there currently ships silently.

Note on routing: the batch DefaultSource.createRelation path no longer consults hoodie.file.group.reader.enabled for normal snapshot/incremental reads; the legacy relations are reachable via the streaming sources and metadata-table reads. The tests therefore construct the legacy relations directly with the flag disabled in their options, matching the streaming-source invocation.

Summary and Changelog

Add TestLegacyParquetReadPath (hudi-spark functional suite, COW, <=30 rows per table) asserting sorted row-level equality between the legacy path and the file-group-reader path on the same table, plus independently computed expected values so both paths cannot be equally wrong:

COW snapshot read after insert + upsert, through both BaseFileOnlyRelation.buildScan and its toHadoopFsRelation conversion (asserting the relation executes the legacy Hudi parquet file format).
The non-vectorized (row-based) branch via spark.sql.parquet.enableVectorizedReader=false.
The append-partition-values branch via hive-style partitions and hoodie.datasource.read.extract.partition.values.from.path.
Partition + data filter pushdown through both legacy scan paths.
Incremental query across 3 commits through IncrementalRelationV1 (instant-time start) and IncrementalRelationV2 (completion-time start, OPEN_CLOSED), compared against the file-group-reader incremental result.

Since the suite lives in the shared hudi-spark module functional package, each CI matrix leg exercises its own Spark version's legacy file format copy.

Impact

Test-only; no production code change. Closes the coverage blind spot on the legacy read path (six version-module file format copies plus the legacy relations in hudi-spark-common).

Risk Level

low

Documentation Update

None.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…r disabled

hudi-agent

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds TestLegacyParquetReadPath, a COW functional suite that exercises the previously-uncovered legacy Spark read path (BaseFileOnlyRelation, IncrementalRelationV1/V2, and the per-version legacy parquet file format) by constructing the legacy relations directly with the file-group reader disabled and comparing results row-by-row against the file-group-reader path plus independently computed expected values. I traced the expected counts for the snapshot, filter-pushdown, and incremental cases and they hold up, and the per-branch coverage (vectorized/non-vectorized, relation vs. file-format conversion, V1 vs. V2) looks well-targeted. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small maintainability nit below — the magic positional index in collectSorted; everything else looks clean and well-commented.

cc @yihua

hudi-agent · 2026-07-02T02:28:26Z

+  private def collectSorted(df: DataFrame): Seq[Row] =
+    df.select(comparedCols.map(col): _*).collect().toSeq.sortBy(_.getString(2).toInt)
+
+  private def assertSameRows(expected: DataFrame, actual: DataFrame): Unit = {


🤖 nit: the magic index 2 silently couples this sort to the current position of "id" in comparedCols — could you use _.getAs[String]("id").toInt instead so it stays correct if comparedCols is ever reordered?

_{⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.}

wombatu-kun · 2026-07-02T03:56:02Z

+    spark = null
+  }
+
+  private def makeRows(ids: Seq[Int], ts: Long, valueFn: Int => Long): Seq[(String, Long, Long, String)] =


The row shape is entirely scalar (id, ts, value, partition), so no case here exercises the legacy parquet format's vectorized nested-column read - historically the most fragile branch of these per-version copies (the nested-column vectorized-read bug fixed in HUDI-7190 / #10265). A regression re-introduced there would still ship silently, which is exactly the blind spot this suite is meant to close. Consider adding a struct (or array) column to the schema so the vectorized nested path is covered too; the existing cases only hit flat columns.

hudi-bot · 2026-07-02T03:56:39Z

CI report:

a6c73de Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

test(spark): cover the legacy parquet read path with file-group reade…

a6c73de

…r disabled

hudi-agent reviewed Jul 2, 2026

View reviewed changes

wombatu-kun reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(spark): cover the legacy parquet read path with file-group reader disabled#19133

test(spark): cover the legacy parquet read path with file-group reader disabled#19133
yihua wants to merge 1 commit into
apache:masterfrom
yihua:legacy-parquet-read-path-tests

yihua commented Jul 2, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

hudi-agent Jul 2, 2026

Uh oh!

wombatu-kun Jul 2, 2026

Uh oh!

hudi-bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

yihua commented Jul 2, 2026

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

wombatu-kun Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-bot commented Jul 2, 2026

CI report:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants