Skip to content

test(spark): cover the legacy parquet read path with file-group reader disabled#19133

Open
yihua wants to merge 1 commit into
apache:masterfrom
yihua:legacy-parquet-read-path-tests
Open

test(spark): cover the legacy parquet read path with file-group reader disabled#19133
yihua wants to merge 1 commit into
apache:masterfrom
yihua:legacy-parquet-read-path-tests

Conversation

@yihua

@yihua yihua commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

The legacy (pre-file-group-reader) Spark read path has zero automated coverage: the per-Spark-version SparkXXLegacyHoodieParquetFileFormat copies (~245 lines each across the six version modules) and BaseFileOnlyRelation, IncrementalRelationV1, IncrementalRelationV2, IncrementalRelationUtil in hudi-spark-common all report 0% on Codecov. These classes still run in production: DefaultSource#resolveBaseFileOnlyRelation uses BaseFileOnlyRelation for metadata-table reads, and HoodieStreamSourceV1/V2 pick IncrementalRelationV1/V2 when hoodie.file.group.reader.enabled is false. A regression there currently ships silently.

Note on routing: the batch DefaultSource.createRelation path no longer consults hoodie.file.group.reader.enabled for normal snapshot/incremental reads; the legacy relations are reachable via the streaming sources and metadata-table reads. The tests therefore construct the legacy relations directly with the flag disabled in their options, matching the streaming-source invocation.

Summary and Changelog

Add TestLegacyParquetReadPath (hudi-spark functional suite, COW, <=30 rows per table) asserting sorted row-level equality between the legacy path and the file-group-reader path on the same table, plus independently computed expected values so both paths cannot be equally wrong:

  • COW snapshot read after insert + upsert, through both BaseFileOnlyRelation.buildScan and its toHadoopFsRelation conversion (asserting the relation executes the legacy Hudi parquet file format).
  • The non-vectorized (row-based) branch via spark.sql.parquet.enableVectorizedReader=false.
  • The append-partition-values branch via hive-style partitions and hoodie.datasource.read.extract.partition.values.from.path.
  • Partition + data filter pushdown through both legacy scan paths.
  • Incremental query across 3 commits through IncrementalRelationV1 (instant-time start) and IncrementalRelationV2 (completion-time start, OPEN_CLOSED), compared against the file-group-reader incremental result.

Since the suite lives in the shared hudi-spark module functional package, each CI matrix leg exercises its own Spark version's legacy file format copy.

Impact

Test-only; no production code change. Closes the coverage blind spot on the legacy read path (six version-module file format copies plus the legacy relations in hudi-spark-common).

Risk Level

low

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds TestLegacyParquetReadPath, a COW functional suite that exercises the previously-uncovered legacy Spark read path (BaseFileOnlyRelation, IncrementalRelationV1/V2, and the per-version legacy parquet file format) by constructing the legacy relations directly with the file-group reader disabled and comparing results row-by-row against the file-group-reader path plus independently computed expected values. I traced the expected counts for the snapshot, filter-pushdown, and incremental cases and they hold up, and the per-branch coverage (vectorized/non-vectorized, relation vs. file-format conversion, V1 vs. V2) looks well-targeted. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small maintainability nit below — the magic positional index in collectSorted; everything else looks clean and well-commented.

cc @yihua

private def collectSorted(df: DataFrame): Seq[Row] =
df.select(comparedCols.map(col): _*).collect().toSeq.sortBy(_.getString(2).toInt)

private def assertSameRows(expected: DataFrame, actual: DataFrame): Unit = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the magic index 2 silently couples this sort to the current position of "id" in comparedCols — could you use _.getAs[String]("id").toInt instead so it stays correct if comparedCols is ever reordered?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

spark = null
}

private def makeRows(ids: Seq[Int], ts: Long, valueFn: Int => Long): Seq[(String, Long, Long, String)] =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row shape is entirely scalar (id, ts, value, partition), so no case here exercises the legacy parquet format's vectorized nested-column read - historically the most fragile branch of these per-version copies (the nested-column vectorized-read bug fixed in HUDI-7190 / #10265). A regression re-introduced there would still ship silently, which is exactly the blind spot this suite is meant to close. Consider adding a struct (or array) column to the schema so the vectorized nested path is covered too; the existing cases only hit flat columns.

@hudi-bot

hudi-bot commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants