test: Spark core tests by yihua · Pull Request #19082 · apache/hudi

yihua · 2026-06-27T04:59:48Z

Describe the issue this Pull Request addresses

The Java CI workflow runs the full Spark datasource test suite once per Spark version and Java/Scala track, producing a large number of checks and high compute cost. Measured per-job timings from a recent master run show ~2,119 runner-minutes per push, ~90% of it in the per-version Spark datasource jobs, most of which re-run Spark-version-independent tests on every version.

This PR runs a curated core subset on every Spark version and the full suites only on the latest 3.x (spark3.5) and latest 4.x (spark4.2).

Summary and Changelog

New core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=org.apache.hudi.functional.SparkSQLCoreFlow, injected only in this profile (an empty tagsToInclude would emit -n "" and run zero scala tests).
The core set is excluded from the normal tasks so nothing runs twice on a version: core added to excludedGroups of unit-tests/functional-tests/-b/-c, and the scalatest exclude is driven by the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow).
CoreFlow scalatest Tag object (name matches the existing @SparkSQLCoreFlow @TagAnnotation) for per-test taggedAs(CoreFlow).
Tagged core set: @Tag("core") on representative TestCOWDataSource/TestMORDataSource methods (read/write round-trip, vectorized reader, snapshot/read-optimized/incremental reads, schema evolution) and the parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE INTO/CREATE TABLE blocks.
The previously dead-wired TestSparkSqlCoreFlow anchor (tagged @SparkSQLCoreFlow, excluded everywhere, included nowhere) is now run by the core job, trimmed to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
bot.yml: full-suite datasource jobs reduced to a single version each; one matrix-driven job test-spark-core-tests (Java 11: spark3.3/3.4/3.5; Java 17: spark3.5/4.0/4.1/4.2, via per-entry javaVersion/mvnProfiles) builds once and runs -Pcore-tests plus the quickstart on every version. Flink (already tiered), bundle-validation, docker, and build-only jobs are unchanged.

Estimated impact: ~2,119 -> ~1,142 runner-minutes per push (~46%), and Spark datasource checks drop from 30 to ~14.
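
As a quick check on the estimate above, the reduction can be recomputed from the two runner-minute figures and the two check counts quoted in this description. This is only arithmetic on those numbers; the class and method names are hypothetical and not part of the PR.

```java
// Hypothetical helper, not part of the PR: recomputes the quoted savings.
class RunnerMinuteSavings {
    /** Percentage reduction when going from `before` to `after`. */
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Runner-minutes per push, as quoted in the description.
        System.out.printf("runner-minutes: %.1f%% lower%n", reductionPercent(2119, 1142));
        // Spark datasource checks: 30 before, about 14 after.
        System.out.printf("checks: %.1f%% fewer%n", reductionPercent(30, 14));
    }
}
```

The first figure comes out at roughly 46%, matching the estimate; the check count drops by a little over half.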

Impact

CI/test-execution only; no production code or public API change. Non-latest Spark versions trade the full per-version suite for a curated core subset; full coverage for all versions can still be restored via a scheduled full-matrix run.

Risk Level

low

CI/test-tiering change only, no source changes. Opened as draft to validate the new matrix in CI before review.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable


Introduce a curated 'core' test set that can run on every Spark version, while the full unit/functional suites run only on the latest Spark majors.
- core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=SparkSQLCoreFlow (injected only here; an empty tagsToInclude would emit -n "" and run zero scala tests).
- Exclude the core set from the normal tasks: add 'core' to excludedGroups in unit-tests/functional-tests/-b/-c, and drive the scalatest tagsToExclude via the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow) so core-flow blocks stay out of normal scala runs.
- CoreFlow scalatest Tag object (name matches the @SparkSQLCoreFlow annotation) for per-test taggedAs(CoreFlow).
- Tag the core set: @Tag("core") on representative TestCOWDataSource / TestMORDataSource methods and parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE/CREATE blocks.
- Trim the dead-wired TestSparkSqlCoreFlow anchor to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.

Run the full unit/functional/scala suites only on the latest 3.x (spark3.5/Java11) and latest 4.x (spark4.2/Java17). Add test-spark-core-tests (spark3.3/3.4/3.5) and test-spark-java17-core-tests (spark3.5/4.0/4.1/4.2) that build once and run -Pcore-tests plus the quickstart on every version. Leaves Flink, bundle-validation, docker, and build-only jobs unchanged.

Collapse test-spark-core-tests and test-spark-java17-core-tests into a single matrix-driven job: each matrix entry carries javaVersion and mvnProfiles (-Pjava17 for the Java 17 rows, empty for Java 11), so the JDK setup and mvn invocations are shared. Standardize the module list to include hudi-common on both tracks (harmless on Java 11). Net: 28 -> 27 jobs, no behavior change to which tests run per version.
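
The collapsed matrix can be pictured as a flat list of entries that differ only in data. The real definition lives in the workflow YAML (bot.yml); the Java below is just a model of it, with hypothetical class and field names. It builds only the profile flags named in this PR, not the full mvn command line.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the matrix-driven job; names are hypothetical.
class CoreTestMatrix {
    /** One matrix row: which Spark profile, which JDK, and any extra Maven profiles. */
    static final class Entry {
        final String sparkProfile;
        final String javaVersion;
        final String mvnProfiles; // "-Pjava17" on Java 17 rows, empty on Java 11 rows

        Entry(String sparkProfile, String javaVersion, String mvnProfiles) {
            this.sparkProfile = sparkProfile;
            this.javaVersion = javaVersion;
            this.mvnProfiles = mvnProfiles;
        }
    }

    /** The seven rows described in the PR: three on Java 11, four on Java 17. */
    static List<Entry> entries() {
        List<Entry> rows = new ArrayList<>();
        for (String spark : List.of("spark3.3", "spark3.4", "spark3.5")) {
            rows.add(new Entry(spark, "11", ""));
        }
        for (String spark : List.of("spark3.5", "spark4.0", "spark4.1", "spark4.2")) {
            rows.add(new Entry(spark, "17", "-Pjava17"));
        }
        return rows;
    }

    /** Shared step: every row runs the core-tests profile plus its own extra profiles. */
    static String profileFlags(Entry e) {
        return e.mvnProfiles.isEmpty() ? "-Pcore-tests" : "-Pcore-tests " + e.mvnProfiles;
    }

    public static void main(String[] args) {
        for (Entry e : entries()) {
            System.out.println(e.sparkProfile + " / Java " + e.javaVersion + " -> " + profileFlags(e));
        }
    }
}
```

Because the per-row differences are carried as data, the JDK setup and the Maven steps need to be written only once.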

Set an explicit name template on test-spark-core-tests so the check name shows just scalaProfile, sparkProfile, and javaVersion, hiding the sparkModules and mvnProfiles matrix fields.


The tagged set spans DataFrame round-trips, SQL DML/DDL, and native parquet/orc/InternalRow I/O, not just SQL, so the name SparkSQLCoreFlow understated it. Consolidate to a single SparkCoreFlow scalatest Tag object as the one source of truth: delete the SparkSQLCoreFlow Java @TagAnnotation and tag the TestSparkSqlCoreFlow anchor's blocks via taggedAs(SparkCoreFlow) instead of the class-level annotation. Update the per-test tags in the SQL suites, the core-tests profile tagsToInclude, and the hoodie.scalatest.tagsToExclude default to org.apache.hudi.functional.SparkCoreFlow. The JUnit5 side keeps the separate surefire @Tag("core") group.

hudi-bot · 2026-06-28T01:41:43Z

CI report:

28e1921 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

nsivabalan · 2026-07-01T20:05:17Z

⚠️ Coverage gap: version-specific test classes in hudi-sparkX.Y.x submodules run on zero Spark versions after this PR.

The version-specific test modules (hudi-spark3.3.x, hudi-spark3.4.x, hudi-spark4.0.x, hudi-spark4.1.x) exist precisely because they exercise Spark-version-pinned integration points. On master today, each of these classes runs on its own Spark version as part of the full per-version datasource job. After this PR, those per-version jobs are trimmed to spark3.5 (Java 11) and spark4.2 (Java 17) only, and the new test-spark-core-tests job runs surefire groups=core in those same modules — but none of the version-specific tests carry @Tag("core"), so surefire finds zero matches and they stop running everywhere.

Concretely, the test classes that run on zero Spark versions after this PR:

hudi-spark-datasource/hudi-spark3.3.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
hudi-spark-datasource/hudi-spark3.4.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
hudi-spark-datasource/hudi-spark4.0.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
hudi-spark-datasource/hudi-spark4.1.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
hudi-spark-datasource/hudi-spark4.0.x/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/TestSpark40HoodieParquetReadSupport.scala
hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
hudi-spark-datasource/hudi-spark4.1.x/src/test/scala/org/apache/hudi/TestSpark4_1AvroLogicalTypeBytes.scala

TestHoodieStreamingSinkConstants is the clearest example — its docstring says "Validates that HoodieStreamingSink.QUERY_ID_KEY matches StreamExecution.QUERY_ID_KEY from this Spark version." The whole point of the test is that it's Spark-version-pinned; if Spark quietly renames or renumbers the private field on 3.3, 3.4, 4.0, or 4.1, CI no longer catches it.

Suggested fix: add @Tag("core") at the class level to each of the classes above so they continue to run on their own Spark version via test-spark-core-tests. It's a one-line change per file:

import org.junit.jupiter.api.Tag

@Tag("core")
class TestHoodieStreamingSinkConstants { ... }

Cheap to keep, high signal on the exact drift they're written to catch.
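
The gap described above can be modeled in a few lines. The sketch below is a simplified model, not surefire's actual filtering: it assumes a test in a version-pinned module runs only under that module's own Spark version, that the full suite runs only on spark3.5 and spark4.2, and that the core job runs on all six versions but selects only tests tagged core. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Simplified model of the test tiering, for illustration only.
class TierCoverage {
    // Versions where the full (non-core) suites still run after this PR.
    static final Set<String> FULL_SUITE_VERSIONS = Set.of("spark3.5", "spark4.2");
    // Versions covered by the test-spark-core-tests matrix.
    static final Set<String> CORE_JOB_VERSIONS =
        Set.of("spark3.3", "spark3.4", "spark3.5", "spark4.0", "spark4.1", "spark4.2");

    /** True if a test pinned to moduleVersion is picked up by at least one job. */
    static boolean runsSomewhere(String moduleVersion, boolean taggedCore) {
        // Core-tagged tests are excluded from the full suite and included in the core job;
        // untagged tests are the reverse.
        return taggedCore
            ? CORE_JOB_VERSIONS.contains(moduleVersion)
            : FULL_SUITE_VERSIONS.contains(moduleVersion);
    }

    public static void main(String[] args) {
        // The four version-pinned modules named in the review comment.
        for (String v : List.of("spark3.3", "spark3.4", "spark4.0", "spark4.1")) {
            System.out.println(v + ": untagged runs=" + runsSomewhere(v, false)
                + ", tagged core runs=" + runsSomewhere(v, true));
        }
    }
}
```

Under these assumptions, an untagged test in any of the four modules is selected by no job, while the same test tagged core is selected by the core job on its own version.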

nsivabalan · 2026-07-01T20:05:24Z

💬 Follow-up on the TestSparkSqlCoreFlow param trim — clarification and a small ask.

Wanted to double-check what actually changes here. On master, TestSparkSqlCoreFlow is class-annotated @SparkSQLCoreFlow, which is listed in hudi-spark/pom.xml's <tagsToExclude> — and no CI job sets a matching <tagsToInclude>. So on master the whole suite is dead-wired: 48 param permutations existed in the source (16 core-flow + 32 immutable-user) but ran on zero Spark versions.

This PR revives the suite by re-tagging it with the new SparkCoreFlow scalatest tag and running it in the new test-spark-core-tests job on every Spark version. Great: that's a net gain from 0 → running. But the param matrix itself is trimmed from 48 → 12 (8 core-flow + 4 immutable-user) in the same PR, so what we're actually landing is:

| | On master today | With this PR |
|---|---|---|
| Runs on | `spark3.3`, `3.4`, `3.5`, `4.0`, `4.1`, `4.2` | `spark3.3`, `3.4`, `3.5`, `4.0`, `4.1`, `4.2` |
| Param permutations | 0 (dead-wired) | 12 (subset of the 48 that live in code) |

So this is unambiguously a net-positive change and not a coverage regression: the 36 permutations that were dropped from paramsForImmutable/params weren't running before either. Sorry for calling this a coverage gap in my original review.

That said, the trim itself is a design choice, and now that the suite is actually going to run, it's worth asking whether the 12 kept permutations are the ones most likely to catch a per-Spark-version regression. A few observations that might help you refine the picks:

params (core-flow, 8 kept of 16): keeps both table types × both keygens × both index families (GLOBAL_* and non-global) × metadata=true|false. Looks like a reasonable pairwise spread — one permutation per (tableType, keygen, indexType) triple with metadata alternating on/off across the pairs.
paramsForImmutable (4 kept of 32): this one is much more aggressive.
- COPY_ON_WRITE|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM
- COPY_ON_WRITE|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE
- MERGE_ON_READ|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM
- MERGE_ON_READ|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE
The coverage is asymmetric: insert is only tested on SimpleKeyGenerator + GLOBAL_BLOOM, and bulk_insert only on NonpartitionedKeyGenerator + SIMPLE. So we don't have any single row that tests insert with NonpartitionedKeyGenerator, or bulk_insert with SimpleKeyGenerator, or insert with a non-global index — those combinations run on zero permutations across the whole immutable-user suite.

Suggested tweak: swap or add one row so that (writeOp × keygen) is covered pairwise. Something like:
- Replace one existing row with MERGE_ON_READ|insert|true|NonpartitionedKeyGenerator|BLOOM (adds insert × NonpartitionedKeyGenerator × non-global).
- Or replace with COPY_ON_WRITE|bulk_insert|false|SimpleKeyGenerator|GLOBAL_SIMPLE (adds bulk_insert × SimpleKeyGenerator).

The comment in the source is slightly misleading now: it says the dropped permutations "remain covered by the full suite that runs on the latest Spark versions", but the full suite still has SparkCoreFlow in tagsToExclude (that's how the test-spark-scala-* jobs avoid double-running). So the 36 dropped permutations don't run in the full-suite job either. Suggest changing the comment to something like "trimmed from the 48-permutation matrix in code to a representative spread; the dropped permutations do not run on any version", so a future reader doesn't chase a phantom safety net.

None of this blocks the PR — it's still a strict improvement. Just worth spending a minute picking better representative permutations while we're touching this list, since it'll be the only version-cross-cutting SQL-flow coverage on non-latest versions.
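
The asymmetry in paramsForImmutable can be checked mechanically. The sketch below takes the four kept rows exactly as listed above and reports which (writeOp × keygen) pairs never appear together. The field order (tableType, writeOp, a boolean taken here to be the metadata flag, keygen, indexType) is inferred from the rows, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative pairwise-coverage check; not part of the PR.
class ImmutableParamCoverage {
    // Kept rows as quoted in the review; assumed order: tableType|writeOp|metadata|keygen|indexType.
    static final List<String> KEPT = List.of(
        "COPY_ON_WRITE|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM",
        "COPY_ON_WRITE|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE",
        "MERGE_ON_READ|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM",
        "MERGE_ON_READ|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE");

    /** Returns every "writeOp x keygen" combination that no row exercises, in sorted order. */
    static List<String> missingPairs(List<String> rows) {
        Set<String> writeOps = new TreeSet<>();
        Set<String> keygens = new TreeSet<>();
        Set<String> covered = new HashSet<>();
        for (String row : rows) {
            String[] f = row.split("\\|"); // '|' is a regex metacharacter, so escape it
            writeOps.add(f[1]);
            keygens.add(f[3]);
            covered.add(f[1] + " x " + f[3]);
        }
        List<String> missing = new ArrayList<>();
        for (String op : writeOps) {
            for (String kg : keygens) {
                if (!covered.contains(op + " x " + kg)) {
                    missing.add(op + " x " + kg);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        missingPairs(KEPT).forEach(System.out::println);
    }
}
```

For the four kept rows this reports bulk_insert x SimpleKeyGenerator and insert x NonpartitionedKeyGenerator as uncovered. Adding both suggested rows to the list would leave nothing missing for this pair of dimensions.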

yihua added 4 commits June 26, 2026 18:17

Show only scala/spark/java in the core-tests job name

3c9220a


github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Jun 27, 2026

yihua wants to merge 5 commits into apache:master from yihua:spark-core-tests
