test: Spark core tests #19082

Draft
yihua wants to merge 5 commits into apache:master from yihua:spark-core-tests

@yihua yihua commented Jun 27, 2026

Contributor

Describe the issue this Pull Request addresses

The Java CI workflow runs the full Spark datasource test suite once per Spark version and Java/Scala track, producing a large number of checks and high compute cost. Measured per-job timings from a recent master run show ~2,119 runner-minutes per push, ~90% of it in the per-version Spark datasource jobs, most of which re-run Spark-version-independent tests on every version.

This PR runs a curated core subset on every Spark version and the full suites only on the latest 3.x (spark3.5) and latest 4.x (spark4.2).

Summary and Changelog

  • New core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=org.apache.hudi.functional.SparkSQLCoreFlow, injected only in this profile (an empty tagsToInclude would emit -n "" and run zero scala tests).
  • The core set is excluded from the normal tasks so nothing runs twice on a version: core added to excludedGroups of unit-tests/functional-tests/-b/-c, and the scalatest exclude is driven by the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow).
  • CoreFlow scalatest Tag object (name matches the existing @SparkSQLCoreFlow @TagAnnotation) for per-test taggedAs(CoreFlow).
  • Tagged core set: @Tag("core") on representative TestCOWDataSource/TestMORDataSource methods (read/write round-trip, vectorized reader, snapshot/read-optimized/incremental reads, schema evolution) and the parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE INTO/CREATE TABLE blocks.
  • The previously dead-wired TestSparkSqlCoreFlow anchor (tagged @SparkSQLCoreFlow, excluded everywhere, included nowhere) is now run by the core job, trimmed to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
  • bot.yml: full-suite datasource jobs reduced to a single version each; one matrix-driven job test-spark-core-tests (Java 11: spark3.3/3.4/3.5; Java 17: spark3.5/4.0/4.1/4.2, via per-entry javaVersion/mvnProfiles) builds once and runs -Pcore-tests plus the quickstart on every version. Flink (already tiered), bundle-validation, docker, and build-only jobs are unchanged.
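The include/exclude split in the first two bullets can be modelled roughly as below. This is a sketch, not the plugin's code: `scalatest_args` and `selected` are hypothetical helpers, and the only thing assumed about ScalaTest is the Runner convention that `-n` names tags to include and `-l` names tags to exclude.

```python
from typing import Dict, List, Optional, Set

# Tag name taken from the PR description.
CORE_TAG = "org.apache.hudi.functional.SparkSQLCoreFlow"


def scalatest_args(tags_to_include: Optional[str], tags_to_exclude: Optional[str]) -> List[str]:
    """Hypothetical model of turning the two POM settings into Runner arguments.

    None means the element is absent; "" means it is present but empty.
    """
    args: List[str] = []
    if tags_to_include is not None:
        args += ["-n", tags_to_include]  # a present-but-empty value yields -n ""
    if tags_to_exclude:
        args += ["-l", tags_to_exclude]
    return args


def selected(tests: Dict[str, Set[str]], args: List[str]) -> List[str]:
    """Apply include/exclude filtering to a {test name: tags} mapping."""
    include = None
    exclude = None
    for flag, value in zip(args[::2], args[1::2]):
        if flag == "-n":
            include = set(value.split())
        elif flag == "-l":
            exclude = set(value.split())
    names = []
    for name, tags in tests.items():
        if include is not None and not (tags & include):
            continue  # an include list was given and this test matches none of it
        if exclude and (tags & exclude):
            continue
        names.append(name)
    return names


# Illustrative test names, not real Hudi tests.
tests = {"basic insert": {CORE_TAG}, "clustering edge case": set()}

normal_run = selected(tests, scalatest_args(None, CORE_TAG))
core_run = selected(tests, scalatest_args(CORE_TAG, None))
empty_include_run = selected(tests, scalatest_args("", CORE_TAG))
```

In this model the normal run and the core run are disjoint, and an empty include list selects nothing, which is why the include setting is only injected inside the core-tests profile.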

Estimated impact: ~2,119 -> ~1,142 runner-minutes per push (a ~46% reduction), and Spark datasource checks drop from 30 to ~14.
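For reference, the arithmetic behind these figures, using only the numbers quoted above (the ~14 check count is itself an estimate):

```python
# Figures quoted in the PR description.
before_minutes, after_minutes = 2119, 1142  # runner-minutes per push
before_checks, after_checks = 30, 14        # Spark datasource checks


def reduction_pct(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


minutes_saved = before_minutes - after_minutes
minutes_pct = reduction_pct(before_minutes, after_minutes)
checks_pct = reduction_pct(before_checks, after_checks)

print(f"{minutes_saved} runner-minutes saved per push ({minutes_pct:.1f}%)")
print(f"{before_checks - after_checks} fewer checks ({checks_pct:.1f}%)")
```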

Impact

CI/test-execution only; no production code or public API change. Non-latest Spark versions trade the full per-version suite for a curated core subset; full coverage for all versions can still be restored via a scheduled full-matrix run.
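As a rough illustration of the resulting tiering, the version/JDK pairs named in this description can be tabulated as follows. The data structure and the `tiers` helper are hypothetical and only restate the matrix described above; they are not taken from bot.yml.

```python
# (sparkProfile, javaVersion) pairs as listed in the PR description.
FULL_SUITE = {("spark3.5", 11), ("spark4.2", 17)}
CORE_MATRIX = [
    ("spark3.3", 11), ("spark3.4", 11), ("spark3.5", 11),
    ("spark3.5", 17), ("spark4.0", 17), ("spark4.1", 17), ("spark4.2", 17),
]


def tiers(spark: str, java: int) -> list:
    """Return which test tiers run for one Spark/JDK combination."""
    result = []
    if (spark, java) in CORE_MATRIX:
        result.append("core")
    if (spark, java) in FULL_SUITE:
        result.append("full")
    return result


# Combinations that only get the curated core subset.
core_only = [entry for entry in CORE_MATRIX if entry not in FULL_SUITE]

for spark, java in CORE_MATRIX:
    print(spark, f"java{java}", tiers(spark, java))
```

Under this reading, five of the seven matrix entries are core-only, including spark3.5 on Java 17.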

Risk Level

low

CI/test-tiering change only, no source changes. Opened as draft to validate the new matrix in CI before review.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

yihua added 4 commits June 26, 2026 18:17
Introduce a curated 'core' test set that can run on every Spark version,
while the full unit/functional suites run only on the latest Spark majors.

- core-tests Maven profile: surefire groups=core plus scalatest
  tagsToInclude=SparkSQLCoreFlow (injected only here; an empty
  tagsToInclude would emit -n "" and run zero scala tests).
- Exclude the core set from the normal tasks: add 'core' to excludedGroups
  in unit-tests/functional-tests/-b/-c, and drive the scalatest
  tagsToExclude via the hoodie.scalatest.tagsToExclude property (default
  SparkSQLCoreFlow) so core-flow blocks stay out of normal scala runs.
- CoreFlow scalatest Tag object (name matches the @SparkSQLCoreFlow
  annotation) for per-test taggedAs(CoreFlow).
- Tag the core set: @Tag("core") on representative TestCOWDataSource /
  TestMORDataSource methods and parquet/orc/InternalRow writers;
  taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE/CREATE blocks.
- Trim the dead-wired TestSparkSqlCoreFlow anchor to a representative
  table-type/metadata/keygen/index spread so it fits the per-version budget.

Run the full unit/functional/scala suites only on the latest 3.x
(spark3.5/Java11) and latest 4.x (spark4.2/Java17). Add test-spark-core-tests
(spark3.3/3.4/3.5) and test-spark-java17-core-tests (spark3.5/4.0/4.1/4.2)
that build once and run -Pcore-tests plus the quickstart on every version.
Leaves Flink, bundle-validation, docker, and build-only jobs unchanged.

Collapse test-spark-core-tests and test-spark-java17-core-tests into a
single matrix-driven job: each matrix entry carries javaVersion and
mvnProfiles (-Pjava17 for the Java 17 rows, empty for Java 11), so the
JDK setup and mvn invocations are shared. Standardize the module list to
include hudi-common on both tracks (harmless on Java 11). Net: 28 -> 27
jobs, no behavior change to which tests run per version.

Set an explicit name template on test-spark-core-tests so the check name
shows just scalaProfile, sparkProfile, and javaVersion, hiding the
sparkModules and mvnProfiles matrix fields.

github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) Jun 27, 2026

The tagged set spans DataFrame round-trips, SQL DML/DDL, and native
parquet/orc/InternalRow I/O, not just SQL, so SparkSQLCoreFlow understated
it. Consolidate to a single SparkCoreFlow scalatest Tag object as the one
source of truth: delete the SparkSQLCoreFlow Java @TagAnnotation and tag
the TestSparkSqlCoreFlow anchor's blocks via taggedAs(SparkCoreFlow)
instead of the class-level annotation. Update the per-test tags in the
SQL suites, the core-tests profile tagsToInclude, and the
hoodie.scalatest.tagsToExclude default to org.apache.hudi.functional.SparkCoreFlow.
The JUnit5 side keeps the separate surefire @Tag("core") group.

@hudi-bot

Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan

Contributor

⚠️ Coverage gap: version-specific test classes in hudi-sparkX.Y.x submodules run on zero Spark versions after this PR.

The version-specific test modules (hudi-spark3.3.x, hudi-spark3.4.x, hudi-spark4.0.x, hudi-spark4.1.x) exist precisely because they exercise Spark-version-pinned integration points. On master today, each of these classes runs on its own Spark version as part of the full per-version datasource job. After this PR, those per-version jobs are trimmed to spark3.5 (Java 11) and spark4.2 (Java 17) only, and the new test-spark-core-tests job runs surefire groups=core in those same modules — but none of the version-specific tests carry @Tag("core"), so surefire finds zero matches and they stop running everywhere.

Concretely, the test classes that run on zero Spark versions after this PR:

  • hudi-spark-datasource/hudi-spark3.3.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
  • hudi-spark-datasource/hudi-spark3.4.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
  • hudi-spark-datasource/hudi-spark4.0.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
  • hudi-spark-datasource/hudi-spark4.1.x/src/test/scala/org/apache/hudi/TestHoodieStreamingSinkConstants.scala
  • hudi-spark-datasource/hudi-spark4.0.x/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/TestSpark40HoodieParquetReadSupport.scala
  • hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
  • hudi-spark-datasource/hudi-spark4.1.x/src/test/scala/org/apache/hudi/TestSpark4_1AvroLogicalTypeBytes.scala

TestHoodieStreamingSinkConstants is the clearest example — its docstring says "Validates that HoodieStreamingSink.QUERY_ID_KEY matches StreamExecution.QUERY_ID_KEY from this Spark version." The whole point of the test is that it's Spark-version-pinned; if Spark quietly renames or renumbers the private field on 3.3, 3.4, 4.0, or 4.1, CI no longer catches it.

Suggested fix: add @Tag("core") at the class level to each of the classes above so they continue to run on their own Spark version via test-spark-core-tests. It's a one-line change per file:

@Tag("core")
class TestHoodieStreamingSinkConstants { ... }

Cheap to keep, high signal on the exact drift they're written to catch.
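A quick way to spot classes that would be skipped is to scan the test sources for the annotation. The sketch below is an assumption-laden illustration: `untagged` is a hypothetical helper, the shortened paths and file contents are placeholders rather than the real sources, and a plain text match cannot tell a class-level tag from a method-level one.

```python
import re

# Matches the JUnit 5 annotation text @Tag("core"), tolerating inner whitespace.
TAG_PATTERN = re.compile(r'@Tag\(\s*"core"\s*\)')


def untagged(sources: dict) -> list:
    """Return the paths whose source text contains no @Tag("core") annotation."""
    return sorted(path for path, text in sources.items() if not TAG_PATTERN.search(text))


# Placeholder contents keyed by shortened, illustrative paths.
sources = {
    "hudi-spark3.3.x/TestHoodieStreamingSinkConstants.scala":
        "class TestHoodieStreamingSinkConstants { }",
    "hudi-spark4.0.x/TestHoodieRowParquetWriteSupportVariant.java":
        '@Tag("core")\nclass TestHoodieRowParquetWriteSupportVariant { }',
}

missing = untagged(sources)
print(missing)
```

To run it against a checkout, the dictionary could be filled by reading each test file under the version-specific modules.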

@nsivabalan

Contributor

💬 Follow-up on the TestSparkSqlCoreFlow param trim — clarification and a small ask.

Wanted to double-check what actually changes here. On master, TestSparkSqlCoreFlow is class-annotated @SparkSQLCoreFlow, which is listed in hudi-spark/pom.xml's <tagsToExclude> — and no CI job sets a matching <tagsToInclude>. So on master the whole suite is dead-wired: 48 param permutations existed in the source (16 core-flow + 32 immutable-user) but ran on zero Spark versions.

This PR revives the suite by re-tagging it with the new SparkCoreFlow scalatest tag and running it in the new test-spark-core-tests job on every Spark version. Great, that's a net gain from 0 → running. But the param matrix itself is trimmed from 48 → 12 (8 core-flow + 4 immutable-user) in the same PR, so what we're actually landing is:

| | On master today | With this PR |
|---|---|---|
| Runs on | spark3.3, 3.4, 3.5, 4.0, 4.1, 4.2 | spark3.3, 3.4, 3.5, 4.0, 4.1, 4.2 |
| Param permutations | 0 (dead-wired) | 12 (subset of the 48 that live in code) |

So this is unambiguously a net-positive change and not a coverage regression: the 36 permutations that were dropped from paramsForImmutable/params weren't running before either. Sorry for calling this a coverage gap in my original review.

That said, the trim itself is a design choice, and now that the suite is actually going to run, it's worth asking whether the 12 kept permutations are the ones most likely to catch a per-Spark-version regression. A few observations that might help you refine the picks:

  1. params (core-flow, 8 kept of 16): keeps both table types × both keygens × both index families (GLOBAL_* and non-global) × metadata=true|false. Looks like a reasonable pairwise spread — one permutation per (tableType, keygen, indexType) triple with metadata alternating on/off across the pairs.

  2. paramsForImmutable (4 kept of 32): this one is much more aggressive.

    • COPY_ON_WRITE|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM
    • COPY_ON_WRITE|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE
    • MERGE_ON_READ|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM
    • MERGE_ON_READ|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE

    The coverage is asymmetric: insert is only tested on SimpleKeyGenerator + GLOBAL_BLOOM, and bulk_insert only on NonpartitionedKeyGenerator + SIMPLE. So we don't have any single row that tests insert with NonpartitionedKeyGenerator, or bulk_insert with SimpleKeyGenerator, or insert with a non-global index — those combinations run on zero permutations across the whole immutable-user suite.

    Suggested tweak: swap or add one row so that (writeOp × keygen) is covered pairwise. Something like:

    • Replace one existing row with MERGE_ON_READ|insert|true|NonpartitionedKeyGenerator|BLOOM (adds insert × NonpartitionedKeyGenerator × non-global).
    • Or replace with COPY_ON_WRITE|bulk_insert|false|SimpleKeyGenerator|GLOBAL_SIMPLE (adds bulk_insert × SimpleKeyGenerator).

  3. Comment on the source is slightly misleading now: the code comment says the dropped permutations "remain covered by the full suite that runs on the latest Spark versions", but the full suite still has SparkCoreFlow in tagsToExclude (that's how the test-spark-scala-* jobs avoid double-running). So the 36 dropped permutations don't run in the full-suite job either. Suggest changing the comment to something like "trimmed from the 48-permutation matrix in code to a representative spread; the untrimmed permutations do not run on any version", so a future reader doesn't chase a phantom safety net.
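
The asymmetry described in item 2 can be checked mechanically. The sketch below takes the four kept paramsForImmutable rows listed above and reports which (writeOp, keygen) pairs never occur together. The field names and their order are my assumption from how the rows read, and `missing_pairs` is a hypothetical helper, not something in the test suite.

```python
from itertools import product

# The four kept rows, copied from item 2.
ROWS = [
    "COPY_ON_WRITE|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM",
    "COPY_ON_WRITE|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE",
    "MERGE_ON_READ|insert|false|SimpleKeyGenerator|GLOBAL_BLOOM",
    "MERGE_ON_READ|bulk_insert|true|NonpartitionedKeyGenerator|SIMPLE",
]

# Assumed field order; the third field is taken to be the metadata flag.
FIELDS = ("table_type", "write_op", "metadata", "keygen", "index")


def parse(row: str) -> dict:
    """Split one pipe-delimited row into named fields."""
    return dict(zip(FIELDS, row.split("|")))


def missing_pairs(rows: list, a: str, b: str) -> list:
    """Return value pairs of fields a and b that no row covers together.

    The value domain is whatever appears in `rows`, so values that occur
    in no row at all are not reported.
    """
    parsed = [parse(r) for r in rows]
    values_a = sorted({p[a] for p in parsed})
    values_b = sorted({p[b] for p in parsed})
    seen = {(p[a], p[b]) for p in parsed}
    return [pair for pair in product(values_a, values_b) if pair not in seen]


gaps = missing_pairs(ROWS, "write_op", "keygen")
print(gaps)

# The two rows suggested in item 2.
SUGGESTED = [
    "MERGE_ON_READ|insert|true|NonpartitionedKeyGenerator|BLOOM",
    "COPY_ON_WRITE|bulk_insert|false|SimpleKeyGenerator|GLOBAL_SIMPLE",
]
gaps_after = missing_pairs(ROWS + SUGGESTED, "write_op", "keygen")
```

Each suggested row closes one of the two gaps, so full pairwise coverage of writeOp × keygen needs both (or an equivalent pair of rows).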

None of this blocks the PR — it's still a strict improvement. Just worth spending a minute picking better representative permutations while we're touching this list, since it'll be the only version-cross-cutting SQL-flow coverage on non-latest versions.
