test: Spark core tests #19082
Introduce a curated 'core' test set that can run on every Spark version, while the full unit/functional suites run only on the latest Spark majors.
- core-tests Maven profile: surefire groups=core plus scalatest tagsToInclude=SparkSQLCoreFlow (injected only here; an empty tagsToInclude would emit -n "" and run zero scala tests).
- Exclude the core set from the normal tasks: add 'core' to excludedGroups in unit-tests/functional-tests/-b/-c, and drive the scalatest tagsToExclude via the hoodie.scalatest.tagsToExclude property (default SparkSQLCoreFlow) so core-flow blocks stay out of normal scala runs.
- CoreFlow scalatest Tag object (name matches the @SparkSQLCoreFlow annotation) for per-test taggedAs(CoreFlow).
- Tag the core set: @Tag("core") on representative TestCOWDataSource / TestMORDataSource methods and parquet/orc/InternalRow writers; taggedAs(CoreFlow) on basic INSERT/UPDATE/DELETE/MERGE/CREATE blocks.
- Trim the dead-wired TestSparkSqlCoreFlow anchor to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
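The include/exclude split on the JUnit 5 side can be pictured with a small model. This is only a sketch, assuming the usual surefire semantics (a test runs if it carries an included group, or no include filter is set, and carries no excluded group). The class `TagFilterSketch`, the test names, and the `functional` tag are illustrative assumptions, not Hudi code, and the sketch does not model the scalatest `-n ""` behavior mentioned above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of tag-based test selection; not Hudi or surefire code.
final class TagFilterSketch {

    // A test is selected when it matches the include filter (or none is set)
    // and carries none of the excluded tags.
    static boolean selected(Set<String> testTags, Set<String> groups, Set<String> excludedGroups) {
        boolean included = groups.isEmpty() || testTags.stream().anyMatch(groups::contains);
        boolean excluded = testTags.stream().anyMatch(excludedGroups::contains);
        return included && !excluded;
    }

    static Set<String> tags(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Illustrative test names and tags only.
        Map<String, Set<String>> tests = new LinkedHashMap<>();
        tests.put("roundTripTest", tags("core", "functional"));
        tests.put("clusteringTest", tags("functional"));

        for (Map.Entry<String, Set<String>> t : tests.entrySet()) {
            // core-tests profile: groups=core, nothing excluded.
            boolean inCore = selected(t.getValue(), tags("core"), tags());
            // normal task: its usual group, with 'core' added to excludedGroups.
            boolean inFull = selected(t.getValue(), tags("functional"), tags("core"));
            System.out.println(t.getKey() + " core-tests=" + inCore + " functional-tests=" + inFull);
        }
    }
}
```

Under this model each tagged test lands in exactly one of the two runs, which is the property the `excludedGroups` change is meant to provide.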
Run the full unit/functional/scala suites only on the latest 3.x (spark3.5/Java11) and latest 4.x (spark4.2/Java17). Add test-spark-core-tests (spark3.3/3.4/3.5) and test-spark-java17-core-tests (spark3.5/4.0/4.1/4.2) that build once and run -Pcore-tests plus the quickstart on every version. Leaves Flink, bundle-validation, docker, and build-only jobs unchanged.
Collapse test-spark-core-tests and test-spark-java17-core-tests into a single matrix-driven job: each matrix entry carries javaVersion and mvnProfiles (-Pjava17 for the Java 17 rows, empty for Java 11), so the JDK setup and mvn invocations are shared. Standardize the module list to include hudi-common on both tracks (harmless on Java 11). Net: 28 -> 27 jobs, no behavior change to which tests run per version.
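As a rough illustration of the matrix-driven shape, the sketch below models each matrix entry as a small value object and derives the Maven arguments from it. The version lists and the `-Pjava17` / empty `mvnProfiles` split come from the commit message; the class name `CoreMatrixSketch`, the `-D<sparkProfile>` flag form, and the exact argument order are assumptions for illustration and may not match the workflow file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the collapsed job matrix; not the actual workflow definition.
final class CoreMatrixSketch {

    static final class Entry {
        final String sparkProfile;
        final String javaVersion;
        final String mvnProfiles; // "-Pjava17" on Java 17 rows, empty on Java 11 rows

        Entry(String sparkProfile, String javaVersion, String mvnProfiles) {
            this.sparkProfile = sparkProfile;
            this.javaVersion = javaVersion;
            this.mvnProfiles = mvnProfiles;
        }
    }

    static List<Entry> matrix() {
        List<Entry> entries = new ArrayList<>();
        for (String spark : new String[] {"spark3.3", "spark3.4", "spark3.5"}) {
            entries.add(new Entry(spark, "11", ""));
        }
        for (String spark : new String[] {"spark3.5", "spark4.0", "spark4.1", "spark4.2"}) {
            entries.add(new Entry(spark, "17", "-Pjava17"));
        }
        return entries;
    }

    // One shared command template; only the per-entry fields vary.
    // The "-D<sparkProfile>" form is an assumption made for this sketch.
    static String mvnArgs(Entry e) {
        StringBuilder sb = new StringBuilder("-Pcore-tests");
        if (!e.mvnProfiles.isEmpty()) {
            sb.append(' ').append(e.mvnProfiles);
        }
        sb.append(" -D").append(e.sparkProfile);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (Entry e : matrix()) {
            System.out.println("java " + e.javaVersion + ": mvn test " + mvnArgs(e));
        }
    }
}
```

Carrying `mvnProfiles` as data rather than branching on the Java version is what lets both tracks share one set of steps.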
Set an explicit name template on test-spark-core-tests so the check name shows just scalaProfile, sparkProfile, and javaVersion, hiding the sparkModules and mvnProfiles matrix fields.
The tagged set spans DataFrame round-trips, SQL DML/DDL, and native parquet/orc/InternalRow I/O, not just SQL, so SparkSQLCoreFlow understated it. Consolidate to a single SparkCoreFlow scalatest Tag object as the one source of truth: delete the SparkSQLCoreFlow Java @TagAnnotation and tag the TestSparkSqlCoreFlow anchor's blocks via taggedAs(SparkCoreFlow) instead of the class-level annotation. Update the per-test tags in the SQL suites, the core-tests profile tagsToInclude, and the hoodie.scalatest.tagsToExclude default to org.apache.hudi.functional.SparkCoreFlow. The JUnit5 side keeps the separate surefire @Tag("core") group.
The version-specific test modules ( Concretely, the test classes that run on zero Spark versions after this PR:
Suggested fix: add @Tag("core")
class TestHoodieStreamingSinkConstants { ... }
Cheap to keep, high signal on the exact drift they're written to catch.
💬 Follow-up on the Wanted to double-check what actually changes here. On master, This PR revives the suite by re-tagging it with the new
So this is unambiguously a net-positive change and not a coverage regression — the 37 permutations that were dropped from That said, the trim itself is a design choice, and now that the suite is actually going to run, it's worth asking whether the 11 kept permutations are the ones most likely to catch a per-Spark-version regression. A few observations that might help you refine the picks:
None of this blocks the PR; it's still a strict improvement. Just worth spending a minute picking better representative permutations while we're touching this list, since it'll be the only version-cross-cutting SQL-flow coverage on non-latest versions.
Describe the issue this Pull Request addresses
The `Java CI` workflow runs the full Spark datasource test suite once per Spark version and Java/Scala track, producing a large number of checks and high compute cost. Measured per-job timings from a recent master run show ~2,119 runner-minutes per push, ~90% of it in the per-version Spark datasource jobs, most of which re-run Spark-version-independent tests on every version.

This PR runs a curated core subset on every Spark version and the full suites only on the latest 3.x (spark3.5) and latest 4.x (spark4.2).
Summary and Changelog
- `core-tests` Maven profile: surefire `groups=core` plus scalatest `tagsToInclude=org.apache.hudi.functional.SparkSQLCoreFlow`, injected only in this profile (an empty `tagsToInclude` would emit `-n ""` and run zero scala tests).
- `core` added to `excludedGroups` of `unit-tests`/`functional-tests`/`-b`/`-c`, and the scalatest exclude is driven by the `hoodie.scalatest.tagsToExclude` property (default `SparkSQLCoreFlow`).
- `CoreFlow` scalatest `Tag` object (name matches the existing `@SparkSQLCoreFlow` `@TagAnnotation`) for per-test `taggedAs(CoreFlow)`.
- `@Tag("core")` on representative `TestCOWDataSource`/`TestMORDataSource` methods (read/write round-trip, vectorized reader, snapshot/read-optimized/incremental reads, schema evolution) and the parquet/orc/InternalRow writers; `taggedAs(CoreFlow)` on basic INSERT/UPDATE/DELETE/MERGE INTO/CREATE TABLE blocks.
- `TestSparkSqlCoreFlow` anchor (tagged `@SparkSQLCoreFlow`, excluded everywhere, included nowhere) is now run by the core job, trimmed to a representative table-type/metadata/keygen/index spread so it fits the per-version budget.
- `bot.yml`: full-suite datasource jobs reduced to a single version each; one matrix-driven job `test-spark-core-tests` (Java 11: spark3.3/3.4/3.5; Java 17: spark3.5/4.0/4.1/4.2, via per-entry `javaVersion`/`mvnProfiles`) builds once and runs `-Pcore-tests` plus the quickstart on every version. Flink (already tiered), bundle-validation, docker, and build-only jobs are unchanged.

Estimated impact: ~2,119 -> ~1,142 runner-minutes per push (~46%), and Spark datasource checks drop from 30 to ~14.
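The estimated-impact percentage can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the class and method names are made up for illustration, and the check-count percentage is derived here rather than stated in the description.

```java
import java.util.Locale;

// Worked arithmetic for the quoted estimate; names are illustrative only.
final class CiSavingsSketch {

    // Relative reduction from 'before' to 'after', in percent.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Runner-minutes per push: ~2,119 -> ~1,142.
        System.out.println(String.format(Locale.ROOT, "%.1f", reductionPercent(2119, 1142))); // prints 46.1
        // Spark datasource checks: 30 -> ~14.
        System.out.println(String.format(Locale.ROOT, "%.1f", reductionPercent(30, 14)));     // prints 53.3
    }
}
```

So the "~46%" figure reads as the reduction in runner-minutes (977 of 2,119), not the share that remains.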
Impact
CI/test-execution only; no production code or public API change. Non-latest Spark versions trade the full per-version suite for a curated core subset; full coverage for all versions can still be restored via a scheduled full-matrix run.
Risk Level
low
CI/test-tiering change only, no source changes. Opened as draft to validate the new matrix in CI before review.
Documentation Update
none
Contributor's checklist