Skip to content

refactor(spark): prune the forked extended SQL parser to Hudi-only statements#19132

Open
yihua wants to merge 1 commit into
apache:masterfrom
yihua:simplify-extended-sql-parser
Open

refactor(spark): prune the forked extended SQL parser to Hudi-only statements#19132
yihua wants to merge 1 commit into
apache:masterfrom
yihua:simplify-extended-sql-parser

Conversation

@yihua

@yihua yihua commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

Each Spark version module (hudi-spark3.3.x through hudi-spark4.2.x) carries a full fork of Spark's SQL grammar (antlr4/imports/SqlBase.g4, ~1,900 lines) and AstBuilder (HoodieSparkX_YExtendedSqlAstBuilder, ~3,500 lines). The fork exists because, before Spark 3.3, Spark had no time-travel SQL, so Hudi had to parse the entire query surface to support TIMESTAMP AS OF. Every supported Spark version (3.3+) now parses TIMESTAMP/VERSION AS OF natively, including the SYSTEM_TIME/SYSTEM_VERSION spellings, so the bulk of the fork is dead weight that must be re-copied and re-ported for every new Spark version.

This supersedes #12270 (HUDI-4468) by @KnightChess, which pioneered the same simplification but predates Spark 4 support and the BLOB/VECTOR column types that now depend on the fork's CREATE TABLE path.

Summary and Changelog

Narrow the fork to what is actually Hudi-specific; delegate everything else to Spark's parser.

  • HoodieSqlBase.g4 (x6): the statement rule keeps only createTable (without the CTAS AS query suffix, since CTAS cannot specify a schema and therefore never needs Hudi types at parse time) and the four index DDL statements; everything else hits passThrough and is parsed by the delegate.
  • HoodieSparkX_YExtendedSqlAstBuilder (x6): reduced from ~3,540 to ~1,430 lines, keeping the visitor closure for CREATE TABLE (including BLOB/VECTOR column types), index DDL, and their transitive helpers (types, literals, properties, transforms).
  • HoodieSparkX_YExtendedSqlParser (x6): the isHoodieCommand gate no longer intercepts time-travel text; only index DDL and BLOB/VECTOR statements route to the fork.
  • HoodieCommonSqlParser: after delegation, rewrites Spark's native RelationTimeTravel into Hudi's TimeTravelRelation (transformDownWithSubqueries, available since Spark 3.3), so resolution flows through the existing Hudi rule unchanged.
  • HoodieSparkBaseAnalysis: the parse-time validation that a time-travel timestamp has no column references or subqueries moves into the resolution rule.
  • TestTimeTravelTable: adds coverage for the SYSTEM_TIME AS OF / FOR SYSTEM_TIME AS OF / FOR TIMESTAMP AS OF spellings now served by Spark's native grammar.

Behavior notes:

  • Time travel on non-Hudi tables with the Hudi extension installed fails analysis exactly as before (the fork also produced an unresolvable TimeTravelRelation for them); making that delegate cleanly to Spark native resolution is a possible follow-up.
  • VERSION AS OF continues to throw "Version expression is not supported for time travel" for Hudi tables, as before.

Design differences from #12270: the imports/SqlBase.g4 copies are retained (only the statement surface narrowed) so ANTLR keeps generating the token vocabulary the index/createTable rules reference; time travel is bridged at the parser level instead of adding HoodieCatalog.loadTable(ident, timestamp/version) overloads, so behavior does not depend on spark_catalog being set to HoodieCatalog; and the BLOB/VECTOR CREATE TABLE path introduced after that PR is preserved.

Impact

~12,700 lines of forked parser code removed across the six Spark version modules. New Spark version modules no longer need to port the full AstBuilder fork. No public API change; SQL syntax accepted before and after is identical.

Risk Level

medium

The retained visitor closure was verified per module (structural balance, no references to removed members, per-version visitCreateTable bodies preserved verbatim minus CTAS). Time-travel semantics are pinned by the existing TestTimeTravelTable suite plus the new spelling tests; index DDL by TestIndexSyntax/TestSecondaryIndex; BLOB/VECTOR DDL by the existing create-table tests. Full CI matrix (Spark 3.3-4.2) validates each fork copy.

Documentation Update

None. system_time as of and system_version as of remain accepted (natively by Spark).

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…atements

The per-Spark-version HoodieSqlBase grammar and ExtendedSqlAstBuilder forked
Spark's full SQL surface only so that time travel could be parsed before
Spark 3.3. All supported Spark versions now parse TIMESTAMP/VERSION AS OF
(including the SYSTEM_TIME/SYSTEM_VERSION spellings) natively, so the fork
narrows to what is actually Hudi-specific: CREATE TABLE with Hudi column
types (BLOB, VECTOR) and the index DDL statements.

HoodieCommonSqlParser rewrites Spark's RelationTimeTravel into Hudi's
TimeTravelRelation after delegation, preserving the existing Hudi
resolution path; the parse-time timestamp validation moves into the
analysis rule.
@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Jul 2, 2026
@hudi-bot

hudi-bot commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

})
}

test("Test time travel with SQL:2011 temporal clause spellings") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The moved timestamp validation has no test coverage. The column-reference / subquery guard changed from a parse-time ParseException (two distinct messages) to an analysis-time HoodieAnalysisException (one combined message), and VERSION AS OF now reaches the 'Version expression is not supported' guard through the native delegation path. Consider adding negative cases here: a TIMESTAMP AS OF referencing a column, a TIMESTAMP AS OF with a subquery, and a VERSION AS OF on a Hudi table - asserting the new exception type and message so a future regression in the relocated guard is caught.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants