refactor(spark): prune the forked extended SQL parser to Hudi-only statements #19132
Open
yihua wants to merge 1 commit into
…atements The per-Spark-version HoodieSqlBase grammar and ExtendedSqlAstBuilder forked Spark's full SQL surface only so that time travel could be parsed before Spark 3.3. All supported Spark versions now parse TIMESTAMP/VERSION AS OF (including the SYSTEM_TIME/SYSTEM_VERSION spellings) natively, so the fork narrows to what is actually Hudi-specific: CREATE TABLE with Hudi column types (BLOB, VECTOR) and the index DDL statements. HoodieCommonSqlParser rewrites Spark's RelationTimeTravel into Hudi's TimeTravelRelation after delegation, preserving the existing Hudi resolution path; the parse-time timestamp validation moves into the analysis rule.
Collaborator
wombatu-kun
reviewed
Jul 2, 2026
    })
  }

  test("Test time travel with SQL:2011 temporal clause spellings") {
Contributor
The moved timestamp validation has no test coverage. The column-reference / subquery guard changed from a parse-time ParseException (two distinct messages) to an analysis-time HoodieAnalysisException (one combined message), and VERSION AS OF now reaches the 'Version expression is not supported' guard through the native delegation path. Consider adding negative cases here: a TIMESTAMP AS OF referencing a column, a TIMESTAMP AS OF with a subquery, and a VERSION AS OF on a Hudi table - asserting the new exception type and message so a future regression in the relocated guard is caught.
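The negative cases listed in this comment can be illustrated with a toy model of the relocated guard. This is only a sketch in plain Scala with no Spark or Hudi dependency: Expr, ToyAnalysisException, TimeTravelGuard, NegativeCases, and the timestamp message text are hypothetical stand-ins (this page does not quote the new combined message), and the order of the two checks is an assumption. Only the version message is taken from the PR description. A real test would run SQL against a Hudi table in TestTimeTravelTable and assert on HoodieAnalysisException.

```scala
// Toy stand-ins, NOT Spark/Hudi classes: a tiny tree for the AS OF operand.
sealed trait Expr
final case class Literal(value: String) extends Expr
final case class ColumnRef(name: String) extends Expr
final case class ScalarSubquery(sql: String) extends Expr
final case class Cast(child: Expr, dataType: String) extends Expr

// Hypothetical exception standing in for HoodieAnalysisException.
final class ToyAnalysisException(msg: String) extends RuntimeException(msg)

object TimeTravelGuard {
  // Hypothetical text: the real combined message is not quoted on this page.
  val TimestampMessage = "timestamp expression cannot contain column references or subqueries"
  // Taken from the PR description.
  val VersionMessage = "Version expression is not supported for time travel"

  private def children(e: Expr): Seq[Expr] = e match {
    case Cast(child, _) => Seq(child)
    case _              => Nil
  }

  // True if the predicate holds for the node or any descendant.
  private def exists(e: Expr)(p: Expr => Boolean): Boolean =
    p(e) || children(e).exists(c => exists(c)(p))

  // Analysis-time check: reject VERSION AS OF, then reject timestamps
  // that reference a column or contain a subquery anywhere in the tree.
  def validate(timestamp: Option[Expr], version: Option[String]): Unit = {
    if (version.isDefined) throw new ToyAnalysisException(VersionMessage)
    timestamp.foreach { ts =>
      val invalid = exists(ts) {
        case _: ColumnRef | _: ScalarSubquery => true
        case _                                => false
      }
      if (invalid) throw new ToyAnalysisException(TimestampMessage)
    }
  }
}

object NegativeCases {
  // Returns the guard's message if it rejected the input, None otherwise.
  def failureOf(body: => Unit): Option[String] =
    try { body; None } catch { case e: ToyAnalysisException => Some(e.getMessage) }
}
```

Asserting on both the exception type and the message, as the comment suggests, is what would catch a regression where the guard is dropped or merged into a different error path.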
Describe the issue this Pull Request addresses
Each Spark version module (hudi-spark3.3.x through hudi-spark4.2.x) carries a full fork of Spark's SQL grammar (antlr4/imports/SqlBase.g4, ~1,900 lines) and AstBuilder (HoodieSparkX_YExtendedSqlAstBuilder, ~3,500 lines). The fork exists because, before Spark 3.3, Spark had no time-travel SQL, so Hudi had to parse the entire query surface to support TIMESTAMP AS OF. Every supported Spark version (3.3+) now parses TIMESTAMP/VERSION AS OF natively, including the SYSTEM_TIME/SYSTEM_VERSION spellings, so the bulk of the fork is dead weight that must be re-copied and re-ported for every new Spark version.
This supersedes #12270 (HUDI-4468) by @KnightChess, which pioneered the same simplification but predates Spark 4 support and the BLOB/VECTOR column types that now depend on the fork's CREATE TABLE path.
Summary and Changelog
Narrow the fork to what is actually Hudi-specific; delegate everything else to Spark's parser.
- HoodieSqlBase.g4 (x6): the statement rule keeps only createTable (without the CTAS "AS query" suffix, since CTAS cannot specify a schema and therefore never needs Hudi types at parse time) and the four index DDL statements; everything else hits passThrough and is parsed by the delegate.
- HoodieSparkX_YExtendedSqlAstBuilder (x6): reduced from ~3,540 to ~1,430 lines, keeping the visitor closure for CREATE TABLE (including BLOB/VECTOR column types), index DDL, and their transitive helpers (types, literals, properties, transforms).
- HoodieSparkX_YExtendedSqlParser (x6): the isHoodieCommand gate no longer intercepts time-travel text; only index DDL and BLOB/VECTOR statements route to the fork.
- HoodieCommonSqlParser: after delegation, rewrites Spark's native RelationTimeTravel into Hudi's TimeTravelRelation (transformDownWithSubqueries, available since Spark 3.3), so resolution flows through the existing Hudi rule unchanged.
- HoodieSparkBaseAnalysis: the parse-time validation that a time-travel timestamp has no column references or subqueries moves into the resolution rule.
- TestTimeTravelTable: adds coverage for the SYSTEM_TIME AS OF / FOR SYSTEM_TIME AS OF / FOR TIMESTAMP AS OF spellings now served by Spark's native grammar.
Behavior notes:
- … TimeTravelRelation for them); making that delegate cleanly to Spark native resolution is a possible follow-up.
- VERSION AS OF continues to throw "Version expression is not supported for time travel" for Hudi tables, as before.
Design differences from #12270: the imports/SqlBase.g4 copies are retained (only the statement surface narrowed) so ANTLR keeps generating the token vocabulary the index/createTable rules reference; time travel is bridged at the parser level instead of adding HoodieCatalog.loadTable(ident, timestamp/version) overloads, so behavior does not depend on spark_catalog being set to HoodieCatalog; and the BLOB/VECTOR CREATE TABLE path introduced after that PR is preserved.
Impact
~12,700 lines of forked parser code removed across the six Spark version modules. New Spark version modules no longer need to port the full AstBuilder fork. No public API change; SQL syntax accepted before and after is identical.
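The parser-level bridge described in the summary (rewrite Spark's native time-travel node into Hudi's node after delegation) can be sketched with a toy plan tree. This is a minimal illustration in plain Scala, not the PR's code: Plan, Table, Project, NativeTimeTravel (standing in for Spark's RelationTimeTravel), HudiTimeTravel (standing in for Hudi's TimeTravelRelation), and Bridge are all hypothetical, and the hand-written transformDown below does not descend into subquery expressions the way transformDownWithSubqueries does.

```scala
// Toy logical plan, NOT Spark's LogicalPlan hierarchy.
sealed trait Plan
final case class Table(name: String) extends Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan
// Stand-in for Spark's RelationTimeTravel produced by the delegate parser.
final case class NativeTimeTravel(relation: Plan, timestamp: Option[String], version: Option[String]) extends Plan
// Stand-in for Hudi's TimeTravelRelation consumed by the existing Hudi rule.
final case class HudiTimeTravel(relation: Plan, timestamp: Option[String], version: Option[String]) extends Plan

object Bridge {
  // Pre-order rewrite: apply the rule to a node first, then recurse into
  // the children of whatever the rule returned.
  def transformDown(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = rule.applyOrElse(plan, identity[Plan])
    applied match {
      case Project(cols, child)      => Project(cols, transformDown(child)(rule))
      case NativeTimeTravel(r, t, v) => NativeTimeTravel(transformDown(r)(rule), t, v)
      case HudiTimeTravel(r, t, v)   => HudiTimeTravel(transformDown(r)(rule), t, v)
      case leaf                      => leaf
    }
  }

  // Runs after the delegate parser returns: swap every native time-travel
  // node for the Hudi node, carrying timestamp and version over unchanged.
  def bridge(parsed: Plan): Plan = transformDown(parsed) {
    case NativeTimeTravel(relation, timestamp, version) =>
      HudiTimeTravel(relation, timestamp, version)
  }
}
```

Because the rewrite happens on the parsed plan rather than in a catalog overload, the toy bridge, like the design described above, does not care which catalog later resolves the table.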
Risk Level
medium
The retained visitor closure was verified per module (structural balance, no references to removed members, per-version visitCreateTable bodies preserved verbatim minus CTAS). Time-travel semantics are pinned by the existing TestTimeTravelTable suite plus the new spelling tests; index DDL by TestIndexSyntax/TestSecondaryIndex; BLOB/VECTOR DDL by the existing create-table tests. Full CI matrix (Spark 3.3-4.2) validates each fork copy.
Documentation Update
None. system_time as of and system_version as of remain accepted (natively by Spark).
Contributor's checklist