feat: add first-class SELECT DISTINCT support to SQL AST#2678
Draft
paulteehan wants to merge 2 commits into
Draft
feat: add first-class SELECT DISTINCT support to SQL AST#2678paulteehan wants to merge 2 commits into
paulteehan wants to merge 2 commits into
Conversation
Add `distinct: bool = False` to the SELECT AST node so callers can write
`SELECT(fields=[...], distinct=True)` and get `SELECT DISTINCT col1, col2 FROM ...`
without string-replace hacks or abusing the DISTINCT expression node (which
remains the correct model for aggregate-level use like `COUNT(DISTINCT x)`).
The two SQL features share a keyword but are different grammar productions:
- set quantifier on SELECT: deduplicates the whole result set
- aggregate-level modifier: per-aggregate input dedup
The DISTINCT expression node renders as `DISTINCT(...)` with parens, which
is correct inside aggregates but rejected as a SELECT set quantifier by
DB2, Athena/Presto/Trino, and BigQuery. Modeling them separately fixes
that and lets callers drop fragile workarounds like
`sql.replace("SELECT", "SELECT DISTINCT", 1)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
Thank you very much for this @paulteehan! |
- Align continuation-line indent to keyword width so multi-column SELECT DISTINCT output stays under the first field (16-space indent for SELECT DISTINCT, 7-space indent unchanged for SELECT) - Replace OR-aggregation of distinct across SELECT elements with a plain assignment; removes the "any flips all" footgun while preserving behaviour for the typical single-SELECT case - Add unit test for SELECT(STAR(), distinct=True) covering the SqlExpression-not-list branch - Add unit test asserting the updated warning fires when DISTINCT is nested inside SELECT.fields Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Niels-b
requested changes
Apr 29, 2026
Contributor
Niels-b
left a comment
There was a problem hiding this comment.
Not sure if we should do it like this. We already have a DISTINCT in the AST, you should just use that, no?
In my opinion we should be able to do SELECT(DISTINCT([my_elements])). Open for discussion.
Also, what are we assuming will happen here? If we have a SELECT DISTINCT that the entire row is "distinct"?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
distinct: bool = Falseto theSELECTAST node so callers can writeSELECT(fields=[...], distinct=True)and getSELECT DISTINCT col1, col2 FROM ...natively_build_select_sql_linesemitsSELECT DISTINCTwhen any incomingSELECTelement hasdistinct=TrueDISTINCTexpression node and_build_distinct_sqlare intentionally untouched — they remain correct for aggregate-level use (COUNT(DISTINCT x),SUM(DISTINCT x))Why
SQL uses
DISTINCTin two semantically different positions:SELECT DISTINCT a, b FROM tCOUNT(DISTINCT a)Today the AST only models the second. The first has no representation, and the workaround of nesting
DISTINCT([a, b, c])intoSELECT.fieldsrenders asSELECT DISTINCT(a, b, c)— accepted loosely by Postgres/MySQL but rejected by DB2, Athena/Presto/Trino, and BigQuery, where parens around the field list are interpreted as a row constructor and change semantics. The existing warning atsql_ast.py:61-67flags this exact shape as wrong.This was discovered during review of soda-extensions PR #356, which currently ships:
After this change lands, that caller (and any future ones) can be migrated to
SELECT(fields=columns, distinct=True)and the string-replace hack deleted.What changed
soda-core/src/soda_core/common/sql_ast.py—SELECTgetsdistinct: bool = False; warning message updated to point callers at the new flagsoda-core/src/soda_core/common/sql_dialect.py—_build_select_sql_linesreadsselect_element.distinctand prependsDISTINCTto the keyword. Continuation-line indent kept at 7 spaces (no realignment toSELECT DISTINCT's width — still readable)soda-postgres/tests/unit/test_postgres_dialect.py— 6 new testsAcceptance criteria (from internal plan)
SELECT(fields=[\"a\", \"b\"], distinct=True)rendersSELECT DISTINCT a, b(no parens)SELECT(fields=[\"a\"])continues to renderSELECT aunchangedCOUNT(DISTINCT(expression=\"x\"))continues to renderCOUNT(DISTINCT(x))distinct=False)Test plan
uv run pytest soda-{postgres,bigquery,athena,databricks,duckdb,redshift,snowflake,sparkdf,sqlserver,synapse,trino,fabric}/tests/unit— 30/30 passedsoda-reconciliation/.../reference_diff_check.pyonce soda-extensions PR #356 merges, thengrep -rn '.replace(\"SELECT\", \"SELECT DISTINCT\"' src/should return zero hitsOut of scope
DISTINCTexpression node — it has legitimate aggregate-level usersTODO: refactor build_select_sql to use ASTatsql_dialect.py:669— independent and largerSELECT DISTINCT ON (...)(Postgres-specific) — not needed here🤖 Generated with Claude Code