feat: add first-class SELECT DISTINCT support to SQL AST #2678

Draft
paulteehan wants to merge 2 commits into main from feat/sql-ast-select-distinct

Conversation

@paulteehan
Contributor

Summary

  • Add distinct: bool = False to the SELECT AST node so callers can write SELECT(fields=[...], distinct=True) and get SELECT DISTINCT col1, col2 FROM ... natively
  • _build_select_sql_lines emits SELECT DISTINCT when any incoming SELECT element has distinct=True
  • Existing DISTINCT expression node and _build_distinct_sql are intentionally untouched — they remain correct for aggregate-level use (COUNT(DISTINCT x), SUM(DISTINCT x))

Why

SQL uses DISTINCT in two semantically different positions:

| Feature | Example | Semantics |
| --- | --- | --- |
| Set quantifier on SELECT | SELECT DISTINCT a, b FROM t | Dedupes the whole result set |
| Aggregate-level modifier | COUNT(DISTINCT a) | Per-aggregate input dedup |

Today the AST only models the second. The first has no representation, and the workaround of nesting DISTINCT([a, b, c]) into SELECT.fields renders as SELECT DISTINCT(a, b, c) — accepted loosely by Postgres/MySQL but rejected by DB2, Athena/Presto/Trino, and BigQuery, where parens around the field list are interpreted as a row constructor and change semantics. The existing warning at sql_ast.py:61-67 flags this exact shape as wrong.
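The difference between the two positions can be checked with a small, self-contained sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in; SQLite is not one of the dialects discussed in this PR, and the table `t` and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a few deliberately duplicated rows (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(1, "x"), (1, "x"), (1, "y"), (2, "y")],
)

# Set quantifier: duplicates are removed across the whole (a, b) row.
distinct_rows = conn.execute(
    "SELECT DISTINCT a, b FROM t ORDER BY a, b"
).fetchall()

# Aggregate-level modifier: only the inputs to COUNT are deduplicated.
distinct_a_count = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]
total_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
conn.close()

print(distinct_rows)     # [(1, 'x'), (1, 'y'), (2, 'y')]
print(distinct_a_count)  # 2
print(total_count)       # 4
```

Note that `(1, 'x')` and `(1, 'y')` both survive the set quantifier: the deduplication applies to the full row, not to column `a` alone.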

This was discovered during review of soda-extensions PR #356, which currently ships:

paginated_sql = data_source.sql_dialect.build_select_sql(statements)
paginated_sql = paginated_sql.replace("SELECT", "SELECT DISTINCT", 1)

After this change lands, that caller (and any future ones) can be migrated to SELECT(fields=columns, distinct=True) and the string-replace hack deleted.
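As a rough illustration of why the string replace is fragile, the sketch below applies the same `.replace(...)` call to two inputs. The helper name `add_distinct_by_replace` and the CTE query are hypothetical and are not taken from soda-extensions; whether the real caller ever renders SQL of that shape is not stated here.

```python
def add_distinct_by_replace(sql: str) -> str:
    # Same approach as the workaround quoted above: rewrite the first "SELECT".
    return sql.replace("SELECT", "SELECT DISTINCT", 1)


# Hypothetical inputs, chosen only to show where the first match lands.
simple_sql = "SELECT a, b FROM t"
cte_sql = "WITH recent AS (SELECT a, b FROM t) SELECT a, b FROM recent"

print(add_distinct_by_replace(simple_sql))
# SELECT DISTINCT a, b FROM t

print(add_distinct_by_replace(cte_sql))
# WITH recent AS (SELECT DISTINCT a, b FROM t) SELECT a, b FROM recent
```

In the second case the first textual match sits inside the CTE, so the outer query is left without DISTINCT. Carrying the flag on the AST node avoids depending on where the keyword happens to appear in the rendered text.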

What changed

  • soda-core/src/soda_core/common/sql_ast.py: SELECT gets distinct: bool = False; warning message updated to point callers at the new flag
  • soda-core/src/soda_core/common/sql_dialect.py: _build_select_sql_lines reads select_element.distinct and prepends DISTINCT to the keyword. Continuation-line indent kept at 7 spaces (no realignment to the width of SELECT DISTINCT; still readable)
  • soda-postgres/tests/unit/test_postgres_dialect.py — 6 new tests
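A minimal sketch of the shape of this change, under stated assumptions: `SelectSketch` and `build_select_lines` are hypothetical stand-ins, not the real `SELECT` node or `_build_select_sql_lines` from soda_core, and the one-field-per-line layout with trailing commas is assumed rather than taken from the actual dialect code.

```python
from dataclasses import dataclass


@dataclass
class SelectSketch:
    """Hypothetical stand-in for the SELECT AST node; not the soda_core class."""

    fields: list[str]
    distinct: bool = False  # default keeps existing call sites unchanged


def build_select_lines(select: SelectSketch) -> list[str]:
    """Render one field per line (assumed layout), keyword on the first line."""
    keyword = "SELECT DISTINCT" if select.distinct else "SELECT"
    indent = " " * 7  # fixed width of "SELECT ", with or without DISTINCT
    last = len(select.fields) - 1
    lines = []
    for i, field in enumerate(select.fields):
        prefix = f"{keyword} " if i == 0 else indent
        suffix = "," if i < last else ""
        lines.append(f"{prefix}{field}{suffix}")
    return lines


print("\n".join(build_select_lines(SelectSketch(["a", "b"], distinct=True))))
# SELECT DISTINCT a,
#        b
```

The flag only changes the keyword; the field list is emitted without surrounding parentheses, which is the point of modelling the set quantifier separately from the DISTINCT expression node.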

Acceptance criteria (from internal plan)

  • SELECT(fields=["a", "b"], distinct=True) renders SELECT DISTINCT a, b (no parens)
  • SELECT(fields=["a"]) continues to render SELECT a unchanged
  • COUNT(DISTINCT(expression="x")) continues to render COUNT(DISTINCT(x))
  • All existing call sites work without change (default distinct=False)
  • Unit tests cover: single col, multiple cols, full paginated shape (WHERE/ORDER BY/LIMIT/OFFSET), aggregate-level DISTINCT preserved, default-false

Test plan

  • uv run pytest soda-{postgres,bigquery,athena,databricks,duckdb,redshift,snowflake,sparkdf,sqlserver,synapse,trino,fabric}/tests/unit — 30/30 passed
  • Full-suite collection: 905 tests collect cleanly, no import regressions
  • Integration matrix (run in CI) — especially DB2, Athena, BigQuery, Trino once soda-extensions caller is migrated
  • Follow-up: migrate soda-reconciliation/.../reference_diff_check.py once soda-extensions PR #356 merges, then grep -rn '.replace("SELECT", "SELECT DISTINCT"' src/ should return zero hits

Out of scope

  • Removing/renaming the existing DISTINCT expression node — it has legitimate aggregate-level users
  • The TODO: refactor build_select_sql to use AST at sql_dialect.py:669 — independent and larger
  • SELECT DISTINCT ON (...) (Postgres-specific) — not needed here

🤖 Generated with Claude Code

Add `distinct: bool = False` to the SELECT AST node so callers can write
`SELECT(fields=[...], distinct=True)` and get `SELECT DISTINCT col1, col2 FROM ...`
without string-replace hacks or abusing the DISTINCT expression node (which
remains the correct model for aggregate-level use like `COUNT(DISTINCT x)`).

The two SQL features share a keyword but are different grammar productions:
- set quantifier on SELECT: deduplicates the whole result set
- aggregate-level modifier: per-aggregate input dedup

The DISTINCT expression node renders as `DISTINCT(...)` with parens, which
is correct inside aggregates but rejected as a SELECT set quantifier by
DB2, Athena/Presto/Trino, and BigQuery. Modeling them separately fixes
that and lets callers drop fragile workarounds like
`sql.replace("SELECT", "SELECT DISTINCT", 1)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Lepo00
Contributor

Lepo00 commented Apr 28, 2026

Thank you very much for this @paulteehan!

- Align continuation-line indent to keyword width so multi-column
  SELECT DISTINCT output stays under the first field (16-space indent
  for SELECT DISTINCT, 7-space indent unchanged for SELECT)
- Replace OR-aggregation of distinct across SELECT elements with a
  plain assignment; removes the "any flips all" footgun while
  preserving behaviour for the typical single-SELECT case
- Add unit test for SELECT(STAR(), distinct=True) covering the
  SqlExpression-not-list branch
- Add unit test asserting the updated warning fires when DISTINCT is
  nested inside SELECT.fields

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
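The two behavioural points in this commit (keyword-width indent, and plain assignment instead of OR-aggregation) can be sketched as follows. All function names are hypothetical, the one-field-per-line layout is assumed, and "last element wins" is only one plausible reading of "plain assignment"; the real `_build_select_sql_lines` may differ.

```python
def render_select(fields: list[str], distinct: bool = False) -> str:
    """Indent continuation lines to the keyword width (assumes non-empty fields)."""
    keyword = "SELECT DISTINCT " if distinct else "SELECT "
    indent = " " * len(keyword)  # 7 for "SELECT ", 16 for "SELECT DISTINCT "
    lines = [keyword + fields[0]] + [indent + field for field in fields[1:]]
    return ",\n".join(lines)


def distinct_by_or(flags: list[bool]) -> bool:
    """OR-aggregation: any element with distinct=True flips the whole statement."""
    distinct = False
    for flag in flags:
        distinct = distinct or flag
    return distinct


def distinct_by_assignment(flags: list[bool]) -> bool:
    """Plain assignment: the value from the last element is the one kept."""
    distinct = False
    for flag in flags:
        distinct = flag
    return distinct


print(render_select(["a", "b"], distinct=True))
# SELECT DISTINCT a,
#                 b

print(distinct_by_or([True, False]), distinct_by_assignment([True, False]))
# True False
print(distinct_by_or([True]), distinct_by_assignment([True]))
# True True
```

For the typical single-SELECT case both strategies agree, which matches the commit's claim that behaviour is preserved there.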
Contributor

@Niels-b Niels-b left a comment

Not sure if we should do it like this. We already have a DISTINCT in the AST, you should just use that, no?

In my opinion we should be able to do SELECT(DISTINCT([my_elements])). Open for discussion.

Also, what are we assuming will happen here? That with a SELECT DISTINCT the entire row is "distinct"?

3 participants