feat: add first-class SELECT DISTINCT support to SQL AST by paulteehan · Pull Request #2678 · sodadata/soda-core

paulteehan · 2026-04-28T13:03:42Z

Summary

Add distinct: bool = False to the SELECT AST node so callers can write SELECT(fields=[...], distinct=True) and get SELECT DISTINCT col1, col2 FROM ... natively
_build_select_sql_lines emits SELECT DISTINCT when any incoming SELECT element has distinct=True
Existing DISTINCT expression node and _build_distinct_sql are intentionally untouched — they remain correct for aggregate-level use (COUNT(DISTINCT x), SUM(DISTINCT x))

Why

SQL uses DISTINCT in two semantically different positions:

| Feature | Example | Semantics |
| --- | --- | --- |
| Set quantifier on SELECT | `SELECT DISTINCT a, b FROM t` | Dedupes the whole result set |
| Aggregate-level modifier | `COUNT(DISTINCT a)` | Per-aggregate input dedup |

Today the AST only models the second. The first has no representation, and the workaround of nesting DISTINCT([a, b, c]) into SELECT.fields renders as SELECT DISTINCT(a, b, c) — accepted loosely by Postgres/MySQL but rejected by DB2, Athena/Presto/Trino, and BigQuery, where parens around the field list are interpreted as a row constructor and change semantics. The existing warning at sql_ast.py:61-67 flags this exact shape as wrong.
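The difference between the two positions can be checked against any SQL engine. The snippet below is a minimal illustration using Python's built-in `sqlite3` module; the table `t` and its rows are made up for the example and have nothing to do with soda-core's test data.

```python
import sqlite3

# In-memory database with a few duplicate rows (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (1, "x"), (1, "y"), (2, "y")],
)

# Set quantifier: dedupes whole (a, b) rows, so (1, 'x') appears once.
distinct_rows = conn.execute(
    "SELECT DISTINCT a, b FROM t ORDER BY a, b"
).fetchall()

# Aggregate-level modifier: dedupes only the inputs to this COUNT.
distinct_a_count = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]

print(distinct_rows)     # [(1, 'x'), (1, 'y'), (2, 'y')]
print(distinct_a_count)  # 2
conn.close()
```

Note that `(1, 'x')` and `(1, 'y')` both survive the set quantifier: it compares entire rows, not individual columns.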

This was discovered during review of soda-extensions PR #356, which currently ships:

paginated_sql = data_source.sql_dialect.build_select_sql(statements)
paginated_sql = paginated_sql.replace("SELECT", "SELECT DISTINCT", 1)

After this change lands, that caller (and any future ones) can be migrated to SELECT(fields=columns, distinct=True) and the string-replace hack deleted.
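To see why the string replace is fragile, consider SQL where the first `SELECT` token is not the outermost one. The input below is a hypothetical query with a CTE, chosen only for illustration; the PR does not state that `build_select_sql` ever emits this shape.

```python
# Hypothetical SQL: the first SELECT belongs to the CTE, not the outer query.
sql = "WITH recent AS (SELECT id FROM events) SELECT id FROM recent"

# Same call shape as the workaround quoted above: patch the first occurrence.
patched = sql.replace("SELECT", "SELECT DISTINCT", 1)

print(patched)
# WITH recent AS (SELECT DISTINCT id FROM events) SELECT id FROM recent
```

The outer query is left without `DISTINCT`, so the result set is not deduplicated as intended. Carrying the flag in the AST avoids depending on where the keyword happens to appear in the rendered text.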

What changed

soda-core/src/soda_core/common/sql_ast.py — SELECT gets distinct: bool = False; warning message updated to point callers at the new flag
soda-core/src/soda_core/common/sql_dialect.py — _build_select_sql_lines reads select_element.distinct and prepends DISTINCT to the keyword. Continuation-line indent kept at 7 spaces (no realignment to SELECT DISTINCT 's width — still readable)
soda-postgres/tests/unit/test_postgres_dialect.py — 6 new tests
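As a rough sketch of the shape of this change, the stand-in below adds a `distinct` flag to a simplified `SELECT` dataclass and reads it when rendering. The class and function here are hypothetical simplifications, not soda-core's actual code: the real `_build_select_sql_lines` handles a list of SQL elements, and the one-field-per-line layout is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SELECT:
    """Hypothetical stand-in for the SELECT AST node (not soda-core's class)."""
    fields: List[str]
    distinct: bool = False  # defaults to False so existing callers are unaffected


def build_select_sql_lines(select: SELECT) -> List[str]:
    """Render one field per line; layout is assumed for illustration."""
    keyword = "SELECT DISTINCT " if select.distinct else "SELECT "
    indent = " " * 7  # continuation indent kept at 7 spaces, as described above
    last = len(select.fields) - 1
    lines = []
    for i, field in enumerate(select.fields):
        prefix = keyword if i == 0 else indent
        suffix = "," if i < last else ""
        lines.append(f"{prefix}{field}{suffix}")
    return lines


print("\n".join(build_select_sql_lines(SELECT(fields=["a", "b"], distinct=True))))
```

With the default `distinct=False`, the same function renders a plain `SELECT`, which is what keeps existing call sites unchanged.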

Acceptance criteria (from internal plan)

SELECT(fields=["a", "b"], distinct=True) renders SELECT DISTINCT a, b (no parens)
SELECT(fields=["a"]) continues to render SELECT a unchanged
COUNT(DISTINCT(expression=\"x\")) continues to render COUNT(DISTINCT(x))
All existing call sites work without change (default distinct=False)
Unit tests cover: single col, multiple cols, full paginated shape (WHERE/ORDER BY/LIMIT/OFFSET), aggregate-level DISTINCT preserved, default-false
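The aggregate-level criterion can be pictured with a pair of toy nodes. `DISTINCT`, `COUNT`, and `render` below are hypothetical stand-ins written for this example; they only mimic the rendered strings quoted in this PR and are not soda-core's implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DISTINCT:
    """Toy expression node: always renders with parentheses."""
    expression: Union[str, List[str]]


@dataclass
class COUNT:
    """Toy aggregate node wrapping a column name or a DISTINCT node."""
    expression: Union[str, DISTINCT]


def render(node: Union[str, DISTINCT, COUNT]) -> str:
    if isinstance(node, str):
        return node
    if isinstance(node, DISTINCT):
        inner = node.expression
        items = inner if isinstance(inner, list) else [inner]
        return f"DISTINCT({', '.join(items)})"
    if isinstance(node, COUNT):
        return f"COUNT({render(node.expression)})"
    raise TypeError(f"unsupported node: {node!r}")


# Aggregate-level use: the parenthesised form is what we want here.
print(render(COUNT(DISTINCT(expression="x"))))  # COUNT(DISTINCT(x))

# The same node nested in a field list yields the shape the warning flags.
print("SELECT " + render(DISTINCT(["a", "b", "c"])))  # SELECT DISTINCT(a, b, c)
```

The same node gives the right output inside an aggregate and the problematic parenthesised form at the top of a SELECT, which is the motivation for keeping the node and adding a separate flag.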

Test plan

uv run pytest soda-{postgres,bigquery,athena,databricks,duckdb,redshift,snowflake,sparkdf,sqlserver,synapse,trino,fabric}/tests/unit — 30/30 passed
Full-suite collection: 905 tests collect cleanly, no import regressions
Integration matrix (run in CI) — especially DB2, Athena, BigQuery, Trino once soda-extensions caller is migrated
Follow-up: migrate soda-reconciliation/.../reference_diff_check.py once soda-extensions PR #356 merges, then grep -rn '.replace("SELECT", "SELECT DISTINCT"' src/ should return zero hits

Out of scope

Removing/renaming the existing DISTINCT expression node — it has legitimate aggregate-level users
The TODO: refactor build_select_sql to use AST at sql_dialect.py:669 — independent and larger
SELECT DISTINCT ON (...) (Postgres-specific) — not needed here

🤖 Generated with Claude Code

Add `distinct: bool = False` to the SELECT AST node so callers can write `SELECT(fields=[...], distinct=True)` and get `SELECT DISTINCT col1, col2 FROM ...` without string-replace hacks or abusing the DISTINCT expression node (which remains the correct model for aggregate-level use like `COUNT(DISTINCT x)`).

The two SQL features share a keyword but are different grammar productions:
- set quantifier on SELECT: deduplicates the whole result set
- aggregate-level modifier: per-aggregate input dedup

The DISTINCT expression node renders as `DISTINCT(...)` with parens, which is correct inside aggregates but rejected as a SELECT set quantifier by DB2, Athena/Presto/Trino, and BigQuery. Modeling them separately fixes that and lets callers drop fragile workarounds like `sql.replace("SELECT", "SELECT DISTINCT", 1)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lepo00 · 2026-04-28T14:37:14Z

Thank you very much for this @paulteehan!

- Align continuation-line indent to keyword width so multi-column SELECT DISTINCT output stays under the first field (16-space indent for SELECT DISTINCT, 7-space indent unchanged for SELECT)
- Replace OR-aggregation of distinct across SELECT elements with a plain assignment; removes the "any flips all" footgun while preserving behaviour for the typical single-SELECT case
- Add unit test for SELECT(STAR(), distinct=True) covering the SqlExpression-not-list branch
- Add unit test asserting the updated warning fires when DISTINCT is nested inside SELECT.fields

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
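The indent alignment in this commit amounts to deriving the continuation indent from the keyword's width instead of hard-coding it. A minimal sketch of that idea follows; the helper names `select_keyword` and `render_fields` are made up for the example and are not functions in soda-core.

```python
from typing import List


def select_keyword(distinct: bool) -> str:
    """Keyword including its trailing space."""
    return "SELECT DISTINCT " if distinct else "SELECT "


def render_fields(fields: List[str], distinct: bool = False) -> str:
    """Put each field on its own line, aligned under the first field."""
    keyword = select_keyword(distinct)
    indent = " " * len(keyword)  # 16 for SELECT DISTINCT, 7 for SELECT
    return keyword + (",\n" + indent).join(fields)


print(render_fields(["a", "b", "c"], distinct=True))
print(render_fields(["a", "b", "c"]))
```

Because the indent is computed from the keyword, the plain `SELECT` case keeps its 7-space continuation without a separate code path.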

sonarqubecloud · 2026-04-29T11:33:31Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Niels-b

Not sure if we should do it like this. We already have a DISTINCT in the AST, you should just use that, no?

In my opinion we should be able to do SELECT(DISTINCT([my_elements])). Open for discussion.

Also, what are we assuming will happen here? That with a SELECT DISTINCT the entire row is "distinct"?

Niels-b requested changes Apr 29, 2026

paulteehan wants to merge 2 commits into main from feat/sql-ast-select-distinct

3 participants
