OpenLineage: Add Persistence schema and JDBC models #4827
dimas-b left a comment
Hi @iting0321, I see this PR is a draft at the moment, but I'm commenting proactively to allow time for a smooth discussion.
-- SCAN METRICS REPORT TABLE
-- ============================================================================

CREATE TABLE IF NOT EXISTS scan_metrics_report (
I think our SQL DDL scripts jam too many features into the same file. I'd like to propose splitting them by feature (MetaStore, Metrics, Events, OpenLineage).
Does OL data absolutely have to be in the same RDBMS schema as the MetaStore data?
Hi @dimas-b, I think there are two possible levels of change here:
Small change: keep the existing JDBC bootstrap contract: Polaris still asks DatabaseType for one top-level schema resource, e.g. postgres/schema-v5.sql, and bootstrap still executes a single SQL stream.
The change would be inside the schema resource loader: before returning that stream, it would expand fragment directives in schema-v5.sql. So schema-v5.sql can keep the existing core/events/metrics DDL inline, and replace only the OpenLineage section with:
-- POLARIS_SCHEMA_FRAGMENT: schema/lineage/lineage-v1.sql
The loader would resolve that to postgres/schema/lineage/lineage-v1.sql, concatenate it into the SQL stream, and then bootstrap executes the combined result.
The architecture will look like:
postgres/
├── schema-v5.sql
└── schema/
    └── lineage/
        └── lineage-v1.sql
Bigger change: make bootstrap understand separately versioned schema components, e.g. core = v5, lineage = v1. That would let lineage evolve independently from the MetaStore schema, but it requires changing how we record schema versions, how bootstrap decides what to run, and eventually how upgrades/migrations work per component.
For this PR, I’d prefer the small change: split the DDL by feature while preserving the current single-version bootstrap behavior. Then we can leave independently versioned persistence components as a follow-up design.
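To make the "small change" concrete, here is a minimal sketch of the expansion step, not the actual Polaris loader. The class name `SchemaFragmentExpander`, the `expand` method, and the `Function`-based fragment lookup are hypothetical; only the directive text and the fragment path come from the comment above. The table names in `main` are placeholders.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: expands fragment directives before bootstrap executes the script. */
final class SchemaFragmentExpander {
  // Directive prefix as proposed in the comment; everything else here is assumed.
  static final String DIRECTIVE = "-- POLARIS_SCHEMA_FRAGMENT:";

  /**
   * Returns the script with every directive line replaced by the named fragment's SQL.
   * The loader maps a path relative to the database-type root (e.g. "postgres/") to SQL text
   * and returns null when the fragment does not exist.
   */
  static String expand(String script, Function<String, String> fragmentLoader) {
    StringBuilder out = new StringBuilder();
    for (String line : script.split("\\R")) {
      String trimmed = line.trim();
      if (trimmed.startsWith(DIRECTIVE)) {
        String path = trimmed.substring(DIRECTIVE.length()).trim();
        String fragment = fragmentLoader.apply(path);
        if (fragment == null) {
          throw new IllegalArgumentException("Missing schema fragment: " + path);
        }
        out.append(fragment);
        if (!fragment.endsWith("\n")) {
          out.append('\n'); // keep the next statement on its own line
        }
      } else {
        out.append(line).append('\n');
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Placeholder DDL; the real fragment would hold the lineage tables.
    Map<String, String> resources =
        Map.of("schema/lineage/lineage-v1.sql", "CREATE TABLE example_lineage (id BIGINT);\n");
    String topLevel =
        "CREATE TABLE example_core (id BIGINT);\n"
            + "-- POLARIS_SCHEMA_FRAGMENT: schema/lineage/lineage-v1.sql\n";
    System.out.print(expand(topLevel, resources::get));
  }
}
```

Bootstrap would still see a single SQL stream, so the existing single-version contract would be unchanged under this approach.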
My preference is for the "bigger change". I understand it's more work, but it allows greater flexibility in evolving the new (OL) schema.
Also, users not interested in the OL feature will not have to deal with OL tables.
Given the recent influx of persistence-related features (OL, OSI, Metrics, Events), I think it is time to revisit and enhance schema management and bootstrap.
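As a rough illustration of what per-component versioning could involve, the sketch below plans which scripts a bootstrap would run given recorded and target versions. It is an assumption-heavy sketch: `ComponentSchemaPlanner`, the incremental `<component>-vN.sql` naming for every component, and the idea of a recorded-version map are all hypothetical and not part of the current Polaris bootstrap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: plans per-component schema scripts for bootstrap or upgrade. */
final class ComponentSchemaPlanner {
  /**
   * @param dbType resource root, e.g. "postgres"
   * @param recorded versions already applied, keyed by component (absent means not installed)
   * @param target versions the deployment wants; components that are not enabled are left out
   * @return resource paths to execute, in target iteration order
   */
  static List<String> plan(
      String dbType, Map<String, Integer> recorded, Map<String, Integer> target) {
    List<String> scripts = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : target.entrySet()) {
      String component = entry.getKey();
      int wanted = entry.getValue();
      int current = recorded.getOrDefault(component, 0);
      if (current > wanted) {
        throw new IllegalStateException(
            component + " schema v" + current + " is newer than supported v" + wanted);
      }
      // Assumed layout: one incremental script per component version.
      for (int v = current + 1; v <= wanted; v++) {
        scripts.add(dbType + "/schema/" + component + "/" + component + "-v" + v + ".sql");
      }
    }
    return scripts;
  }

  public static void main(String[] args) {
    Map<String, Integer> recorded = Map.of("core", 5);
    Map<String, Integer> target = new LinkedHashMap<>();
    target.put("core", 5);
    target.put("lineage", 1);
    // prints [postgres/schema/lineage/lineage-v1.sql]
    System.out.println(plan("postgres", recorded, target));
  }
}
```

A deployment that leaves `lineage` out of the target map would get no lineage scripts, which matches the point that users not interested in OL would not have to deal with OL tables.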
Thanks for your feedback! I’ll continue refactoring this as part of the "bigger change".
e11af5f to de80910
Pull request overview
Adds an OpenLineage-focused lineage domain model + service/persistence SPI in polaris-core, wires a runtime LineageService, and extends the relational-JDBC persistence layer with schema v5 + lineage CRUD/query support (including H2/Postgres/CockroachDB coverage).
Changes:
- Introduces core lineage API/SPI types (`LineageService`, `LineagePersistence`, request/graph models).
- Adds runtime lineage service + config surface, plus a default “disabled” persistence bean.
- Adds JDBC schema v5 lineage tables and implements `LineagePersistence` in `JdbcBasePersistenceImpl` with unit/integration tests.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| site/content/in-dev/unreleased/configuration/config-sections/smallrye-polaris_lineage.md | Adds generated(?) SmallRye config reference for lineage config keys. |
| site/content/in-dev/unreleased/configuration/config-sections/flags-polaris_features.md | Documents new ENABLE_LINEAGE feature flag. |
| runtime/service/src/test/java/org/apache/polaris/service/lineage/DisabledLineagePersistenceTest.java | Tests default disabled persistence throws clearly. |
| runtime/service/src/test/java/org/apache/polaris/service/lineage/DefaultLineageServiceTest.java | Tests feature/config gating and delegation behavior in runtime lineage service. |
| runtime/service/src/main/java/org/apache/polaris/service/lineage/LineageConfiguration.java | Defines SmallRye config mapping for polaris.lineage.*. |
| runtime/service/src/main/java/org/apache/polaris/service/lineage/DisabledLineagePersistence.java | Adds @DefaultBean placeholder persistence implementation. |
| runtime/service/src/main/java/org/apache/polaris/service/lineage/DefaultLineageService.java | Adds request-scoped runtime lineage service implementation. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageService.java | Introduces service boundary for lineage ingest/query. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageQueryRequest.java | Adds query request record for normalized lineage lookups. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineagePersistence.java | Adds persistence SPI contract for lineage storage backends. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageNodeType.java | Adds node type enum for lineage results. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageNode.java | Adds node model returned in lineage graph responses. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageIngestRequest.java | Adds normalized ingest request model independent of transport event shape. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageGraph.java | Adds normalized lineage query response model. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageGranularity.java | Adds query granularity enum (dataset vs column). |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageFieldReference.java | Adds field reference model for column lineage edges. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageFieldMapping.java | Adds column mapping model for column-granularity query responses. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageEdge.java | Adds dataset-level edge model. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageDirection.java | Adds query direction enum. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageDataset.java | Adds dataset identity model (catalog/namespace/name + optional entity id). |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageData.java | Adds dataset metadata payload used in lineage responses. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageColumnEdge.java | Adds column-level edge model. |
| polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java | Adds ENABLE_LINEAGE realm feature flag definition. |
| persistence/relational-jdbc/src/test/java/org/apache/polaris/persistence/relational/jdbc/LineagePersistenceTest.java | Adds H2 unit tests for lineage upserts, replacement semantics, and querying. |
| persistence/relational-jdbc/src/test/java/org/apache/polaris/persistence/relational/jdbc/LineagePersistenceJdbcIT.java | Adds Postgres/CockroachDB integration smoke tests for lineage persistence. |
| persistence/relational-jdbc/src/test/java/org/apache/polaris/persistence/relational/jdbc/JdbcBootstrapUtilsTest.java | Updates bootstrap schema version expectations to v5. |
| persistence/relational-jdbc/src/main/resources/postgres/schema-v5.sql | Adds Postgres JDBC schema v5 including lineage tables. |
| persistence/relational-jdbc/src/main/resources/h2/schema-v5.sql | Adds H2 JDBC schema v5 including lineage tables. |
| persistence/relational-jdbc/src/main/resources/cockroachdb/schema-v5.sql | Adds CockroachDB JDBC schema v5 including lineage tables. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/models/ModelLineageEdge.java | Adds JDBC model for lineage_edges. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/models/ModelLineageDataset.java | Adds JDBC model for lineage_datasets. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/models/ModelLineageColumnEdge.java | Adds JDBC model for lineage_column_edges. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/JdbcBasePersistenceImpl.java | Implements LineagePersistence and adds realm cleanup + schema guards. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/DatabaseType.java | Bumps latest JDBC schema version to v5 for all supported DBs. |
| persistence/relational-jdbc/build.gradle.kts | Adds Testcontainers CockroachDB dependency for integration tests. |
I think this persistence piece, together with #4826, is too early to merge as-is. The concern is not implementation quality, but commit point. Once we add the lineage persistence model and JDBC schema, changes to the lineage identity model, edge replacement semantics, column-lineage representation, or storage placement become schema/migration changes for users. That makes this much harder to revise than ordinary internal Java code. The dev@ discussion still has open questions around at least:
I think we should settle those points before merging the persistence SPI/schema work. It is fine to keep iterating on the PR, but I would not treat it as merge-ready yet.
Hi @snazy, thanks for clarifying. I understand your concern about committing to the persistence model and schema before the remaining design questions are resolved. I’ll continue following the dev@ discussion and update the PR based on the resulting consensus.
3f4072f to 97e4e14
Summary
Adds the JDBC persistence implementation for OpenLineage dataset and column lineage.
The JDBC backend now persists lineage datasets, dataset-level edges, and column-level edges, and supports querying depth-one upstream/downstream lineage graphs through the existing `LineagePersistence` contract.
Changes
Add JDBC schema v5 for H2, PostgreSQL, and CockroachDB with:
- `lineage_datasets`
- `lineage_edges`
- `lineage_column_edges`
Implements JDBC-backed LineagePersistence, and persists dataset, edge, and column-level lineage by realm and OpenLineage identity. It also handles latest-event replacement semantics, realm cleanup, schema-version validation, and adds H2 plus PostgreSQL/CockroachDB test coverage.
Testing
Added coverage in LineagePersistenceTest.java for dataset/edge/column upserts, timestamp monotonicity, stale upstream replacement, older-event protection, clearing upstream lineage, depth-one lineage graph loading, and the pre-v5 schema guard.
Added coverage in LineagePersistenceJdbcIT.java for PostgreSQL and CockroachDB lineage CRUD smoke tests.
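For readers unfamiliar with the latest-event replacement semantics listed above, the following in-memory sketch shows one plausible reading of them. It is not the JDBC implementation from this PR: `LatestEventEdges`, its methods, the dataset names, and the choice to let an event with an equal timestamp overwrite are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory sketch of latest-event-wins replacement of upstream edges. */
final class LatestEventEdges {
  private final Map<String, Long> lastEventTime = new HashMap<>();
  private final Map<String, Set<String>> upstreams = new HashMap<>();

  /**
   * Replaces the full upstream set of a target dataset with the set from this event.
   * Events older than the last applied one are ignored (older-event protection).
   *
   * @return true if the event was applied, false if it was ignored as stale
   */
  boolean apply(String target, long eventTimeMillis, Set<String> newUpstreams) {
    Long previous = lastEventTime.get(target);
    if (previous != null && previous > eventTimeMillis) {
      return false;
    }
    lastEventTime.put(target, eventTimeMillis);
    // Replacing the whole set drops stale upstreams; an empty set clears the lineage.
    upstreams.put(target, Set.copyOf(newUpstreams));
    return true;
  }

  Set<String> upstreamsOf(String target) {
    return upstreams.getOrDefault(target, Set.of());
  }

  public static void main(String[] args) {
    LatestEventEdges edges = new LatestEventEdges();
    edges.apply("db.orders_daily", 100L, Set.of("db.orders", "db.customers"));
    edges.apply("db.orders_daily", 200L, Set.of("db.orders")); // drops db.customers
    System.out.println(edges.apply("db.orders_daily", 150L, Set.of("db.stale"))); // prints false
    System.out.println(edges.upstreamsOf("db.orders_daily")); // prints [db.orders]
  }
}
```

A relational backend would need the same comparison to happen atomically with the write, for example inside one transaction per event.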
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)