OpenLineage: Add Persistence schema and JDBC models #4827

Open. iting0321 wants to merge 18 commits into apache:main from iting0321:ol-jdbc-persistence-layer-2

@iting0321 commented Jun 18, 2026

Contributor

Summary

Adds the JDBC persistence implementation for OpenLineage dataset and column lineage.

The JDBC backend now persists lineage datasets, dataset-level edges, and column-level edges, and supports querying depth-one upstream/downstream lineage graphs through the existing LineagePersistence contract.
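To illustrate what a depth-one query means, the sketch below filters a flat list of dataset-level edges down to the direct upstream or downstream neighbours of one dataset. `DepthOneLineage`, `Edge`, `Direction`, and `neighbors` are hypothetical names for this example only; they are not the `LineagePersistence` API from this PR, and the real backend would run the equivalent filter in SQL.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names for illustration only; not the types added by this PR.
final class DepthOneLineage {
  enum Direction { UPSTREAM, DOWNSTREAM }

  // A dataset-level edge: data flows from `source` to `target`.
  record Edge(String source, String target) {}

  // Returns the direct (one hop) neighbours of `dataset` in the given direction.
  static Set<String> neighbors(List<Edge> edges, String dataset, Direction direction) {
    Set<String> result = new LinkedHashSet<>();
    for (Edge e : edges) {
      if (direction == Direction.UPSTREAM && e.target().equals(dataset)) {
        result.add(e.source());
      } else if (direction == Direction.DOWNSTREAM && e.source().equals(dataset)) {
        result.add(e.target());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Edge> edges = List.of(
        new Edge("raw.orders", "staging.orders"),
        new Edge("staging.orders", "marts.revenue"));
    System.out.println(neighbors(edges, "staging.orders", Direction.UPSTREAM));
    System.out.println(neighbors(edges, "staging.orders", Direction.DOWNSTREAM));
  }
}
```

Datasets two or more hops away are not returned, which is the "depth-one" restriction.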

Changes

Add JDBC schema v5 for H2, PostgreSQL, and CockroachDB with:

  • lineage_datasets
  • lineage_edges
  • lineage_column_edges

Implement JDBC-backed LineagePersistence, which persists dataset, edge, and column-level lineage by realm and OpenLineage identity. It also handles latest-event replacement semantics, realm cleanup, and schema-version validation. H2 and PostgreSQL/CockroachDB test coverage is included.
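One plausible reading of latest-event replacement, sketched in memory: an event replaces the stored upstream set of a target dataset unless it is older than the event already stored. `LatestEventLineageStore` and `Entry` are hypothetical names, and the choice to let an event with an equal timestamp replace the stored state is an assumption; the JDBC implementation may differ on both points.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// In-memory sketch; the real backend is JDBC and its exact rules may differ.
final class LatestEventLineageStore {
  // Hypothetical holder for the upstream set recorded by the latest accepted event.
  record Entry(Instant eventTime, Set<String> upstreams) {}

  private final Map<String, Entry> byTarget = new HashMap<>();

  // Applies an event for `target`; returns false when the event is older than stored state.
  boolean apply(String target, Set<String> upstreams, Instant eventTime) {
    Entry current = byTarget.get(target);
    if (current != null && eventTime.isBefore(current.eventTime())) {
      return false; // older-event protection: keep the newer state
    }
    // Replace wholesale so upstreams absent from the newer event are dropped;
    // an empty set clears the upstream lineage. Equal timestamps replace (assumption).
    byTarget.put(target, new Entry(eventTime, Set.copyOf(upstreams)));
    return true;
  }

  // Returns the upstream datasets currently recorded for `target`.
  Set<String> upstreamsOf(String target) {
    Entry current = byTarget.get(target);
    return current == null ? Set.of() : current.upstreams();
  }
}
```

Replacing the whole upstream set, rather than merging into it, is what removes stale upstream edges when a newer event no longer lists them.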

Testing

Added coverage in LineagePersistenceTest.java for dataset/edge/column upserts, timestamp monotonicity, stale upstream replacement, older-event protection, clearing upstream lineage, depth-one lineage graph loading, and the pre-v5 schema guard.

Added coverage in LineagePersistenceJdbcIT.java for PostgreSQL and CockroachDB lineage CRUD smoke tests.

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

@dimas-b left a comment

Contributor

Hi @iting0321, I see this PR is a draft ATM, but I am commenting proactively to allow time for a smooth discussion.

-- SCAN METRICS REPORT TABLE
-- ============================================================================

CREATE TABLE IF NOT EXISTS scan_metrics_report (

Contributor

I think our SQL DDL scripts jam too many features into the same file. I'd like to propose splitting them by feature (MetaStore, Metrics, Events, OpenLineage).

Does OL data absolutely have to be in the same RDBMS schema as the MetaStore data?

Contributor Author

Hi @dimas-b, I think there are two possible levels of change here:

Small change: keep the existing JDBC bootstrap contract. Polaris still asks DatabaseType for one top-level schema resource, e.g. postgres/schema-v5.sql, and bootstrap still executes a single SQL stream.

The change would be inside the schema resource loader: before returning that stream, it would expand fragment directives in schema-v5.sql. So schema-v5.sql can keep the existing core/events/metrics DDL inline, and replace only the OpenLineage section with:

-- POLARIS_SCHEMA_FRAGMENT: schema/lineage/lineage-v1.sql

The loader would resolve that to postgres/schema/lineage/lineage-v1.sql, concatenate it into the SQL stream, and then bootstrap executes the combined result.

The resource layout would look like:

 postgres/
  ├── schema-v5.sql
  └── schema/
      └── lineage/
          └── lineage-v1.sql
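The fragment expansion described for the small change could be sketched as follows. `SchemaFragmentExpander` and the in-memory `resources` map are hypothetical stand-ins; a real loader would read classpath resources, and only the directive text `-- POLARIS_SCHEMA_FRAGMENT:` comes from the proposal itself.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: `SchemaFragmentExpander` and the in-memory resource map are
// hypothetical; a real loader would read classpath resources instead.
final class SchemaFragmentExpander {
  static final String DIRECTIVE = "-- POLARIS_SCHEMA_FRAGMENT:";

  // Replaces each directive line with the content of the fragment it names,
  // resolved relative to `baseDir` (for example "postgres").
  static String expand(String schemaSql, String baseDir, Map<String, String> resources) {
    return schemaSql.lines()
        .map(line -> {
          String trimmed = line.trim();
          if (!trimmed.startsWith(DIRECTIVE)) {
            return line; // ordinary SQL passes through unchanged
          }
          String path = baseDir + "/" + trimmed.substring(DIRECTIVE.length()).trim();
          String fragment = resources.get(path);
          if (fragment == null) {
            throw new IllegalArgumentException("Missing schema fragment: " + path);
          }
          return fragment.stripTrailing();
        })
        .collect(Collectors.joining("\n"));
  }
}
```

Because the result is still one SQL string, the bootstrap step that executes a single stream would not need to change under this approach.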

Bigger change: make bootstrap understand separately versioned schema components, e.g. core = v5, lineage = v1. That would let lineage evolve independently from the MetaStore schema, but it requires changing how we record schema versions, how bootstrap decides what to run, and eventually how upgrades/migrations work per component.

For this PR, I’d prefer the small change: split the DDL by feature while preserving the current single-version bootstrap behavior. Then we can leave independently versioned persistence components as a follow-up design.
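For comparison, the "bigger change" implies that bootstrap compares a recorded version per component against a target version per component. The sketch below is one hypothetical way to plan that. The component names, the `ComponentBootstrapPlanner` class, the one-script-per-version assumption, and the script naming scheme are all illustrative and not part of Polaris.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical planner: component names, version numbers, and the script naming
// scheme are assumptions for illustration.
final class ComponentBootstrapPlanner {
  // Lists the scripts needed to bring each requested component from its recorded
  // version (0 if never installed) up to its target version.
  static List<String> plan(Map<String, Integer> recorded, Map<String, Integer> target) {
    List<String> scripts = new ArrayList<>();
    // TreeMap gives a deterministic (alphabetical) component order.
    for (Map.Entry<String, Integer> e : new TreeMap<>(target).entrySet()) {
      int from = recorded.getOrDefault(e.getKey(), 0);
      for (int v = from + 1; v <= e.getValue(); v++) {
        scripts.add(e.getKey() + "/" + e.getKey() + "-v" + v + ".sql");
      }
    }
    return scripts;
  }
}
```

A deployment that leaves `lineage` out of the target map would get no lineage scripts at all, while one at core v5 that adds lineage v1 would run only the lineage script.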

Contributor

My preference is for the "bigger change". I understand it's more work, but it allows greater flexibility in evolving the new (OL) schema.

Also, users not interested in the OL feature will not have to deal with OL tables.

Given the recent influx of persistence-related features (OL, OSI, Metrics, Events), I think it is time to revisit and enhance schema management and bootstrap.

Contributor Author

Thanks for your feedback! I’ll continue refactoring this as part of the "bigger change".

@iting0321 force-pushed the ol-jdbc-persistence-layer-2 branch from e11af5f to de80910 on June 21, 2026 12:08
@iting0321 marked this pull request as ready for review on June 22, 2026 04:26
Copilot AI review requested due to automatic review settings June 22, 2026 04:26

Copilot AI left a comment

Contributor

Pull request overview

Adds an OpenLineage-focused lineage domain model + service/persistence SPI in polaris-core, wires a runtime LineageService, and extends the relational-JDBC persistence layer with schema v5 + lineage CRUD/query support (including H2/Postgres/CockroachDB coverage).

Changes:

  • Introduces core lineage API/SPI types (LineageService, LineagePersistence, request/graph models).
  • Adds runtime lineage service + config surface, plus a default “disabled” persistence bean.
  • Adds JDBC schema v5 lineage tables and implements LineagePersistence in JdbcBasePersistenceImpl with unit/integration tests.
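The gating and "disabled" default described in these bullets could look roughly like the sketch below. Every name in it is a hypothetical stand-in rather than one of the PR's classes. The exception type and the choice to skip ingestion silently when the flag is off are assumptions as well.

```java
import java.util.function.BooleanSupplier;

// All names here are hypothetical stand-ins, not the PR's actual classes.
final class LineageGatingSketch {
  interface Persistence {
    void writeEdge(String source, String target);
  }

  // Placeholder used when no lineage-capable backend is configured.
  static final class DisabledPersistence implements Persistence {
    @Override
    public void writeEdge(String source, String target) {
      throw new UnsupportedOperationException("Lineage persistence is not configured");
    }
  }

  // Checks the feature flag before delegating to the persistence backend.
  static final class Service {
    private final BooleanSupplier lineageEnabled;
    private final Persistence persistence;

    Service(BooleanSupplier lineageEnabled, Persistence persistence) {
      this.lineageEnabled = lineageEnabled;
      this.persistence = persistence;
    }

    // Returns true when the edge was handed to persistence, false when the feature is off.
    boolean ingest(String source, String target) {
      if (!lineageEnabled.getAsBoolean()) {
        return false; // feature off: skip without touching persistence
      }
      persistence.writeEdge(source, target);
      return true;
    }
  }
}
```

With this shape, enabling the flag without configuring a real backend fails loudly on first use instead of silently dropping lineage.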

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| site/content/in-dev/unreleased/configuration/config-sections/smallrye-polaris_lineage.md | Adds generated(?) SmallRye config reference for lineage config keys. |
| site/content/in-dev/unreleased/configuration/config-sections/flags-polaris_features.md | Documents new ENABLE_LINEAGE feature flag. |
| runtime/service/src/test/java/org/apache/polaris/service/lineage/DisabledLineagePersistenceTest.java | Tests default disabled persistence throws clearly. |
| runtime/service/src/test/java/org/apache/polaris/service/lineage/DefaultLineageServiceTest.java | Tests feature/config gating and delegation behavior in runtime lineage service. |
| runtime/service/src/main/java/org/apache/polaris/service/lineage/LineageConfiguration.java | Defines SmallRye config mapping for polaris.lineage.*. |
| runtime/service/src/main/java/org/apache/polaris/service/lineage/DisabledLineagePersistence.java | Adds @DefaultBean placeholder persistence implementation. |
| runtime/service/src/main/java/org/apache/polaris/service/lineage/DefaultLineageService.java | Adds request-scoped runtime lineage service implementation. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageService.java | Introduces service boundary for lineage ingest/query. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageQueryRequest.java | Adds query request record for normalized lineage lookups. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineagePersistence.java | Adds persistence SPI contract for lineage storage backends. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageNodeType.java | Adds node type enum for lineage results. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageNode.java | Adds node model returned in lineage graph responses. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageIngestRequest.java | Adds normalized ingest request model independent of transport event shape. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageGraph.java | Adds normalized lineage query response model. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageGranularity.java | Adds query granularity enum (dataset vs column). |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageFieldReference.java | Adds field reference model for column lineage edges. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageFieldMapping.java | Adds column mapping model for column-granularity query responses. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageEdge.java | Adds dataset-level edge model. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageDirection.java | Adds query direction enum. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageDataset.java | Adds dataset identity model (catalog/namespace/name + optional entity id). |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageData.java | Adds dataset metadata payload used in lineage responses. |
| polaris-core/src/main/java/org/apache/polaris/core/lineage/LineageColumnEdge.java | Adds column-level edge model. |
| polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java | Adds ENABLE_LINEAGE realm feature flag definition. |
| persistence/relational-jdbc/src/test/java/org/apache/polaris/persistence/relational/jdbc/LineagePersistenceTest.java | Adds H2 unit tests for lineage upserts, replacement semantics, and querying. |
| persistence/relational-jdbc/src/test/java/org/apache/polaris/persistence/relational/jdbc/LineagePersistenceJdbcIT.java | Adds Postgres/CockroachDB integration smoke tests for lineage persistence. |
| persistence/relational-jdbc/src/test/java/org/apache/polaris/persistence/relational/jdbc/JdbcBootstrapUtilsTest.java | Updates bootstrap schema version expectations to v5. |
| persistence/relational-jdbc/src/main/resources/postgres/schema-v5.sql | Adds Postgres JDBC schema v5 including lineage tables. |
| persistence/relational-jdbc/src/main/resources/h2/schema-v5.sql | Adds H2 JDBC schema v5 including lineage tables. |
| persistence/relational-jdbc/src/main/resources/cockroachdb/schema-v5.sql | Adds CockroachDB JDBC schema v5 including lineage tables. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/models/ModelLineageEdge.java | Adds JDBC model for lineage_edges. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/models/ModelLineageDataset.java | Adds JDBC model for lineage_datasets. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/models/ModelLineageColumnEdge.java | Adds JDBC model for lineage_column_edges. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/JdbcBasePersistenceImpl.java | Implements LineagePersistence and adds realm cleanup + schema guards. |
| persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/DatabaseType.java | Bumps latest JDBC schema version to v5 for all supported DBs. |
| persistence/relational-jdbc/build.gradle.kts | Adds Testcontainers CockroachDB dependency for integration tests. |

snazy commented Jun 22, 2026

Member

I think this persistence piece, together with #4826, is too early to merge as-is.

The concern is not implementation quality but the commit point. Once we add the lineage persistence model and JDBC schema, changes to the lineage identity model, edge replacement semantics, column-lineage representation, or storage placement become schema/migration changes for users. That makes this much harder to revise than ordinary internal Java code.

The dev@ discussion still has open questions around at least:

  • what exact query/use case the reduced local graph is supposed to support;
  • whether “latest” edge replacement is the right semantic model;
  • whether lineage persistence must be tied to the same JDBC/metastore schema;
  • whether the REST/API and persistence pieces should be optional modules;
  • what backend capability contract non-JDBC implementations are expected to satisfy.

I think we should settle those points before merging the persistence SPI/schema work. It is fine to keep iterating on the PR, but I would not treat it as merge-ready yet.

@iting0321

Contributor Author

Hi @snazy, thanks for clarifying. I understand your concern about committing to the persistence model and schema before the remaining design questions are resolved. I’ll continue following the dev@ discussion and update the PR based on the resulting consensus.

@iting0321 force-pushed the ol-jdbc-persistence-layer-2 branch from 3f4072f to 97e4e14 on June 29, 2026 09:51