Skip to content

refactor(common): Move Avro schema interning inside HoodieSchema#19144

Open
voonhous wants to merge 6 commits into
apache:masterfrom
voonhous:fix-schema-compat-checker-identity-memo
Open

refactor(common): Move Avro schema interning inside HoodieSchema#19144
voonhous wants to merge 6 commits into
apache:masterfrom
voonhous:fix-schema-compat-checker-identity-memo

Conversation

@voonhous

@voonhous voonhous commented Jul 2, 2026

Copy link
Copy Markdown
Member

Describe the issue this Pull Request addresses

Closes #19143.

Follow-up to #18967, where the review preference was to cache the Avro Schema -> HoodieSchema conversion inside HoodieSchema itself so call sites stay oblivious. That was blocked by HoodieSchemaCompatibilityChecker, whose memoization assumed a distinct schema instance per occurrence. This PR removes that assumption first, folds the interning in, then adds the guards needed to keep shared instances correct.

Summary and Changelog

HoodieSchema.fromAvroSchema now returns the canonical (interned) instance for a schema, so explicit call-site interning disappears. Construction paths (parse, create*) return fresh, owned instances that are safe to mutate:

flowchart LR
  subgraph before["Before (#18967)"]
    direction TB
    s1["9 per-record call sites"] -- "HoodieAvroSchemaCache.intern(avro)" --> a1["Avro-identity weak cache<br/>(public class)"]
    a1 --> v1["HoodieSchemaCache<br/>(value-keyed)"]
    n1["navigation: getElementType(),<br/>getTypes(), field schemas"] --> f1["fresh wrapper per call"]
  end
  subgraph after["After"]
    direction TB
    s2["per-record call sites<br/>+ navigation"] -- "fromAvroSchema(avro)" --> a2["Avro-identity weak cache<br/>(private detail)"]
    a2 --> v2["HoodieSchemaCache<br/>(content-keyed canonical pool)"]
    p2["parse() / create*"] --> o2["fresh owned instance<br/>(mutable, not interned)"]
  end
Loading

Core change

  • Rework HoodieSchemaCompatibilityChecker memoization to be identity-free: recursion is detected with an in-progress set keyed on value-equal (reader, writer) pairs, and only COMPATIBLE results are memoized, so every occurrence of an incompatible subschema reports its own field path. This also fixes a latent bug: recursive schemas stack-overflowed, because navigation allocates fresh wrappers so the identity-keyed memo never hit. The checker remains a port of Avro 1.10 SchemaCompatibility; the divergence is documented in the class javadoc.
  • fromAvroSchema interns: a private Avro-identity weak cache inside HoodieSchema fronts the value-keyed HoodieSchemaCache, keeping the per-record hot path a single identity hit. Sub-schema navigation returns canonical instances automatically.
  • Drop all HoodieAvroSchemaCache.intern(...) call sites and delete the class. Two HoodieSchema -> Avro -> HoodieSchema round-trips (RealtimeCompactedRecordReader#mergeRecord, ExpressionPayload insert path) now pass the HoodieSchema through directly.

Guards for shared instances

Interning makes equal schemas share one instance, so three correctness guards were added for code that mutates or serializes schemas (each was caught by CI on this PR):

  • Docs/aliases preserved. HoodieSchemaCache keys on the schema's full serialized content, not Avro equals (which ignores doc strings and aliases). Without this, a commented table schema collapses onto an equal uncommented one and Hive/catalog sync loses column comments.
  • Non-serializable schemas tolerated. Some projection/reader paths build schemas that are valid in memory but not serializable to JSON (two distinct nested records sharing a name). Interning now catches the toString failure and returns such schemas uncached instead of throwing (SchemaParseException: Can't redefine).
  • parse() returns an owned instance. Parsed schemas are frequently mutated (e.g. addProp in client-init callbacks). parse() now returns a fresh instance -- like the create* factories -- so such mutation cannot corrupt the shared canonical instance; fromAvroSchema remains the interning path.

Impact

No user-facing change. HoodieAvroSchemaCache (added in #18967, unreleased) is deleted. Canonical instances from fromAvroSchema are shared and must not be mutated (documented in the javadoc); parse/create* return owned instances that are safe to mutate. fromAvroSchema(s).getAvroSchema() may now return an equal-but-different Avro instance. Navigation returns cached canonical wrappers instead of allocating per call.

Risk Level

low. hudi-common schema/avro suites pass with interning enabled, including the compatibility-checker suite against canonical instances -- the exact combination that broke the first attempt inside #18967 (under-reported incompatibility locations). Tests cover: recursive schemas (StackOverflowError before the checker fix), per-path reporting of a shared incompatible subschema, canonicalization of conversion and navigation, doc/alias preservation through interning, tolerance of non-serializable schemas, and parse() mutation isolation. The three guards above each fixed a CI regression surfaced during review (TestHiveSyncTool, TestHoodieFileGroupReaderOnJava, TestHoodieClientInitCallback).

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…e identity

The ReaderWriter memo keyed on instance identity, assuming each subschema
occurrence is a distinct instance. HoodieSchema navigation allocates fresh
wrappers, so pairs never repeated: the memo never hit, and recursion
protection was dead, stack-overflowing on recursive schemas. Detect
recursion with an in-progress set keyed on value-equal pairs, and memoize
only compatible results so each occurrence of an incompatible subschema
reports its own field path. Unblocks canonical/interned HoodieSchema
instances (apache#19143).
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jul 2, 2026

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR reworks the schema-compatibility checker's memoization so it no longer relies on HoodieSchema instance identity — it detects recursion via an in-progress set of value-equal reader/writer pairs and memoizes only compatible results, which fixes the stack overflow on recursive schemas and enables interned/canonical HoodieSchema instances. The core logic looks correct after tracing the recursion and per-path reporting behavior; one edge case around repeated recomputation of shared incompatible subschemas is worth a look in the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

}
SchemaCompatibilityResult result = calculateCompatibility(reader, writer, locations);
inProgressPairs.remove(pair);
if (result.getCompatibilityType() == SchemaCompatibilityType.COMPATIBLE) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Since incompatible results are intentionally not memoized (to report each occurrence path), a subschema that is shared across many field paths gets fully recomputed at every path. For a schema where an incompatible named type is reachable through nested diamond-shaped sharing, the number of recomputations (and reported incompatibility entries) can grow multiplicatively with nesting depth. Non-recursive shared types don't hit inProgressPairs, so nothing bounds this. Is that a realistic concern for the schemas this runs on, or is depth small enough in practice that it's fine? A Set of already-reported (pair, locationPath) keys, or capping duplicate paths, could bound it if needed.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recomputation is output-bound: the contract is to report every occurrence path, and in a diamond the path count is itself 2^d, so anything that doesn't under-report does at least that much work. Compatible children of a shared subtree still memo-hit, so a revisit only walks the incompatible spine. Upstream Avro bounds this corner only by under-reporting (first path wins) -- the bug the four-path test pins. Pre-#18967 code had the same profile as this PR: fresh wrappers meant the identity memo never hit either.

Measured the corner (named type shared at 2 fields per level, incompatible leaf):

depth=10  paths=1024    reported=1024    9ms
depth=15  paths=32768   reported=32768   67ms
depth=18  paths=262144  reported=262144  262ms

Linear in output, ~1us per reported path, and this runs once per schema validation, not per record. Real schemas share a type at 2-3 fields, a level or two deep.

A (pair, locationPath) set wouldn't bound it: every (pair, path) in a diamond is distinct, so it never dedups. If this corner ever got real, the fix is memoizing incompatible results with subtree-relative locations re-anchored on reuse (noted on #19143) -- bounds the walk, still emits every path.

…roSchema

fromAvroSchema now returns the canonical instance for a schema: an
Avro-identity weak cache fronts the value-keyed HoodieSchemaCache, so
per-record call sites and sub-schema navigation get shared wrappers
without explicit interning. Drop all HoodieAvroSchemaCache.intern call
sites and delete the class; HoodieSchema -> Avro -> HoodieSchema
round-trips in RealtimeCompactedRecordReader and ExpressionPayload now
pass the HoodieSchema through directly. Safe now that the compatibility
checker no longer depends on instance identity (apache#19143).
@voonhous voonhous changed the title fix(common): Make schema compat checker independent of schema instanc… refactor(common): Move Avro schema interning inside HoodieSchema Jul 2, 2026

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the follow-up! This PR folds the Avro SchemaHoodieSchema interning inside HoodieSchema.fromAvroSchema and reworks HoodieSchemaCompatibilityChecker memoization to be identity-free. The checker rework, the non-loader interning approach, and the migrated call sites all look correct on tracing. One architectural note about shared-mutable instances is worth a look in the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

* @return new HoodieSchema instance
* @return the canonical HoodieSchema instance, or null if avroSchema is null
*/
public static HoodieSchema fromAvroSchema(Schema avroSchema) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Now that fromAvroSchema returns globally-shared canonical instances for every caller (previously only the ~9 hot-path sites went through interning), the public void setFields(...) at L1187 — which mutates avroSchema.setFields(...) in place — becomes a real footgun: any caller that mutates a fromAvroSchema-derived wrapper would silently corrupt the shared pool and also change a live cache key's hashCode. I couldn't find a current path that does this, so it's latent rather than a live bug, but would it be worth guarding (e.g. making the canonical wrapper immutable, or having setFields defensively copy) rather than relying on the javadoc invariant alone?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

…d aliases

Avro Schema.equals ignores doc strings and aliases, so value-interning
keyed on it collapsed schemas differing only in that metadata onto one
canonical wrapper -- hive sync then read empty field docs when building
column comments (TestHiveSyncTool#testSyncWithCommentedSchema). Key the
canonical pool on (wrapper class, serialized JSON) instead: the JSON
form covers docs and aliases, so interning is never lossy, while
content-identical schemas still converge. The wrapper class in the key
keeps logical-type subclasses from collapsing onto plain wrappers of
equal content.
@voonhous

voonhous commented Jul 3, 2026

Copy link
Copy Markdown
Member Author

CI red on the last push was a real regression, fixed in the latest commit: Avro Schema.equals ignores doc strings and aliases, so value-interning inside fromAvroSchema collapsed a commented table schema onto a previously-interned uncommented-but-equal wrapper -- hive sync then built column comments from empty doc()s (testSyncWithCommentedSchema / testUpdateTableComments).

Fix: the canonical pool now keys on the schema's full serialized content (plus wrapper class), which does cover docs/aliases -- interning is never lossy, while content-identical schemas still dedup. Added a unit test pinning doc/alias preservation through fromAvroSchema; the six failing hive-sync tests pass locally against the fix. HoodieSchema.equals still follows Avro semantics (doc/alias-blind) -- now documented on the method.

@voonhous voonhous requested a review from danny0405 July 3, 2026 06:20

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the follow-up work here! This PR folds the Avro Schema to HoodieSchema interning inside HoodieSchema.fromAvroSchema, removes the standalone HoodieAvroSchemaCache, and reworks the compatibility checker memoization to be identity-free (value-keyed recursion detection, compatible-only memoization). I traced the hashCode/equals contract of the new value-keyed ReaderWriter (consistent, so recursion detection and memoization are sound), the loader re-entrancy handling in fromAvroSchema, and the alias/doc preservation of SchemaContentKey, and found no new critical correctness issues in this round. The main themes worth a human eye (per-occurrence recomputation cost for shared subschemas, and the broader exposure of globally-shared canonical instances now that all callers intern) were already raised in prior rounds. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Jul 3, 2026
voonhous added 3 commits July 3, 2026 15:33
The content-key cache (added to preserve docs/aliases) derives its key from
avroSchema.toString(). Some projection/reader paths -- e.g. the ORC reader via
HoodieSchemaUtils.projectSchema -- build schemas that are valid in memory but not
serializable (two distinct nested records sharing a name), so toString throws
'Can't redefine: record', surfacing as SchemaParseException in
TestHoodieFileGroupReaderOnJava. The previous Avro equals/hashCode key never
serialized, so these interned fine. Catch the failure and return the schema
uncached; interning is only a de-duplication optimization.
…mutating shared schemas

fromAvroSchema interns and returns a shared canonical instance. HoodieSchema.parse
delegated to it, so parse(str).addProp(...) -- as client-init callbacks do -- mutated
the shared instance in place, leaking the change to every other holder of the same
schema string (TestHoodieClientInitCallback saw a config's write schema already carry
a callback-added prop before its own callback ran).

parse() now returns a fresh, owned instance via buildFromAvroSchema, consistent with
the create* factories which already return fresh instances. fromAvroSchema keeps
interning for the read/canonicalization path where schemas are treated as immutable.
…n-agnostic

testInterningToleratesNonSerializableSchema asserted parent.toString() throws
SchemaParseException as a precondition, but that behavior is Avro-version-specific:
1.11.x throws 'Can't redefine' on duplicate nested-record names, 1.12.x does not
(verified against both jars). The precondition failed under the 1.12.x build
profiles. Drop it and assert only the version-agnostic invariant -- interning
tolerates the schema and returns a valid RECORD wrapper (uncached when the content
key cannot be computed). The intern try/catch is still exercised on 1.11.x, where
fromAvroSchema would otherwise propagate the throw.
if (result.getCompatibilityType() == SchemaCompatibilityType.COMPATIBLE) {
// Incompatible results embed the field path they were found at, so they are
// recomputed per occurrence; memoizing them would report only the first path.
memoizedCompatibleResults.put(pair, result);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memoizing COMPATIBLE results keyed only on the value pair (reader, writer) can mask a union-scope name mismatch and flip the verdict to COMPATIBLE. checkSchemaNames only compares fully-qualified names when the pair is top-level or directly inside a UNION, so a named record/enum/fixed's verdict is path-dependent. If such a type is first reached at a plain (non-union) field path the name check is skipped and the pair is memoized COMPATIBLE; a later occurrence of the same pair inside a union then hits this memo and returns COMPATIBLE without ever running the union-scope name check, dropping a genuine NAME_MISMATCH. Reachable from the checkNaming=true entry points (isSchemaCompatible / checkSchemaCompatible). The old identity-keyed memo over fresh navigation wrappers never hit across occurrences, so each path recomputed and this could not happen. Consider folding the name-checking context into the memo key.

@hudi-bot

hudi-bot commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move Avro schema interning inside HoodieSchema once it decouples from Avro

4 participants