refactor(common): Move Avro schema interning inside HoodieSchema#19144
refactor(common): Move Avro schema interning inside HoodieSchema#19144voonhous wants to merge 6 commits into
Conversation
…e identity The ReaderWriter memo keyed on instance identity, assuming each subschema occurrence is a distinct instance. HoodieSchema navigation allocates fresh wrappers, so pairs never repeated: the memo never hit, and recursion protection was dead, stack-overflowing on recursive schemas. Detect recursion with an in-progress set keyed on value-equal pairs, and memoize only compatible results so each occurrence of an incompatible subschema reports its own field path. Unblocks canonical/interned HoodieSchema instances (apache#19143).
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR reworks the schema-compatibility checker's memoization so it no longer relies on HoodieSchema instance identity — it detects recursion via an in-progress set of value-equal reader/writer pairs and memoizes only compatible results, which fixes the stack overflow on recursive schemas and enables interned/canonical HoodieSchema instances. The core logic looks correct after tracing the recursion and per-path reporting behavior; one edge case around repeated recomputation of shared incompatible subschemas is worth a look in the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
| } | ||
| SchemaCompatibilityResult result = calculateCompatibility(reader, writer, locations); | ||
| inProgressPairs.remove(pair); | ||
| if (result.getCompatibilityType() == SchemaCompatibilityType.COMPATIBLE) { |
There was a problem hiding this comment.
🤖 Since incompatible results are intentionally not memoized (to report each occurrence path), a subschema that is shared across many field paths gets fully recomputed at every path. For a schema where an incompatible named type is reachable through nested diamond-shaped sharing, the number of recomputations (and reported incompatibility entries) can grow multiplicatively with nesting depth. Non-recursive shared types don't hit inProgressPairs, so nothing bounds this. Is that a realistic concern for the schemas this runs on, or is depth small enough in practice that it's fine? A Set of already-reported (pair, locationPath) keys, or capping duplicate paths, could bound it if needed.
There was a problem hiding this comment.
The recomputation is output-bound: the contract is to report every occurrence path, and in a diamond the path count is itself 2^d, so anything that doesn't under-report does at least that much work. Compatible children of a shared subtree still memo-hit, so a revisit only walks the incompatible spine. Upstream Avro bounds this corner only by under-reporting (first path wins) -- the bug the four-path test pins. Pre-#18967 code had the same profile as this PR: fresh wrappers meant the identity memo never hit either.
Measured the corner (named type shared at 2 fields per level, incompatible leaf):
depth=10 paths=1024 reported=1024 9ms
depth=15 paths=32768 reported=32768 67ms
depth=18 paths=262144 reported=262144 262ms
Linear in output, ~1us per reported path, and this runs once per schema validation, not per record. Real schemas share a type at 2-3 fields, a level or two deep.
A (pair, locationPath) set wouldn't bound it: every (pair, path) in a diamond is distinct, so it never dedups. If this corner ever got real, the fix is memoizing incompatible results with subtree-relative locations re-anchored on reuse (noted on #19143) -- bounds the walk, still emits every path.
…roSchema fromAvroSchema now returns the canonical instance for a schema: an Avro-identity weak cache fronts the value-keyed HoodieSchemaCache, so per-record call sites and sub-schema navigation get shared wrappers without explicit interning. Drop all HoodieAvroSchemaCache.intern call sites and delete the class; HoodieSchema -> Avro -> HoodieSchema round-trips in RealtimeCompactedRecordReader and ExpressionPayload now pass the HoodieSchema through directly. Safe now that the compatibility checker no longer depends on instance identity (apache#19143).
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for the follow-up! This PR folds the Avro Schema → HoodieSchema interning inside HoodieSchema.fromAvroSchema and reworks HoodieSchemaCompatibilityChecker memoization to be identity-free. The checker rework, the non-loader interning approach, and the migrated call sites all look correct on tracing. One architectural note about shared-mutable instances is worth a look in the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
| * @return new HoodieSchema instance | ||
| * @return the canonical HoodieSchema instance, or null if avroSchema is null | ||
| */ | ||
| public static HoodieSchema fromAvroSchema(Schema avroSchema) { |
There was a problem hiding this comment.
🤖 Now that fromAvroSchema returns globally-shared canonical instances for every caller (previously only the ~9 hot-path sites went through interning), the public void setFields(...) at L1187 — which mutates avroSchema.setFields(...) in place — becomes a real footgun: any caller that mutates a fromAvroSchema-derived wrapper would silently corrupt the shared pool and also change a live cache key's hashCode. I couldn't find a current path that does this, so it's latent rather than a live bug, but would it be worth guarding (e.g. making the canonical wrapper immutable, or having setFields defensively copy) rather than relying on the javadoc invariant alone?
…d aliases Avro Schema.equals ignores doc strings and aliases, so value-interning keyed on it collapsed schemas differing only in that metadata onto one canonical wrapper -- hive sync then read empty field docs when building column comments (TestHiveSyncTool#testSyncWithCommentedSchema). Key the canonical pool on (wrapper class, serialized JSON) instead: the JSON form covers docs and aliases, so interning is never lossy, while content-identical schemas still converge. The wrapper class in the key keeps logical-type subclasses from collapsing onto plain wrappers of equal content.
|
CI red on the last push was a real regression, fixed in the latest commit: Avro Fix: the canonical pool now keys on the schema's full serialized content (plus wrapper class), which does cover docs/aliases -- interning is never lossy, while content-identical schemas still dedup. Added a unit test pinning doc/alias preservation through |
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for the follow-up work here! This PR folds the Avro Schema to HoodieSchema interning inside HoodieSchema.fromAvroSchema, removes the standalone HoodieAvroSchemaCache, and reworks the compatibility checker memoization to be identity-free (value-keyed recursion detection, compatible-only memoization). I traced the hashCode/equals contract of the new value-keyed ReaderWriter (consistent, so recursion detection and memoization are sound), the loader re-entrancy handling in fromAvroSchema, and the alias/doc preservation of SchemaContentKey, and found no new critical correctness issues in this round. The main themes worth a human eye (per-occurrence recomputation cost for shared subschemas, and the broader exposure of globally-shared canonical instances now that all callers intern) were already raised in prior rounds. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
The content-key cache (added to preserve docs/aliases) derives its key from avroSchema.toString(). Some projection/reader paths -- e.g. the ORC reader via HoodieSchemaUtils.projectSchema -- build schemas that are valid in memory but not serializable (two distinct nested records sharing a name), so toString throws 'Can't redefine: record', surfacing as SchemaParseException in TestHoodieFileGroupReaderOnJava. The previous Avro equals/hashCode key never serialized, so these interned fine. Catch the failure and return the schema uncached; interning is only a de-duplication optimization.
…mutating shared schemas fromAvroSchema interns and returns a shared canonical instance. HoodieSchema.parse delegated to it, so parse(str).addProp(...) -- as client-init callbacks do -- mutated the shared instance in place, leaking the change to every other holder of the same schema string (TestHoodieClientInitCallback saw a config's write schema already carry a callback-added prop before its own callback ran). parse() now returns a fresh, owned instance via buildFromAvroSchema, consistent with the create* factories which already return fresh instances. fromAvroSchema keeps interning for the read/canonicalization path where schemas are treated as immutable.
…n-agnostic testInterningToleratesNonSerializableSchema asserted parent.toString() throws SchemaParseException as a precondition, but that behavior is Avro-version-specific: 1.11.x throws 'Can't redefine' on duplicate nested-record names, 1.12.x does not (verified against both jars). The precondition failed under the 1.12.x build profiles. Drop it and assert only the version-agnostic invariant -- interning tolerates the schema and returns a valid RECORD wrapper (uncached when the content key cannot be computed). The intern try/catch is still exercised on 1.11.x, where fromAvroSchema would otherwise propagate the throw.
| if (result.getCompatibilityType() == SchemaCompatibilityType.COMPATIBLE) { | ||
| // Incompatible results embed the field path they were found at, so they are | ||
| // recomputed per occurrence; memoizing them would report only the first path. | ||
| memoizedCompatibleResults.put(pair, result); |
There was a problem hiding this comment.
Memoizing COMPATIBLE results keyed only on the value pair (reader, writer) can mask a union-scope name mismatch and flip the verdict to COMPATIBLE. checkSchemaNames only compares fully-qualified names when the pair is top-level or directly inside a UNION, so a named record/enum/fixed's verdict is path-dependent. If such a type is first reached at a plain (non-union) field path the name check is skipped and the pair is memoized COMPATIBLE; a later occurrence of the same pair inside a union then hits this memo and returns COMPATIBLE without ever running the union-scope name check, dropping a genuine NAME_MISMATCH. Reachable from the checkNaming=true entry points (isSchemaCompatible / checkSchemaCompatible). The old identity-keyed memo over fresh navigation wrappers never hit across occurrences, so each path recomputed and this could not happen. Consider folding the name-checking context into the memo key.
Describe the issue this Pull Request addresses
Closes #19143.
Follow-up to #18967, where the review preference was to cache the Avro
Schema->HoodieSchemaconversion insideHoodieSchemaitself so call sites stay oblivious. That was blocked byHoodieSchemaCompatibilityChecker, whose memoization assumed a distinct schema instance per occurrence. This PR removes that assumption first, folds the interning in, then adds the guards needed to keep shared instances correct.Summary and Changelog
HoodieSchema.fromAvroSchemanow returns the canonical (interned) instance for a schema, so explicit call-site interning disappears. Construction paths (parse,create*) return fresh, owned instances that are safe to mutate:flowchart LR subgraph before["Before (#18967)"] direction TB s1["9 per-record call sites"] -- "HoodieAvroSchemaCache.intern(avro)" --> a1["Avro-identity weak cache<br/>(public class)"] a1 --> v1["HoodieSchemaCache<br/>(value-keyed)"] n1["navigation: getElementType(),<br/>getTypes(), field schemas"] --> f1["fresh wrapper per call"] end subgraph after["After"] direction TB s2["per-record call sites<br/>+ navigation"] -- "fromAvroSchema(avro)" --> a2["Avro-identity weak cache<br/>(private detail)"] a2 --> v2["HoodieSchemaCache<br/>(content-keyed canonical pool)"] p2["parse() / create*"] --> o2["fresh owned instance<br/>(mutable, not interned)"] endCore change
HoodieSchemaCompatibilityCheckermemoization to be identity-free: recursion is detected with an in-progress set keyed on value-equal(reader, writer)pairs, and only COMPATIBLE results are memoized, so every occurrence of an incompatible subschema reports its own field path. This also fixes a latent bug: recursive schemas stack-overflowed, because navigation allocates fresh wrappers so the identity-keyed memo never hit. The checker remains a port of Avro 1.10SchemaCompatibility; the divergence is documented in the class javadoc.fromAvroSchemainterns: a private Avro-identity weak cache insideHoodieSchemafronts the value-keyedHoodieSchemaCache, keeping the per-record hot path a single identity hit. Sub-schema navigation returns canonical instances automatically.HoodieAvroSchemaCache.intern(...)call sites and delete the class. TwoHoodieSchema -> Avro -> HoodieSchemaround-trips (RealtimeCompactedRecordReader#mergeRecord,ExpressionPayloadinsert path) now pass theHoodieSchemathrough directly.Guards for shared instances
Interning makes equal schemas share one instance, so three correctness guards were added for code that mutates or serializes schemas (each was caught by CI on this PR):
HoodieSchemaCachekeys on the schema's full serialized content, not Avroequals(which ignores doc strings and aliases). Without this, a commented table schema collapses onto an equal uncommented one and Hive/catalog sync loses column comments.toStringfailure and returns such schemas uncached instead of throwing (SchemaParseException: Can't redefine).parse()returns an owned instance. Parsed schemas are frequently mutated (e.g.addPropin client-init callbacks).parse()now returns a fresh instance -- like thecreate*factories -- so such mutation cannot corrupt the shared canonical instance;fromAvroSchemaremains the interning path.Impact
No user-facing change.
HoodieAvroSchemaCache(added in #18967, unreleased) is deleted. Canonical instances fromfromAvroSchemaare shared and must not be mutated (documented in the javadoc);parse/create*return owned instances that are safe to mutate.fromAvroSchema(s).getAvroSchema()may now return an equal-but-different Avro instance. Navigation returns cached canonical wrappers instead of allocating per call.Risk Level
low. hudi-common schema/avro suites pass with interning enabled, including the compatibility-checker suite against canonical instances -- the exact combination that broke the first attempt inside #18967 (under-reported incompatibility locations). Tests cover: recursive schemas (StackOverflowError before the checker fix), per-path reporting of a shared incompatible subschema, canonicalization of conversion and navigation, doc/alias preservation through interning, tolerance of non-serializable schemas, and
parse()mutation isolation. The three guards above each fixed a CI regression surfaced during review (TestHiveSyncTool,TestHoodieFileGroupReaderOnJava,TestHoodieClientInitCallback).Documentation Update
none
Contributor's checklist