-
Notifications
You must be signed in to change notification settings - Fork 2.5k
refactor(common): Move Avro schema interning inside HoodieSchema #19144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 6 commits
bfdaa64
16c1f37
32713d4
851dd74
e4aa735
68ba597
09aba74
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,6 +27,8 @@ | |
| import org.apache.hudi.exception.HoodieIOException; | ||
| import org.apache.hudi.internal.schema.HoodieSchemaException; | ||
|
|
||
| import com.github.benmanes.caffeine.cache.Cache; | ||
| import com.github.benmanes.caffeine.cache.Caffeine; | ||
| import lombok.Getter; | ||
| import org.apache.avro.JsonProperties; | ||
| import org.apache.avro.LogicalType; | ||
|
|
@@ -94,6 +96,12 @@ | |
| public class HoodieSchema implements Serializable { | ||
| private static final long serialVersionUID = 1L; | ||
|
|
||
| // Avro-identity fast path onto the value-keyed HoodieSchemaCache, backing fromAvroSchema: | ||
| // records of one file share the same live Schema instance, so the per-record hot path is a | ||
| // single weak-identity hit with no wrapper allocation or type dispatch. | ||
| private static final Cache<Schema, HoodieSchema> AVRO_SCHEMA_CACHE = | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This 1024-cap cache is now populated by every navigation call, not just the old top-level intern sites: one parsed 500-col nullable schema is ~1500 identity-distinct Avro nodes (Avro primitives are not shared singletons), ~500 of which enter on a typical read, so two wide tables exceed the cap. Single-table jobs are fine (TinyLFU keeps the per-record-hot keys), but multi-table long-lived JVMs sustain insert pressure between hot-key accesses. Consider |
||
| Caffeine.newBuilder().weakKeys().maximumSize(1024).build(); | ||
|
|
||
| /** | ||
| * Constant representing a null JSON value, equivalent to JsonProperties.NULL_VALUE. | ||
| * This provides compatibility with Avro's JsonProperties while maintaining Hudi's API. | ||
|
|
@@ -338,13 +346,42 @@ private HoodieSchema(Schema avroSchema, List<HoodieSchemaField> fields) { | |
| /** | ||
| * Factory method to create HoodieSchema from an Avro schema. | ||
| * | ||
| * <p>Returns the canonical instance for the given schema, converting and interning it on | ||
| * first use: distinct Avro schema instances with identical serialized content converge on | ||
| * one shared wrapper through {@link HoodieSchemaCache}, with a weak identity fast path for | ||
| * the per-record hot path where all records of a file share the same live {@link Schema} | ||
| * instance. | ||
| * | ||
| * <p>Interning is never lossy: the canonicalization key covers the schema's full content, | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: this rationale is spelled out in full here, on |
||
| * including doc strings and aliases, which Avro equality ignores. Schemas that differ only | ||
| * in docs or aliases stay distinct wrappers (even though they are {@code equals()}), so | ||
| * metadata consumed downstream (e.g. catalog sync column comments, alias-based field | ||
| * matching) is always preserved. | ||
| * | ||
| * <p>Because the canonical wrapper may have been created from a content-identical but | ||
| * different Avro schema instance, {@code fromAvroSchema(s).getAvroSchema()} does not | ||
| * necessarily return {@code s} itself. Canonical instances are shared: neither the wrapper | ||
| * nor its underlying Avro schema may be mutated. | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This contract is enforced only by javadoc: |
||
| * | ||
| * @param avroSchema the Avro schema to wrap | ||
| * @return new HoodieSchema instance | ||
| * @return the canonical HoodieSchema instance, or null if avroSchema is null | ||
| */ | ||
| public static HoodieSchema fromAvroSchema(Schema avroSchema) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤖 Now that |
||
| if (avroSchema == null) { | ||
| return null; | ||
| } | ||
| HoodieSchema canonical = AVRO_SCHEMA_CACHE.getIfPresent(avroSchema); | ||
| if (canonical == null) { | ||
| // getIfPresent/put rather than a computing get: construction may re-enter fromAvroSchema | ||
| // for subschemas (e.g. Variant validation), which a Caffeine loader must not do. A racy | ||
| // duplicate build converges on one instance through the value-keyed intern. | ||
| canonical = HoodieSchemaCache.intern(buildFromAvroSchema(avroSchema)); | ||
| AVRO_SCHEMA_CACHE.put(avroSchema, canonical); | ||
| } | ||
| return canonical; | ||
| } | ||
|
|
||
| private static HoodieSchema buildFromAvroSchema(Schema avroSchema) { | ||
| LogicalType logicalType = avroSchema.getLogicalType(); | ||
| if (logicalType != null) { | ||
| if (logicalType instanceof LogicalTypes.Decimal) { | ||
|
|
@@ -1586,6 +1623,14 @@ public String toString(boolean pretty) { | |
| return avroSchema.toString(pretty); | ||
| } | ||
|
|
||
| /** | ||
| * Equality delegates to Avro {@link Schema#equals}, which IGNORES doc strings and aliases: | ||
| * two schemas differing only in that metadata are equal. Consumers that care about docs or | ||
| * aliases (e.g. catalog sync comments, alias-based field matching) must not rely on | ||
| * equality or value-keyed maps to tell such schemas apart. Canonicalization in | ||
| * {@link HoodieSchemaCache} deliberately keys on the full serialized content instead, so | ||
| * interning never conflates them. | ||
| */ | ||
| @Override | ||
| public boolean equals(Object obj) { | ||
| if (this == obj) { | ||
|
|
@@ -1644,7 +1689,10 @@ public HoodieSchema parse(String jsonSchema) { | |
|
|
||
| try { | ||
| Schema avroSchema = avroParser.parse(jsonSchema); | ||
| return fromAvroSchema(avroSchema); | ||
| // Return a fresh, owned instance rather than the interned canonical one: parsed schemas | ||
| // are frequently mutated by callers (e.g. addProp in client-init callbacks), and mutating | ||
| // a shared interned instance would corrupt it for every other holder of the same schema. | ||
| return buildFromAvroSchema(avroSchema); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
shouldn't the schema always been immutable once it is built? |
||
| } catch (IllegalArgumentException e) { | ||
| throw new HoodieAvroSchemaException("Invalid schema string format", e); | ||
| } catch (Exception e) { | ||
|
|
@@ -1664,7 +1712,8 @@ public HoodieSchema parse(InputStream inputStream) { | |
|
|
||
| try { | ||
| Schema avroSchema = avroParser.parse(inputStream); | ||
| return fromAvroSchema(avroSchema); | ||
| // See parse(String): return an owned, mutable instance, not the interned canonical one. | ||
| return buildFromAvroSchema(avroSchema); | ||
| } catch (IOException e) { | ||
| throw new HoodieIOException("Failed to parse schema from InputStream", e); | ||
| } catch (IllegalArgumentException e) { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,6 +20,7 @@ | |
|
|
||
| import com.github.benmanes.caffeine.cache.Caffeine; | ||
| import com.github.benmanes.caffeine.cache.LoadingCache; | ||
| import org.apache.avro.AvroRuntimeException; | ||
|
|
||
| /** | ||
| * A global cache for HoodieSchema instances to ensure that there is only one | ||
|
|
@@ -28,21 +29,77 @@ | |
| * <p>This is a global cache which works for a JVM lifecycle. | ||
| * A collection of schema instances are maintained. | ||
| * | ||
| * <p>This value-keyed pool is the canonicalization mechanism behind | ||
| * {@link HoodieSchema#fromAvroSchema}, and can also be used directly to intern schemas | ||
| * produced without an Avro source (builders, converters). | ||
| * | ||
| * <p>Interning is never lossy: entries are keyed on the schema's full serialized content, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the intern should be used on hotspot code path that schema is mostly used in read/write path, so the missing of comment or doc should be fine, let's not introduce unnecessary complexities. |
||
| * NOT on {@link HoodieSchema#equals}. Avro equality (which HoodieSchema equality delegates | ||
| * to) ignores doc strings and aliases, so keying on it would collapse schemas that differ | ||
| * only in that metadata and silently drop it -- docs drive catalog sync column comments and | ||
| * aliases drive schema-evolution field matching. Schemas that differ in docs or aliases | ||
| * intern to distinct canonical instances even though they are {@code equals()}. | ||
| * | ||
| * <p>NOTE: The schema which is used frequently should be cached through this cache. | ||
| */ | ||
| public class HoodieSchemaCache { | ||
|
|
||
| // Ensure that there is only one variable instance of the same schema within an entire JVM lifetime | ||
| private static final LoadingCache<HoodieSchema, HoodieSchema> SCHEMA_CACHE = | ||
| Caffeine.newBuilder().weakValues().maximumSize(1024).build(k -> k); | ||
| private static final LoadingCache<SchemaContentKey, HoodieSchema> SCHEMA_CACHE = | ||
| Caffeine.newBuilder().weakValues().maximumSize(1024).build(key -> key.schema); | ||
|
|
||
| /** | ||
| * Get schema variable from global cache. If not found, put it into the cache and then return it. | ||
| * | ||
| * <p>Two schemas converge on one canonical instance only when their full serialized form | ||
| * (including doc strings and aliases) is identical; see the class javadoc. | ||
| * | ||
| * <p>A schema that is valid in memory but cannot be serialized to JSON -- e.g. two distinct | ||
| * nested records that share a name, as some projection/reader paths produce -- has no content | ||
| * key, so it is returned uncached instead of interned. Canonicalization is only a | ||
| * de-duplication optimization, so skipping it stays correct. | ||
| * | ||
| * @param schema schema to get | ||
| * @return if found, return the exist schema variable, otherwise return the param itself. | ||
| */ | ||
| public static HoodieSchema intern(HoodieSchema schema) { | ||
| return SCHEMA_CACHE.get(schema); | ||
| SchemaContentKey key; | ||
| try { | ||
| key = new SchemaContentKey(schema); | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This now serializes the full schema JSON on every |
||
| } catch (AvroRuntimeException e) { | ||
| // Not serializable -> no content key derivable; skip interning rather than fail the caller. | ||
| return schema; | ||
| } | ||
| return SCHEMA_CACHE.get(key); | ||
| } | ||
|
|
||
| /** | ||
| * Content-complete cache key: the serialized JSON form covers doc strings and aliases that | ||
| * Avro equality ignores. The wrapper class is part of the key so a logical-type subclass | ||
| * (e.g. {@link HoodieSchema.Decimal}) never collapses onto a plain wrapper of equal content, | ||
| * which would break downcasts. | ||
| */ | ||
| private static final class SchemaContentKey { | ||
| private final HoodieSchema schema; | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The key strongly references the value, so |
||
| private final String contentJson; | ||
|
|
||
| SchemaContentKey(HoodieSchema schema) { | ||
| this.schema = schema; | ||
| this.contentJson = schema.getAvroSchema().toString(); | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Verified issue (Spark 4 profiles): on Avro 1.12.x |
||
| } | ||
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return contentJson.hashCode(); | ||
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object obj) { | ||
| if (!(obj instanceof SchemaContentKey)) { | ||
| return false; | ||
| } | ||
| SchemaContentKey that = (SchemaContentKey) obj; | ||
| return schema.getClass() == that.schema.getClass() && contentJson.equals(that.contentJson); | ||
| } | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-existing (since #13646), but since this method is being touched: in the loop below, a dotted path with a null intermediate struct NPEs --
getValue(record, "a.b")witha == nullassignscurrentRecord = nulland the next iteration derefs it.HoodieAvroUtils#getNestedFieldValguards this case and returns null. Worth adding the null check here plus aTestAvroRecordContextcase withaddress = null(nested ordering/precombine fields reach this path).