GH-45947 : [C++][Parquet] Variant encoding#50122
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
7f51026 to
70a364b
Compare
70a364b to
8ab28f0
Compare
| DecodeMetadata(encoded1.metadata.data(), | ||
| static_cast<int64_t>(encoded1.metadata.size()))); | ||
|
|
||
| // Build a new variant reusing the same metadata |
There was a problem hiding this comment.
I would add a test where we reuse existing metadata but make a mistake with the data types. For example, according to the metadata, we should write a string but we write an integer instead. I think there is currently no validation for this case in VariantBuilder—is that on purpose?
Either way, the final Variant will be malformed, so the round-trip should fail.
There was a problem hiding this comment.
Thanks for the suggestion. After working through the refactoring, I realized this test isn't possible because of how the variant format works, the metadata dictionary contains only key names (string interning for object field names), not value types. The format is self-describing: each value carries its own type tag in its header byte.
A "type mismatch between metadata and data" is architecturally impossible because metadata doesn't encode types at all. The refactored VariantMetadata struct's docstring makes this explicit: "This is NOT a schema, it contains key names only, not value types."
What you can test (and what the validation tests cover) is structural invariant violations: field IDs that exceed the dictionary size, offsets that point out of bounds, truncated values, etc. These are exercised by the ValidateVariant tests.
| class ARROW_EXPORT VariantBuilder { | ||
| public: | ||
| VariantBuilder(); | ||
| explicit VariantBuilder(const VariantMetadata& existing_metadata); |
There was a problem hiding this comment.
It might also be useful to pass a value buffer to VariantBuilder to initialize buffer_. This way, it will be possible to continue building an existing Variant value.
There was a problem hiding this comment.
Thanks for the valid point. The variant binary format is immutable by design, inserting a field into an existing object requires rewriting the header (field IDs and offsets are packed arrays, not linked structures). So "modify in place" isn't feasible at the format level.
The refactored design handles the "start from existing data" use case through:
VariantBuilder(const VariantMetadata& existing_metadata): constructor that pre-populates the key dictionary from an existing metadata buffer.- Read->rebuild pattern:
VariantObjectView(read existing) +ObjectScope+UnsafeAppendEncoded(write new). This enables zero-copy field transfer between variants. It's exactly what the shredding reconstruction path uses.
This matches Rust's Variant (read-only) vs VariantBuilder (write-only) architecture, and is the standard immutable-format pattern (FlatBuffers, Cap'n Proto, etc.).
| /// builder.Int(30); | ||
| /// builder.FinishObject(start, fields); | ||
| /// ARROW_ASSIGN_OR_RAISE(auto result, builder.Finish()); | ||
| class ARROW_EXPORT VariantBuilder { |
There was a problem hiding this comment.
This API is great for building new variants. Did you also consider adding an API that allows modifying existing Variant values? We would need to add a function to VariantBuilder similar to FindObjectField from the decoding PR, which would "move"| the context of VariantBuilder to a specific place/field. Once called, you would then be able to override the existing value.
There was a problem hiding this comment.
The refactored design makes this separation explicit and deliberate:
- Views: zero-copy navigation of existing bytes (stack-allocated)
- Builder: produces new bytes from scratch (dictionary preserved across
Finish())
"Modify existing" = read the parts you want to keep via views, write them into a new builder with UnsafeAppendEncoded, add/change what you need, finish.
A higher-level mutable convenience API (think nlohmann::json-style) could be layered on top in a follow-up. For the first implementation, keeping the primitives clean and composable felt like the right foundation.
8ab28f0 to
f6b8e66
Compare
|
Force-pushed a refactored version. The initial implementation carried over some Go patterns (manual start/finish without safety guarantees, separate lookup buffer). After reviewing the feedback — particularly around initialization from existing buffers and the question about modifying existing variants, I reworked the builder with C++ idiom at the center of the design. Key changes:
Regarding the earlier review questions about modifying existing variants and type mismatch testing, I've addressed these in replies above with architectural context from the refactored design. All 335 tests pass end-to-end with |
Rationale for this change
Implements Variant binary encoding (the write side of decoding from GH-45946). Part of GH-45937 (Add variant support to C++). Depends on #50121.
Note: This PR depends on #50121 (Variant decoding) and is branched from it. Please review/merge #50121 first.
What changes are included in this PR?
Adds
VariantBuilderclass and RAII scope helpers tovariant.h/variant_builder.cc:VariantBuilder: move-only encoder supporting all 21 primitive types + containers.Int()auto-selects smallest encoding width.String()auto-selects short-string(≤63 bytes) vs long-string encoding. Dictionary (key interning) preserved across
Finish()calls for column-scan workloads.ObjectScope/ListScope: RAII scopes returned byStartObject()/StartList().Destructor auto-rolls back the buffer if
Finish()is not called — safe underexceptions, early returns, and scope exit.
[[nodiscard]]on scope-returning functions prevents accidental discard.via
SetAllowDuplicates(true)for shredding reconstruction (GH-45948).FinishObject()skipsstd::sortwhen fields arealready in lexicographic order (common for schema-driven insertion).
dict_usesis_transparenttags for forward-compatibilitywith C++20 heterogeneous lookup (eliminates the old
lookup_buf_member variable).Design decisions:
Offset/NextField/FinishObject) retained for shredding internalsFinishObjectsorts fields in-place (documented non-const-ref parameter)Finish()— amortizes key lookup for repeated schemasAre these changes tested?
221 total tests (87 new encoder + 134 decoder) pass with
BUILD_WARNING_LEVEL=CHECKINcovering: all primitives, auto-sizing, int boundaries, short/long string boundary,
special floats (NaN, ±Inf), arrays, objects, duplicate rejection, sorting, RAII scopes,
reset/reuse, builder from existing metadata, large containers (>255 elements), and
round-trip verification via decoder views.
Are there any user-facing changes?
New public API:
VariantBuilder,ObjectScope,ListScopeinvariant.h.VariantBuilder::EncodedVariantreturn type fromFinish().AI Disclosure: AI coding assistants were used during development for scaffolding,
test generation, and review iteration. All code has been reviewed, debugged, and
verified by the author who owns and understands the changes.