feat(parquet): apply column default values when reading missing fields (2/4) #792

Merged
wgtmac merged 3 commits into apache:main from huan233usc:feat/default-values-read-parquet
Jul 2, 2026
Conversation

@huan233usc

Contributor

What

Part 2 of 4 of Iceberg v3 column default-value support (POC #731), built on the
schema-layer support merged in #746.

When a column is present in the read (table) schema but absent from a Parquet
data file — because the column was added after those rows were written — fill it
with the column's v3 initial-default instead of null.
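
As a rough illustration of the read-side behavior described above, the following is a minimal sketch using only the standard library. It is not the iceberg-cpp implementation: `ReadField`, `Column`, and `FillMissingColumns` are hypothetical names, columns are modeled as vectors of optional 64-bit integers, and file columns are assumed to be matched by field id.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the iceberg-cpp types.
// std::nullopt models a null cell.
using Column = std::vector<std::optional<int64_t>>;

struct ReadField {
  int32_t field_id;
  std::string name;
  std::optional<int64_t> initial_default;  // v3 initial-default, if declared
};

// Builds the output columns in read-schema order. Columns present in the file
// (matched by field id, an assumption of this sketch) are passed through.
// Missing columns are filled with the initial-default repeated num_rows times,
// or with nulls when no default is declared.
std::vector<Column> FillMissingColumns(const std::vector<ReadField>& read_schema,
                                       const std::map<int32_t, Column>& file_columns,
                                       std::size_t num_rows) {
  std::vector<Column> out;
  out.reserve(read_schema.size());
  for (const ReadField& field : read_schema) {
    auto it = file_columns.find(field.field_id);
    if (it != file_columns.end()) {
      out.push_back(it->second);  // column exists in the data file
    } else {
      out.emplace_back(num_rows, field.initial_default);  // constant fill
    }
  }
  return out;
}
```

In this toy model, a file written with only field 1 and read through a schema that also declares field 2 with default `7` yields a second column holding `7` in every row rather than nulls.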

Changes

  • arrow/literal_util (new): a shared Arrow materializer that converts a
    Literal to an Arrow scalar (ToArrowScalar), to a constant-filled array
    (MakeDefaultArray), or appends it once to a builder
    (AppendDefaultToBuilder). Covers all primitive types, including
    decimal/uuid/fixed and the four timestamp variants.
  • Parquet projection (parquet_schema_util.cc): when a field is missing
    from the file and carries an initial-default, project it as
    FieldProjection::Kind::kDefault, mirroring the generic schema_util.cc path
    already merged in #746.
  • Parquet data (parquet_data_util.cc): materialize the kDefault branch
    via MakeDefaultArray.
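
The projection decision in the list above could be sketched roughly as follows. This is a standard-library-only illustration, not the code in `parquet_schema_util.cc`: `SchemaField` and `Classify` are hypothetical, the default is held as a plain string, and the handling of fields without a default (null for optional fields, an error for required ones) is an assumption of this sketch rather than something stated in this PR.

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical mirror of the projection kinds; only kDefault is named in the
// PR description, the other enumerators are assumptions for illustration.
enum class ProjectionKind { kProjected, kDefault, kNull };

struct SchemaField {
  int32_t field_id;
  bool optional;
  std::optional<std::string> initial_default;  // serialized default, if any
};

// Decides how a read-schema field is produced for a given data file.
ProjectionKind Classify(const SchemaField& field,
                        const std::set<int32_t>& file_field_ids) {
  if (file_field_ids.count(field.field_id) > 0) {
    return ProjectionKind::kProjected;  // read the column from the file
  }
  if (field.initial_default.has_value()) {
    return ProjectionKind::kDefault;  // missing, but a default is declared
  }
  if (field.optional) {
    return ProjectionKind::kNull;  // assumed fallback: fill with nulls
  }
  // Assumed behavior: a required field with no default cannot be produced.
  throw std::invalid_argument("missing required field without initial-default: id " +
                              std::to_string(field.field_id));
}
```

Checking for the default before checking optionality is what lets both required and optional missing fields take the default branch, which matches the cases the tests below exercise.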

Tests

  • literal_util_test: scalar/array/builder conversion across primitive types,
    null and narrowing-sentinel handling, and casting to the target Arrow type.
  • parquet_data_test: ProjectRecordBatch fills missing required and optional
    default fields, including a nested struct.
  • parquet_test: end-to-end — write a file with an old schema, then read it
    through ReaderFactoryRegistry with an evolved schema carrying defaults
    (ReadMissingFieldsWithDefaults).

Stack

  1. feat(schema): represent, serialize and validate v3 column default values (1/4) #746 — schema: represent / serialize / validate (merged)
  2. this PR — read path: Parquet
  3. read path: Avro (follows)
  4. schema evolution: addColumn / updateColumnDefault (follows)

@huan233usc huan233usc marked this pull request as draft June 29, 2026 21:51
@huan233usc huan233usc force-pushed the feat/default-values-read-parquet branch from 925d8cd to 378150c Compare June 29, 2026 22:47
@huan233usc huan233usc closed this Jun 29, 2026
@huan233usc huan233usc reopened this Jun 29, 2026
@huan233usc huan233usc marked this pull request as ready for review June 30, 2026 02:12
@huan233usc huan233usc force-pushed the feat/default-values-read-parquet branch from 6b08bed to 8db3067 Compare June 30, 2026 02:55
Comment thread src/iceberg/arrow/literal_util.cc Outdated
}

Result<std::shared_ptr<::arrow::Buffer>> ToArrowBuffer(
const std::vector<uint8_t>& bytes) {

Member


Add arrow::MemoryPool* as an input parameter.

Comment thread src/iceberg/arrow/literal_util.cc Outdated
Status AppendDefaultToBuilder(const Literal& literal, ::arrow::ArrayBuilder* builder) {
ICEBERG_ASSIGN_OR_RAISE(std::shared_ptr<::arrow::Scalar> scalar,
ToArrowScalar(literal));
if (!scalar->type->Equals(*builder->type())) {

Member


Could we mirror MakeDefaultArray's extension handling here? ToArrowScalar(UUID) returns fixed_size_binary(16), and Scalar::CastTo(arrow.uuid()) is not implemented in Arrow, so using this helper with an arrow.uuid builder would fail.

Contributor Author


Thanks for catching this. I took a second look, and this method is not used in the Parquet path (only in the Avro path in the POC). It seems it was accidentally added to this PR. I will include the fix in the Avro PR later.

@huan233usc huan233usc force-pushed the feat/default-values-read-parquet branch 3 times, most recently from 31f879b to 2f73f33 Compare July 1, 2026 19:02
…s (2/4)

When a column is present in the read schema but missing from a Parquet data
file (written before the column existed), fill it with the column's v3
initial-default instead of null. Adds a shared Arrow materializer
(arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a
kDefault projection branch in the Parquet schema/data projection paths.

The materializer wraps the storage array in the extension type for extension
types such as `arrow.uuid` (compute::Cast has no storage->extension kernel).

Part 2 of the v3 column-default-values work (POC apache#731), built on the schema
support merged in apache#746.
@huan233usc huan233usc force-pushed the feat/default-values-read-parquet branch from 2f73f33 to cfbc774 Compare July 1, 2026 19:09
@huan233usc huan233usc requested a review from wgtmac July 1, 2026 19:26
Comment thread src/iceberg/arrow/literal_util_internal.h Outdated
huan233usc and others added 2 commits July 1, 2026 19:55
Co-authored-by: Gang Wu <ustcwg@gmail.com>
The ToArrowScalar default argument was removed, so update the test call sites
to pass arrow::default_memory_pool() explicitly.
@wgtmac wgtmac merged commit cd4ca42 into apache:main Jul 2, 2026
21 checks passed
@wgtmac

wgtmac commented Jul 2, 2026

Member

Thank you, @huan233usc!
