Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
f663e0e
feat(schema): represent, serialize and validate v3 column default values
huan233usc Jun 15, 2026
511d9c8
test: remove unused TableMetadataV3Valid.json resource
huan233usc Jun 20, 2026
9690529
test(json_serde): cover default-value serde in json_serde_test
huan233usc Jun 20, 2026
fe3b348
Merge remote-tracking branch 'upstream/main' into feat/default-values…
huan233usc Jun 20, 2026
d15a0fe
test(json_serde): fix clang-format in default-value tests
huan233usc Jun 20, 2026
180a9f9
fix(schema): align v3 default-value validation/serde with spec and Java
huan233usc Jun 23, 2026
2052e4f
chore: drop accidentally committed build-rest-tests.sh
huan233usc Jun 23, 2026
54d9636
test(temporal): add unit test for TemporalUtils::IsUtcOffset
huan233usc Jun 23, 2026
925611a
ci: re-trigger CI
huan233usc Jun 23, 2026
0b4d452
fix(schema): address review nits on default-value validation/serde
huan233usc Jun 26, 2026
b252b72
refactor(serde): return Result from ToJson for schema/type/metadata s…
huan233usc Jun 26, 2026
36a76ca
refactor(serde): avoid duplicate macro for fallible REST ToJson
huan233usc Jun 26, 2026
7f49ebc
refactor(serde): declare fallible REST ToJson inline
huan233usc Jun 26, 2026
04ee5c6
Merge remote-tracking branch 'upstream/main' into feat/default-values…
huan233usc Jun 26, 2026
0163d35
Merge remote-tracking branch 'upstream/main' into feat/default-values…
huan233usc Jun 27, 2026
26b1f1b
docs: clarify IsUtcOffset accepts Z/+00:00/-00:00; trim default-value…
huan233usc Jun 27, 2026
462d2d9
fix(schema): normalize default value to the field type at construction
huan233usc Jun 27, 2026
c58050f
refactor(schema): normalize default values via a fallible SchemaField…
huan233usc Jun 27, 2026
41c2bd1
refactor(schema): consolidate default getters and drop [[nodiscard]]
huan233usc Jun 28, 2026
bda2d78
fix(schema): model a null default value as absence
huan233usc Jun 28, 2026
38bbdb0
fix(schema): review follow-ups — lenient float default parse, must-be…
huan233usc Jun 28, 2026
6fe4640
revert: leave expression/json_serde.cc untouched (out of PR scope)
huan233usc Jun 28, 2026
2efca23
revert: keep [[nodiscard]] on pre-existing SchemaField methods
huan233usc Jun 28, 2026
70e029b
feat(schema): clarify error for types that cannot have a default value
huan233usc Jun 29, 2026
d022953
refactor(schema): drop unused default-value cast machinery
huan233usc Jun 29, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 30 additions & 2 deletions src/iceberg/json_serde.cc
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@
#include <nlohmann/json.hpp>

#include "iceberg/constants.h"
#include "iceberg/expression/json_serde_internal.h"
#include "iceberg/expression/literal.h"
#include "iceberg/json_serde_internal.h"
#include "iceberg/name_mapping.h"
#include "iceberg/partition_field.h"
Expand Down Expand Up @@ -298,6 +300,15 @@ nlohmann::json ToJson(const SchemaField& field) {
if (!field.doc().empty()) {
json[kDoc] = field.doc();
}
// Defaults are validated to be primitive literals matching the field type, so
Comment thread
huan233usc marked this conversation as resolved.
Outdated
// single-value serialization cannot fail here.
if (field.initial_default().has_value()) {
ICEBERG_ASSIGN_OR_THROW(json[kInitialDefault],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it cannot not fail, we shouldn't use ICEBERG_ASSIGN_OR_THROW here. Let's change the signature to return Result<nlohmann::json> just in case any error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wgtmac, this is in general a good call. But the changing the signature to return Resultnlohmann::json just in case any error requires changing lots of places that is not related on default values.

I updated the pr but also create a pure refactor in #785

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Merged that PR.

ToJson(field.initial_default()->get()));
}
if (field.write_default().has_value()) {
ICEBERG_ASSIGN_OR_THROW(json[kWriteDefault], ToJson(field.write_default()->get()));
}
return json;
}

Expand All @@ -310,7 +321,6 @@ nlohmann::json ToJson(const Type& type) {
nlohmann::json fields_json = nlohmann::json::array();
for (const auto& field : struct_type.fields()) {
fields_json.push_back(ToJson(field));
// TODO(gangwu): add default values
}
json[kFields] = fields_json;
return json;
Expand Down Expand Up @@ -552,9 +562,27 @@ Result<std::unique_ptr<SchemaField>> FieldFromJson(const nlohmann::json& json) {
ICEBERG_ASSIGN_OR_RAISE(auto name, GetJsonValue<std::string>(json, kName));
ICEBERG_ASSIGN_OR_RAISE(auto required, GetJsonValue<bool>(json, kRequired));
ICEBERG_ASSIGN_OR_RAISE(auto doc, GetJsonValueOrDefault<std::string>(json, kDoc));
ICEBERG_ASSIGN_OR_RAISE(std::optional<nlohmann::json> initial_default_json,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use auto?

GetJsonValueOptional<nlohmann::json>(json, kInitialDefault));
ICEBERG_ASSIGN_OR_RAISE(std::optional<nlohmann::json> write_default_json,
GetJsonValueOptional<nlohmann::json>(json, kWriteDefault));

std::shared_ptr<const Literal> initial_default;
if (initial_default_json.has_value()) {
ICEBERG_ASSIGN_OR_RAISE(Literal literal,
LiteralFromJson(*initial_default_json, type.get()));
Comment thread
wgtmac marked this conversation as resolved.
initial_default = std::make_shared<const Literal>(std::move(literal));
}
std::shared_ptr<const Literal> write_default;
if (write_default_json.has_value()) {
ICEBERG_ASSIGN_OR_RAISE(Literal literal,
LiteralFromJson(*write_default_json, type.get()));
write_default = std::make_shared<const Literal>(std::move(literal));
}
Comment thread
huan233usc marked this conversation as resolved.

return std::make_unique<SchemaField>(field_id, std::move(name), std::move(type),
!required, doc);
!required, doc, std::move(initial_default),
std::move(write_default));
}

Result<std::unique_ptr<Schema>> SchemaFromJson(const nlohmann::json& json) {
Expand Down
26 changes: 23 additions & 3 deletions src/iceberg/schema.cc
Original file line number Diff line number Diff line change
Expand Up @@ -116,9 +116,15 @@ std::shared_ptr<Type> ReassignTypeIds(const std::shared_ptr<Type>& type,
SchemaField ReassignField(const SchemaField& field, int32_t new_id,
const Schema::GetId& get_id, Schema::IdMap& ids_to_reassigned,
Schema::IdMap& ids_to_original) {
return {new_id, std::string(field.name()),
// Reassigning IDs only rewrites the field ID and nested type IDs; share the field's
// (immutable) default values rather than copying them.
return {new_id,
std::string(field.name()),
ReassignTypeIds(field.type(), get_id, ids_to_reassigned, ids_to_original),
field.optional(), std::string(field.doc())};
field.optional(),
std::string(field.doc()),
field.initial_default_ptr(),
field.write_default_ptr()};
}

std::vector<SchemaField> ReassignIds(std::vector<SchemaField> fields,
Expand Down Expand Up @@ -447,7 +453,21 @@ Status Schema::Validate(int32_t format_version) const {
}
}

// TODO(GuoTao.yu): Check default values when they are supported
// Only the initial-default is gated on format version: it changes how existing
// data files are read (rows written before the column existed materialize this
// value), so it requires the v3 reader contract. A write-default only affects
// values written going forward and does not reinterpret existing data.
if (field.initial_default().has_value() &&
format_version < TableMetadata::kMinFormatVersionDefaultValues) {
return InvalidSchema(
"Invalid initial default for {}: non-null default ({}) is not supported "
"until v{}",
field.name(), field.initial_default()->get(),
TableMetadata::kMinFormatVersionDefaultValues);
}
if (field.initial_default().has_value() || field.write_default().has_value()) {
ICEBERG_RETURN_UNEXPECTED(field.Validate());
}
Comment thread
huan233usc marked this conversation as resolved.
}

return {};
Expand Down
79 changes: 76 additions & 3 deletions src/iceberg/schema_field.cc
Original file line number Diff line number Diff line change
Expand Up @@ -21,19 +21,26 @@

#include <format>
#include <string_view>
#include <utility>

#include "iceberg/expression/literal.h"
#include "iceberg/type.h"
#include "iceberg/util/formatter.h" // IWYU pragma: keep
#include "iceberg/util/macros.h"

namespace iceberg {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always create a new literal even when their types are the same which is unlikely?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, remove the normalization method

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return Result<std::shared_ptr<const Literal>> so we don't swallow any unexpected error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SchemaField::SchemaField(int32_t field_id, std::string_view name,
std::shared_ptr<Type> type, bool optional, std::string_view doc)
std::shared_ptr<Type> type, bool optional, std::string_view doc,
std::shared_ptr<const Literal> initial_default,
std::shared_ptr<const Literal> write_default)
: field_id_(field_id),
name_(name),
type_(std::move(type)),
optional_(optional),
doc_(doc) {}
doc_(doc),
initial_default_(std::move(initial_default)),
write_default_(std::move(write_default)) {}

SchemaField SchemaField::MakeOptional(int32_t field_id, std::string_view name,
std::shared_ptr<Type> type, std::string_view doc) {
Expand All @@ -55,13 +62,65 @@ bool SchemaField::optional() const { return optional_; }

std::string_view SchemaField::doc() const { return doc_; }

std::optional<std::reference_wrapper<const Literal>> SchemaField::initial_default()
const {
if (initial_default_ == nullptr) {
return std::nullopt;
}
return std::cref(*initial_default_);
}

std::optional<std::reference_wrapper<const Literal>> SchemaField::write_default() const {
if (write_default_ == nullptr) {
return std::nullopt;
}
return std::cref(*write_default_);
}

const std::shared_ptr<const Literal>& SchemaField::initial_default_ptr() const {
return initial_default_;
}

const std::shared_ptr<const Literal>& SchemaField::write_default_ptr() const {
return write_default_;
}

namespace {

Status ValidateDefault(const SchemaField& field, const Literal& value,
std::string_view kind) {
if (value.IsNull() || value.IsAboveMax() || value.IsBelowMin()) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rejects explicit null defaults. The spec allows optional field defaults to be null and says they may be explicitly set. Java mostly models a null default as absence, but C++ now has a present-null Literal path that will fail validation and, for initial-default, also trips the format-version gate by presence rather than by non-null value.

return InvalidSchema("Invalid {} value for {}: must be a non-null value", kind,
field.name());
}
if (field.type() == nullptr || !field.type()->is_primitive()) {
return InvalidSchema(
"Invalid {} value for {}: default values are only supported for primitive types",
Comment thread
wgtmac marked this conversation as resolved.
kind, field.name());
}
if (*value.type() != *field.type()) {
Comment thread
wgtmac marked this conversation as resolved.
return InvalidSchema("{} of field {} has type {} but expected {}", kind, field.name(),
*value.type(), *field.type());
}
return {};
}

} // namespace

Status SchemaField::Validate() const {
if (name_.empty()) [[unlikely]] {
return InvalidSchema("SchemaField cannot have empty name");
}
if (type_ == nullptr) [[unlikely]] {
return InvalidSchema("SchemaField cannot have null type");
}
if (initial_default_ != nullptr) {
ICEBERG_RETURN_UNEXPECTED(
ValidateDefault(*this, *initial_default_, "initial-default"));
}
if (write_default_ != nullptr) {
ICEBERG_RETURN_UNEXPECTED(ValidateDefault(*this, *write_default_, "write-default"));
}
return {};
}

Expand All @@ -72,9 +131,23 @@ std::string SchemaField::ToString() const {
return result;
}

namespace {

bool DefaultEquals(const std::shared_ptr<const Literal>& lhs,
const std::shared_ptr<const Literal>& rhs) {
if (lhs == nullptr || rhs == nullptr) {
return lhs == rhs;
}
return *lhs == *rhs;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder that Literal::operator<=> returns unordered when any side IsNull() returns true so two null defaults do not equal. I think we should fix Literal::operator<=> to be null safe?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or add a overload Literal::NullSafeEquals()

}

} // namespace

bool SchemaField::Equals(const SchemaField& other) const {
return field_id_ == other.field_id_ && name_ == other.name_ && *type_ == *other.type_ &&
optional_ == other.optional_;
optional_ == other.optional_ &&
DefaultEquals(initial_default_, other.initial_default_) &&
DefaultEquals(write_default_, other.write_default_);
}

} // namespace iceberg
41 changes: 40 additions & 1 deletion src/iceberg/schema_field.h
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,9 @@
/// type (e.g. a struct).

#include <cstdint>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <string_view>

Expand All @@ -46,8 +48,14 @@ class ICEBERG_EXPORT SchemaField : public iceberg::util::Formattable {
/// \param[in] type The field type.
/// \param[in] optional Whether values of this field are required or nullable.
/// \param[in] doc Optional documentation string for the field.
/// \param[in] initial_default The v3 `initial-default` value, or null if absent. The
/// field shares ownership of the (immutable) value.
/// \param[in] write_default The v3 `write-default` value, or null if absent. The field
/// shares ownership of the (immutable) value.
SchemaField(int32_t field_id, std::string_view name, std::shared_ptr<Type> type,
bool optional, std::string_view doc = {});
bool optional, std::string_view doc = {},
std::shared_ptr<const Literal> initial_default = nullptr,
std::shared_ptr<const Literal> write_default = nullptr);

/// \brief Construct an optional (nullable) field.
static SchemaField MakeOptional(int32_t field_id, std::string_view name,
Expand All @@ -71,6 +79,32 @@ class ICEBERG_EXPORT SchemaField : public iceberg::util::Formattable {
/// \brief Get the field documentation.
std::string_view doc() const;

/// \brief Get the default value for this field used when reading rows written
/// before the field existed (v3 `initial-default`). Empty if absent.
///
/// The returned reference is a non-owning view into a value owned by this field;
/// it remains valid for the lifetime of this SchemaField.
[[nodiscard]] std::optional<std::reference_wrapper<const Literal>> initial_default()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove [[nodiscard]] because users should always know what they are doing when these are called. I think we need also clean up existing functions as well.

const;

/// \brief Get the default value for this field used when a writer does not
/// supply a value (v3 `write-default`). Empty if absent.
///
/// The returned reference is a non-owning view into a value owned by this field;
/// it remains valid for the lifetime of this SchemaField.
[[nodiscard]] std::optional<std::reference_wrapper<const Literal>> write_default()
const;

/// \brief Get the shared owning pointer to the `initial-default` value, or null if
/// absent. Prefer initial_default() for reading; this exists so a rebuilt field can
/// share the (immutable) value rather than copy it.
[[nodiscard]] const std::shared_ptr<const Literal>& initial_default_ptr() const;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks odd to provide both initial_default and initial_default_ptr. Should we keep one, like const std::shared_ptr<const Literal>& initial_default() const?


/// \brief Get the shared owning pointer to the `write-default` value, or null if
/// absent. Prefer write_default() for reading; this exists so a rebuilt field can
/// share the (immutable) value rather than copy it.
[[nodiscard]] const std::shared_ptr<const Literal>& write_default_ptr() const;

[[nodiscard]] std::string ToString() const override;

Status Validate() const;
Expand Down Expand Up @@ -100,6 +134,11 @@ class ICEBERG_EXPORT SchemaField : public iceberg::util::Formattable {
std::shared_ptr<Type> type_;
bool optional_;
std::string doc_;
// Default values are owned by this field and never mutated after being set; copies

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strong opinion but the comment looks too verbose.

// of the field share the same payload (reference-counted) instead of deep-copying,
// like `type_` above. Sharing is unobservable because the payload is immutable.
std::shared_ptr<const Literal> initial_default_;
std::shared_ptr<const Literal> write_default_;
Comment thread
huan233usc marked this conversation as resolved.
};

} // namespace iceberg
6 changes: 5 additions & 1 deletion src/iceberg/schema_util.cc
Original file line number Diff line number Diff line change
Expand Up @@ -172,10 +172,14 @@ Result<FieldProjection> ProjectNested(const Type& expected_type, const Type& sou
iter->second.local_index, prune_source));
} else if (MetadataColumns::IsMetadataColumn(field_id)) {
child_projection.kind = FieldProjection::Kind::kMetadata;
} else if (expected_field.initial_default().has_value()) {
// Rows written before the field existed assume its `initial-default` value.
child_projection.kind = FieldProjection::Kind::kDefault;
child_projection.from = expected_field.initial_default()->get();
} else if (expected_field.optional()) {
child_projection.kind = FieldProjection::Kind::kNull;
} else {
// TODO(gangwu): support default value for v3 and constant value
// TODO(gangwu): support constant value
return InvalidSchema("Missing required field: {}", expected_field.ToString());
}
result.children.emplace_back(std::move(child_projection));
Expand Down
Loading
Loading