Skip to content

feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time#19081

Open
wangxianghu wants to merge 1 commit into
apache:masterfrom
wangxianghu:event_time_ttl
Open

feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time#19081
wangxianghu wants to merge 1 commit into
apache:masterfrom
wangxianghu:event_time_ttl

Conversation

@wangxianghu

@wangxianghu wangxianghu commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

Today Hudi ships two partition TTL strategies, both keyed off processing time:

  • KeepByCreationTimeStrategy uses the partition creation commit time recorded in .hoodie_partition_metadata. Backfilled historical partitions look "newly created" and never expire.
  • KeepByTimeStrategy uses the last commit time that touched the partition. Any late-arriving write extends the partition's lifetime, silently disabling TTL.

In practice, what most users actually want is event-time semantics: the date encoded in the partition path (e.g. dt=2026-04-24) is the business event date, and anything older than N days should be dropped —
regardless of when/how writes land.

This PR introduces a new strategy KeepByEventTimeStrategy that derives the partition's reference time from the partition path itself, decoupling TTL from write behavior.

Summary and Changelog

For users: a new KEEP_BY_EVENT_TIME partition TTL strategy that expires partitions based on the date encoded in the partition path. TTL is no longer affected by backfill, late-arriving data, or commit timing.

Supported partition path shapes (v1 scope):

Day, format=yyyy-MM-dd
time only: 2026-06-27, dt=2026-06-27 — startIndex 0
prefix + time: region=us/2026-06-27, region=us/dt=2026-06-27 — startIndex 1
time + suffix: 2026-06-27/source=app, dt=2026-06-27/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-06-27/source=app — startIndex 1

Day, format=yyyyMMdd
time only: 20260627, dt=20260627 — startIndex 0
prefix + time: region=us/20260627, region=us/dt=20260627 — startIndex 1
time + suffix: 20260627/source=app, dt=20260627/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260627/source=app — startIndex 1

Hour, format=yyyy-MM-dd/HH
time only: 2026-06-27/12, dt=2026-04-05/hh=12 — startIndex 0
prefix + time: region=us/2026-06-27/12, region=us/dt=2026-04-05/hh=12 — startIndex 1
time + suffix: 2026-06-27/12/source=app, dt=2026-04-05/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-04-05/hh=12/source=app — startIndex 1

Hour, format=yyyyMMdd/HH
time only: 20260627/12, dt=20260405/hh=12 — startIndex 0
prefix + time: region=us/20260627/12, region=us/dt=20260405/hh=12 — startIndex 1
time + suffix: 20260627/12/source=app, dt=20260405/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260405/hh=12/source=app — startIndex 1

Hive-style key names are not constrained: dt=, day=, event_date=, hh=, hour= all work; only the value after = is parsed.

Hive-style is honored from the table's hoodie.datasource.write.hive_style_partitioning flag (authoritative table-level setting, not guessed per-segment).

Impact

  • Public API: none broken. New enum value, three new configs (all with defaults), one new class — strictly additive.
  • User-facing: a new opt-in TTL strategy. Existing strategies and configs are untouched. Users adopt it by setting hoodie.partition.ttl.management.strategy.type=KEEP_BY_EVENT_TIME.
  • Performance: negligible — partition path parsing is in-memory string ops, runs only inside the existing TTL replace commit, no extra metadata reads.

Risk Level

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives the partition's reference time from the date encoded in the partition path, decoupling TTL from write/commit timing. The change is cleanly additive (new enum value, three defaulted configs, one new class) and comes with good unit coverage of the parsing logic. A couple of edge cases worth double-checking in the inline comments — primarily the timezone handling and the fail-fast behavior across the partition set. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

@wangxianghu wangxianghu force-pushed the event_time_ttl branch 2 times, most recently from c8df5d8 to d64fbc2 Compare June 27, 2026 04:20
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 27, 2026
@wangxianghu

Copy link
Copy Markdown
Contributor Author

hi @stream2000 do you have time to take a look ?

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a new KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the date encoded in its path rather than from commit metadata, which is a useful complement to the existing commit-time strategies. The path-parsing logic is well-covered by the parameterized tests. A couple of correctness edge cases are worth double-checking in the inline comments — particularly how the inherited retention-to-millis math behaves at larger day values, and the multi-segment hive-default handling. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One minor style inconsistency in the new strategy class — the rest of the implementation is clean and well-documented.

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback! The latest change adds the three java.time imports (DateTimeException, Instant, LocalDate) and swaps the fully-qualified names in parseEventMillis for the simple names, which fully resolves the import-consistency nit from the last pass. The imports are valid with no naming conflicts and the method logic is unchanged, so the new changes look clean and introduce no new issues.

For tracking, a few prior threads remain open but are not re-raised here: the inherited ttlInMilis 32-bit overflow in the parent KeepByTimeStrategy is still present in code, but you've indicated it's fixed in PR #19075 and will be picked up on rebase — that's a reasonable plan; the question about the per-segment __HIVE_DEFAULT_PARTITION__ check dropping otherwise-recent multi-segment partitions, and the throw-inside-parallel-stream behavior, are still awaiting confirmation on their existing threads. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the date/hour encoded in its path rather than commit metadata, plus a UTC-aware parseDateFromInstantTime overload so the cutoff and parsed event time share one axis. I traced the parsing paths (day vs. hour formats, hive-style vs. raw segments, the HIVE_DEFAULT_PARTITION branch, and start-index/segment-count bounds) and did not surface a new correctness issue in this pass; the parallel filter is safe since the comparison is a pure static function over an immutable formatter. The correctness items still worth closing out (the inherited 32-bit ttlInMilis overflow in KeepByTimeStrategy, which bites the large retention windows this feature targets, and the UTC vs. hoodie.table.timeline.timezone alignment) are already captured in the existing inline threads and tracked for the #19075 rebase, so I have not re-flagged them. A Hudi committer or PMC member can take it from here for a final review. One minor naming-consistency suggestion on the new builder method; otherwise the code is clean and well-documented.

@wangxianghu wangxianghu force-pushed the event_time_ttl branch 2 times, most recently from 4c61db1 to b3b73b7 Compare July 1, 2026 01:55

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KeepByEventTimeStrategy that derives partition TTL from the event date encoded in the partition path, plus a UTC-aware overload of parseDateFromInstantTime. The main thing worth double-checking is the hour-granularity parse path flagged inline — it looks like it may fall through to a start-of-day fallback and drop the hour, which would undercut the hour-format support this PR advertises. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor naming and consistency suggestions below.

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KeepByEventTimeStrategy that derives partition TTL from the date encoded in the partition path (via a configurable format + start-index), decoupling expiry from write/commit timing. I traced the parsing logic across all documented day/hour shapes, the hive-style handling, the default-partition branch, the cutoff comparison, and the factory/config wiring, and independently confirmed the key technical point from earlier rounds — Instant.from(parsed) does resolve for zone-less hour patterns when the formatter carries withZone(UTC). No new correctness issues surfaced in this automated pass; the previously-flagged items (UTC cutoff drift, parallel-stream fail-fast, default-partition deletion semantics) appear addressed or intentionally documented. A Hudi committer or PMC member can take it from here for a final review. A couple of minor naming/consistency suggestions — hardcoded config-key strings in error messages and a logger-convention mismatch; otherwise the code is clean and well-documented.

cc @yihua

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KeepByEventTimeStrategy that derives partition TTL from the event date encoded in the partition path, decoupling expiry from commit timing. I traced the parsing logic across the documented path shapes (day/hour, hive/non-hive, prefix/suffix), the UTC cutoff derivation, and the HoodieInstantTimeGenerator overload, and confirmed the earlier ttlInMilis overflow concern is resolved by the base-class fix already in main. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of naming/consistency suggestions below; overall the code is clean and unusually well documented.

cc @yihua

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's lifetime from the date encoded in its path rather than from commit metadata, plus a UTC-aware parseDateFromInstantTime overload to keep the cutoff and event time on the same axis. I traced the full flow (Spark executor → inherited getExpiredPartitionPaths → the overridden event-time filter → parsing helpers), and checked thread-safety of the parallel stream, the factory wiring, config defaults, and backward compatibility of the enum/config/overload additions. The substantive items from earlier rounds (UTC vs. timeline-timezone skew, hard-abort on an unparseable partition, and default-partition handling) already have discussion threads, so I'm not re-raising them. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of small naming and API-ergonomics nits below; otherwise the code is clean and well-documented.

cc @yihua

@wangxianghu wangxianghu force-pushed the event_time_ttl branch 2 times, most recently from cca9539 to 079bc1a Compare July 2, 2026 07:57

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's lifetime from the date/hour encoded in the partition path (in UTC), decoupling expiry from commit timing, backfill, and late-arriving writes. I traced the dispatch wiring (enum/factory/reflection constructor), the shared parseDateFromInstantTime overload (legacy no-arg behavior preserved), the UTC cutoff-vs-event-time alignment, the hour-format parsing path (Instant.from correctly resolves yyyy-MM-dd/HH via the override zone, so no silent midnight collapse), and the parse failure/hive-default handling (all fail-safe — they throw rather than delete wrong data). No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of minor naming nits, but overall the code is clean and well-documented.

cc @yihua

@hudi-bot

hudi-bot commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a new KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the event time encoded in its path (day/hour shapes, hive and non-hive), rather than from commit metadata. I traced the parent guard flow, the UTC cutoff derivation, the date/hour parsing (including the Instant.from vs LocalDate fallback), and the __HIVE_DEFAULT_PARTITION__ handling, and the fixes from earlier rounds (day-retain overflow, UTC-consistent cutoff, single post-loop default-segment check) appear to be in place. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of boolean naming nits below; otherwise the code is clean and the documentation is excellent.

cc @yihua

private final boolean deleteHiveDefaultPartition;
private final boolean hiveStylePartitioning;

public KeepByEventTimeStrategy(HoodieTable hoodieTable, String instantTime) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you rename this to shouldDeleteHiveDefaultPartition? Boolean fields in Hudi conventionally read as predicates (is/has/should), so if (deleteHiveDefaultPartition) reads a little like a method call rather than a state check.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

}

public boolean getPartitionTTLEventTimeDeleteHiveDefaultPartition() {
return getBooleanOrDefault(HoodieTTLConfig.EVENT_TIME_DELETE_HIVE_DEFAULT_PARTITION);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: boolean getters in Hudi typically use an is/should prefix — could you rename this to shouldPartitionTTLEventTimeDeleteHiveDefaultPartition() (or a shorter variant)? get doesn't signal the return type at the call site the way a predicate prefix does.

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants