feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time#19081
feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time#19081wangxianghu wants to merge 1 commit into
Conversation
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives the partition's reference time from the date encoded in the partition path, decoupling TTL from write/commit timing. The change is cleanly additive (new enum value, three defaulted configs, one new class) and comes with good unit coverage of the parsing logic. A couple of edge cases worth double-checking in the inline comments — primarily the timezone handling and the fail-fast behavior across the partition set. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
c8df5d8 to
d64fbc2
Compare
16d6d0b to
b301702
Compare
|
hi @stream2000 do you have time to take a look ? |
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a new KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the date encoded in its path rather than from commit metadata, which is a useful complement to the existing commit-time strategies. The path-parsing logic is well-covered by the parameterized tests. A couple of correctness edge cases are worth double-checking in the inline comments — particularly how the inherited retention-to-millis math behaves at larger day values, and the multi-segment hive-default handling. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One minor style inconsistency in the new strategy class — the rest of the implementation is clean and well-documented.
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for addressing the feedback! The latest change adds the three java.time imports (DateTimeException, Instant, LocalDate) and swaps the fully-qualified names in parseEventMillis for the simple names, which fully resolves the import-consistency nit from the last pass. The imports are valid with no naming conflicts and the method logic is unchanged, so the new changes look clean and introduce no new issues.
For tracking, a few prior threads remain open but are not re-raised here: the inherited ttlInMilis 32-bit overflow in the parent KeepByTimeStrategy is still present in code, but you've indicated it's fixed in PR #19075 and will be picked up on rebase — that's a reasonable plan; the question about the per-segment __HIVE_DEFAULT_PARTITION__ check dropping otherwise-recent multi-segment partitions, and the throw-inside-parallel-stream behavior, are still awaiting confirmation on their existing threads. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the date/hour encoded in its path rather than commit metadata, plus a UTC-aware parseDateFromInstantTime overload so the cutoff and parsed event time share one axis. I traced the parsing paths (day vs. hour formats, hive-style vs. raw segments, the HIVE_DEFAULT_PARTITION branch, and start-index/segment-count bounds) and did not surface a new correctness issue in this pass; the parallel filter is safe since the comparison is a pure static function over an immutable formatter. The correctness items still worth closing out (the inherited 32-bit ttlInMilis overflow in KeepByTimeStrategy, which bites the large retention windows this feature targets, and the UTC vs. hoodie.table.timeline.timezone alignment) are already captured in the existing inline threads and tracked for the #19075 rebase, so I have not re-flagged them. A Hudi committer or PMC member can take it from here for a final review. One minor naming-consistency suggestion on the new builder method; otherwise the code is clean and well-documented.
4c61db1 to
b3b73b7
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a KeepByEventTimeStrategy that derives partition TTL from the event date encoded in the partition path, plus a UTC-aware overload of parseDateFromInstantTime. The main thing worth double-checking is the hour-granularity parse path flagged inline — it looks like it may fall through to a start-of-day fallback and drop the hour, which would undercut the hour-format support this PR advertises. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor naming and consistency suggestions below.
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a KeepByEventTimeStrategy that derives partition TTL from the date encoded in the partition path (via a configurable format + start-index), decoupling expiry from write/commit timing. I traced the parsing logic across all documented day/hour shapes, the hive-style handling, the default-partition branch, the cutoff comparison, and the factory/config wiring, and independently confirmed the key technical point from earlier rounds — Instant.from(parsed) does resolve for zone-less hour patterns when the formatter carries withZone(UTC). No new correctness issues surfaced in this automated pass; the previously-flagged items (UTC cutoff drift, parallel-stream fail-fast, default-partition deletion semantics) appear addressed or intentionally documented. A Hudi committer or PMC member can take it from here for a final review. A couple of minor naming/consistency suggestions — hardcoded config-key strings in error messages and a logger-convention mismatch; otherwise the code is clean and well-documented.
cc @yihua
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a KeepByEventTimeStrategy that derives partition TTL from the event date encoded in the partition path, decoupling expiry from commit timing. I traced the parsing logic across the documented path shapes (day/hour, hive/non-hive, prefix/suffix), the UTC cutoff derivation, and the HoodieInstantTimeGenerator overload, and confirmed the earlier ttlInMilis overflow concern is resolved by the base-class fix already in main. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of naming/consistency suggestions below; overall the code is clean and unusually well documented.
cc @yihua
30f3811 to
9a0dd0a
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's lifetime from the date encoded in its path rather than from commit metadata, plus a UTC-aware parseDateFromInstantTime overload to keep the cutoff and event time on the same axis. I traced the full flow (Spark executor → inherited getExpiredPartitionPaths → the overridden event-time filter → parsing helpers), and checked thread-safety of the parallel stream, the factory wiring, config defaults, and backward compatibility of the enum/config/overload additions. The substantive items from earlier rounds (UTC vs. timeline-timezone skew, hard-abort on an unparseable partition, and default-partition handling) already have discussion threads, so I'm not re-raising them. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of small naming and API-ergonomics nits below; otherwise the code is clean and well-documented.
cc @yihua
cca9539 to
079bc1a
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for the contribution! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's lifetime from the date/hour encoded in the partition path (in UTC), decoupling expiry from commit timing, backfill, and late-arriving writes. I traced the dispatch wiring (enum/factory/reflection constructor), the shared parseDateFromInstantTime overload (legacy no-arg behavior preserved), the UTC cutoff-vs-event-time alignment, the hour-format parsing path (Instant.from correctly resolves yyyy-MM-dd/HH via the override zone, so no silent midnight collapse), and the parse failure/hive-default handling (all fail-safe — they throw rather than delete wrong data). No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of minor naming nits, but overall the code is clean and well-documented.
cc @yihua
…itions by event time
ba37c5a to
bc98b2d
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
Thanks for working on this! This PR adds a new KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the event time encoded in its path (day/hour shapes, hive and non-hive), rather than from commit metadata. I traced the parent guard flow, the UTC cutoff derivation, the date/hour parsing (including the Instant.from vs LocalDate fallback), and the __HIVE_DEFAULT_PARTITION__ handling, and the fixes from earlier rounds (day-retain overflow, UTC-consistent cutoff, single post-loop default-segment check) appear to be in place. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of boolean naming nits below; otherwise the code is clean and the documentation is excellent.
cc @yihua
| private final boolean deleteHiveDefaultPartition; | ||
| private final boolean hiveStylePartitioning; | ||
|
|
||
| public KeepByEventTimeStrategy(HoodieTable hoodieTable, String instantTime) { |
There was a problem hiding this comment.
🤖 nit: could you rename this to shouldDeleteHiveDefaultPartition? Boolean fields in Hudi conventionally read as predicates (is/has/should), so if (deleteHiveDefaultPartition) reads a little like a method call rather than a state check.
| } | ||
|
|
||
| public boolean getPartitionTTLEventTimeDeleteHiveDefaultPartition() { | ||
| return getBooleanOrDefault(HoodieTTLConfig.EVENT_TIME_DELETE_HIVE_DEFAULT_PARTITION); |
There was a problem hiding this comment.
🤖 nit: boolean getters in Hudi typically use an is/should prefix — could you rename this to shouldPartitionTTLEventTimeDeleteHiveDefaultPartition() (or a shorter variant)? get doesn't signal the return type at the call site the way a predicate prefix does.
Describe the issue this Pull Request addresses
Today Hudi ships two partition TTL strategies, both keyed off processing time:
In practice, what most users actually want is event-time semantics: the date encoded in the partition path (e.g. dt=2026-04-24) is the business event date, and anything older than N days should be dropped —
regardless of when/how writes land.
This PR introduces a new strategy KeepByEventTimeStrategy that derives the partition's reference time from the partition path itself, decoupling TTL from write behavior.
Summary and Changelog
For users: a new KEEP_BY_EVENT_TIME partition TTL strategy that expires partitions based on the date encoded in the partition path. TTL is no longer affected by backfill, late-arriving data, or commit timing.
Supported partition path shapes (v1 scope):
Day, format=yyyy-MM-dd
time only: 2026-06-27, dt=2026-06-27 — startIndex 0
prefix + time: region=us/2026-06-27, region=us/dt=2026-06-27 — startIndex 1
time + suffix: 2026-06-27/source=app, dt=2026-06-27/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-06-27/source=app — startIndex 1
Day, format=yyyyMMdd
time only: 20260627, dt=20260627 — startIndex 0
prefix + time: region=us/20260627, region=us/dt=20260627 — startIndex 1
time + suffix: 20260627/source=app, dt=20260627/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260627/source=app — startIndex 1
Hour, format=yyyy-MM-dd/HH
time only: 2026-06-27/12, dt=2026-04-05/hh=12 — startIndex 0
prefix + time: region=us/2026-06-27/12, region=us/dt=2026-04-05/hh=12 — startIndex 1
time + suffix: 2026-06-27/12/source=app, dt=2026-04-05/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-04-05/hh=12/source=app — startIndex 1
Hour, format=yyyyMMdd/HH
time only: 20260627/12, dt=20260405/hh=12 — startIndex 0
prefix + time: region=us/20260627/12, region=us/dt=20260405/hh=12 — startIndex 1
time + suffix: 20260627/12/source=app, dt=20260405/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260405/hh=12/source=app — startIndex 1
Hive-style key names are not constrained: dt=, day=, event_date=, hh=, hour= all work; only the value after = is parsed.
Hive-style is honored from the table's hoodie.datasource.write.hive_style_partitioning flag (authoritative table-level setting, not guessed per-segment).
Impact
Risk Level
low
Documentation Update
none
Contributor's checklist