Skip to content

feat: add scheduler observability metrics for missed actions#10503

Merged
chaptersix merged 7 commits into
temporalio:mainfrom
chaptersix:sch-callback-latency
Jun 5, 2026
Merged

feat: add scheduler observability metrics for missed actions#10503
chaptersix merged 7 commits into
temporalio:mainfrom
chaptersix:sch-callback-latency

Conversation

@chaptersix
Copy link
Copy Markdown
Contributor

@chaptersix chaptersix commented Jun 3, 2026

Summary

Add and improve scheduler metrics to detect missed actions caused by system instability, slow callback delivery, and overlap policy behavior. Changes apply to both v1 (workflow-based) and v2 (CHASM) schedulers.

What is a missed action?

A scheduled action is "missed" when the scheduler decides not to execute an action that was due according to the schedule spec. There are three scenarios where this happens:

  1. Catchup window exceeded. The scheduler was delayed in processing and the action's scheduled execution time is now older than the configured catchup window. The action is permanently dropped. This can be caused by a previous action still running or by system instability (worker downtime, shard movement, task queue backlog). Tracked by schedule_missed_catchup_window.

  2. Overlap policy skip. A previous action is still running and the overlap policy is SKIP (or BUFFER_ONE with a full buffer). The scheduler intentionally drops the action because the policy says not to start concurrent runs. This is expected behavior, not a system issue -- but a high skip rate may indicate action duration exceeds the schedule interval. Tracked by schedule_overlap_skipped.

  3. Buffer overrun. The internal action buffer is full (bounded by MaxBufferSize). New actions are dropped regardless of overlap policy. Tracked by schedule_buffer_overruns.

Scenarios 1 and 2 are the focus of this PR. Scenario 3 already had metrics.

Catchup window vs. overlap policy: v1 and v2 differences

An action is only ever counted in one of the above scenarios, but the evaluation order and the possible outcomes differ between v1 and v2.

Both v1 and v2 have the same two-phase structure: a generation phase that checks the catchup window and buffers actions, followed by a buffer processing phase that evaluates overlap policy and executes actions. The difference is in what happens to buffered actions that can't be executed immediately.

V1 (workflow-based):

  1. processTimeRange (generation) checks the catchup window. Actions whose nominal time is older than the window are dropped (schedule_missed_catchup_window{reason="not_buffered"}). Surviving actions are added to the buffer.
  2. processBuffer (invocation) refreshes running action state via PollMutableState, then evaluates overlap policy on the entire buffer, including actions deferred on previous ticks. Actions are either started, skipped by overlap policy (schedule_overlap_skipped), or deferred (BUFFER_ONE/BUFFER_ALL).

V1 has no reason=buffer_expired case because it never re-checks the catchup window on buffered actions. Once an action passes the initial catchup window check and enters the buffer, it will never be dropped for being too old. If a BUFFER_ONE schedule has a long-running action, the deferred action waits indefinitely across ticks until the running action completes, at which point it is started regardless of how much time has passed since its original schedule time.

V2 (CHASM):

  1. The generator task checks the catchup window. Same as v1 -- actions past the window are dropped (schedule_missed_catchup_window{reason="not_buffered"}). Surviving actions are buffered.
  2. The invoker's ProcessBuffer task evaluates overlap policy. Actions are either marked for execution, skipped by overlap (schedule_overlap_skipped), or deferred (BUFFER_ONE with a running action sets Attempt = -1).
  3. Actions that survive overlap policy evaluation are checked against the catchup window a second time. If the action was buffered in time but has now expired -- due to the gap between CHASM tasks, overlap policy deferral, or start retries -- it is dropped (schedule_missed_catchup_window{reason="buffer_expired"}).

This second deadline check is unique to v2. It means a BUFFER_ONE action that waits too long for a slot can be dropped, unlike v1 where it would wait indefinitely. The action_running tag on reason=buffer_expired distinguishes whether the expiration was caused by a long-running action blocking the slot (action_running=true) or by system delay in CHASM task processing (action_running=false). For starts deferred by BUFFER_ONE/BUFFER_ALL, action_running=true is preserved when the start expires because it waited behind a running action, even if that action has completed by the time the metric is emitted.

New metrics

schedule_overlap_skipped (counter)

Tags: namespace, scheduler_backend, schedule_overlap_policy

Actions are skipped when the overlap policy prevents starting a new action while a previous one is still running. Today this is only tracked as a running total in Info.OverlapSkipped (visible via DescribeSchedule) with no time-series visibility.

This counter is emitted for each action dropped by ProcessBuffer due to overlap policy. The schedule_overlap_policy tag carries the overlap policy that was actually applied when that action was evaluated (e.g., SCHEDULE_OVERLAP_POLICY_SKIP or SCHEDULE_OVERLAP_POLICY_BUFFER_ONE).

One subtlety is that buffered actions do not all resolve overlap policy the same way:

  • v1: normal scheduled actions are buffered with UNSPECIFIED and resolved later in ProcessBuffer against the schedule's current policy. If UpdateSchedule changes the overlap policy while actions are already buffered, those existing buffered scheduled actions will be evaluated under the new policy.
  • v2: the generator resolves overlap policy before buffering and stores the resolved policy on each BufferedStart. If UpdateSchedule changes the overlap policy later, already-buffered scheduled actions keep the policy they were generated with.

This distinction is why the metric is emitted per skipped action using the policy that was actually applied at evaluation time, rather than once per ProcessBuffer pass with a single schedule-level tag.

How to use:

  • rate(schedule_overlap_skipped[5m]) shows how often actions are being dropped due to overlap.
  • Group by schedule_overlap_policy to see which policies are causing the most skips.
  • Compare with schedule_action_e2e_delay to see whether skipped actions correlate with actions whose duration exceeds the schedule interval, pushing the total schedule-to-start delay higher.

Emitted in v1 (processBuffer) and v2 (InvokerProcessBufferTaskHandler.Execute).

schedule_callback_latency (timer, v2 only)

Tags: namespace, scheduler_backend

The v2 scheduler learns about action completions via Nexus completion callbacks. This timer records time.Since(closeTime) when HandleNexusCompletion processes a callback, measuring the end-to-end delay from action completion to the scheduler receiving the result.

This latency includes: callback task creation, callback task queue processing, cross-shard RPC delivery, and handler execution.

Why v2 only: The v1 and v2 schedulers use fundamentally different mechanisms to discover action completions, and only v2's is worth measuring independently.

In v1, completion discovery is tick-driven. The scheduler wakes on each scheduled time and does a synchronous PollMutableState refresh of all running actions before making overlap decisions. For a SKIP schedule with a 1s interval, the scheduler discovers completions at most 1s after they happen (on the next tick). The long-poll watcher activity exists but is only useful for BUFFER_ONE/BUFFER_ALL where starts sit in the buffer waiting for a completion -- for SKIP, there are no buffered starts and the watcher is irrelevant. The completion discovery latency in v1 is inherently bounded by the schedule interval itself, so there's nothing additional to measure.

In v2, completion discovery is event-driven via Nexus callbacks. The scheduler has no periodic refresh; it only learns about completions when the callback is delivered. This means callback delivery latency directly gates when the next action can start and when overlap policies are re-evaluated. A slow callback pipeline silently degrades scheduler throughput in a way that has no v1 equivalent.

How to use:

  • High values indicate the callback delivery pipeline is slow, which directly delays the next action for schedules using BUFFER_ONE or BUFFER_ALL policies.
  • Compare with schedule_action_delay to understand how much of the total action delay is attributable to callback delivery vs. other factors.
  • Useful for diagnosing scenarios where v2 schedules appear to "stall" between actions.

Emitted in Scheduler.HandleNexusCompletion.

Modified metrics

schedule_generate_latency (timer) -- added to v1, fixed in both

Previously v2-only. This timer measures how late the scheduler is in processing a due action: now - scheduledExecutionTime (the action's ActualTime, after jitter).

Changes:

  1. Added to v1 scheduler. The v1 processTimeRange now emits this metric, matching the existing v2 behavior.
  2. Emit only for the first action per batch. When the scheduler catches up over a large time range (e.g., after being down), it processes many due actions in a single pass. Previously every action got its own latency sample against the same now. Now only the first action is recorded, representing the actual system delay.
  3. Moved before the catchup window check. Previously (v2) the metric was emitted after the catchup window check, meaning the worst latencies -- the ones that actually miss the window -- were never recorded. Now it's emitted before, so all latencies including missed actions are captured.

How to use:

  • histogram_quantile(0.99, schedule_generate_latency) shows worst-case scheduling delay.
  • A rising trend is an early warning of system instability before actions start being dropped.

schedule_missed_catchup_window (counter) -- added reason and action_running tags

Actions can miss the catchup window at two points in the v2 scheduler:

  1. Generator: the action's scheduled execution time is older than the catchup window when the generator processes the time range. The action is never buffered.
  2. Invoker (process buffer): the action was buffered in time but then sat too long waiting for execution (e.g., due to overlap policy deferral, start retries, or system delay between the gen task and process buffer task).

Previously both cases emitted the same untagged counter, making it impossible to distinguish root cause. Now the counter carries two new tags:

reason tag:

  • reason=not_buffered -- the action's scheduled execution time was already past the catchup window when the generator processed the time range. It was never buffered for execution. Present in both v1 and v2.
  • reason=buffer_expired -- the action was buffered in time but expired before execution (e.g., due to overlap deferral, retries, or system delay between CHASM tasks). V2 only.

V1 only has the not_buffered case because generation and buffer processing happen synchronously in the same workflow task -- there is no gap where a buffered start can sit and expire.

action_running tag:

  • action_running=true -- a previous action was running when the buffered start was deferred, or was still running when the start was dropped. This indicates overlap wait contributed to the missed action.
  • action_running=false -- no action was blocking the start when it expired, so the miss was due to other factors (system delay, scheduler processing time).

Present only on reason=buffer_expired (v2). Not added to reason=not_buffered because: in v2, the generator doesn't have access to the invoker's running action state; in v1, running action state is stale at the point where the catchup window check runs (the refresh happens later in processBuffer).

How to use:

  • schedule_missed_catchup_window{reason="not_buffered"} means the scheduler was too late to process the action. This check is independent of whether a previous action is running.
  • schedule_missed_catchup_window{reason="buffer_expired"} indicates actions were buffered on time but the invoker was delayed in executing them. V2 only.
  • schedule_missed_catchup_window{reason="buffer_expired", action_running="true"} narrows the cause to a long-running previous action blocking execution of buffered starts, including starts that expire after the previous action completes but before they can be executed. For v2 it's possible that completion callbacks are not being delivered in time.
  • schedule_missed_catchup_window{reason="buffer_expired", action_running="false"} means no previous action was blocking execution -- the delay was in CHASM task processing between the generator and invoker.
  • Comparing the two rates shows whether the bottleneck is in waking the scheduler vs. executing buffered actions.

Existing metrics (not modified, included for context)

schedule_action_delay (timer)

Tags: namespace, scheduler_backend, schedule_action

Measures the time between when an action became eligible for execution and when it was actually started: actualStartTime - desiredTime.

desiredTime is either:

  • The action's scheduled execution time (ActualTime), for the first action after the scheduler wakes up.
  • The CloseTime of the previous action, for subsequent buffered starts (set in processWatcherResult in v1, recordCompletedAction in v2). This means that for BUFFER_ONE/BUFFER_ALL, the delay measures only the gap between the previous action completing and the next one starting, not the total wait from the original schedule time.

This metric captures the delay from "action eligible" to "action started," which includes buffer processing, rate limiting, and the actual start RPC. However, it does not capture time spent waiting for a previous action to complete under BUFFER_ONE/BUFFER_ALL overlap policies, because DesiredTime is reset to the previous action's CloseTime when it completes (in processWatcherResult for v1, recordCompletedAction for v2). This means for buffered starts, schedule_action_delay measures only the gap between the slot opening and the next action starting -- not the total wall-clock wait from the original schedule time.

For the full wall-clock delay including overlap policy wait, use schedule_action_e2e_delay.

Relationship to other metrics:

  • schedule_generate_latency measures a subset: how late the scheduler was in waking up to process the action. This is always <= schedule_action_delay.
  • schedule_callback_latency (v2) measures a different leg: how long it took to learn about the previous action's completion. For BUFFER_ONE, high callback latency contributes to high action delay on the next start.
  • schedule_action_delay is the metric most directly visible to users as "how fast did we start after the slot opened."
  • schedule_action_e2e_delay is the metric most directly visible to users as "how late was my scheduled action" -- it includes overlap policy wait time that schedule_action_delay excludes.

schedule_action_e2e_delay (timer) -- NEW

Tags: namespace, scheduler_backend, schedule_action

Measures the total wall-clock delay from the action's original schedule time to when it was actually started: actualStartTime - ActualTime. Unlike schedule_action_delay, this metric always uses the original schedule time as the baseline and is never reset when a previous action completes.

This means for BUFFER_ONE/BUFFER_ALL schedules where an action was waiting in the buffer for a previous one to finish, this metric includes the full overlap policy wait time. For the first action after the scheduler wakes (where DesiredTime has not been reset), this metric will equal schedule_action_delay.

How to use:

  • Compare with schedule_action_delay to isolate how much delay is caused by overlap policy wait vs. other factors. If schedule_action_e2e_delay is high but schedule_action_delay is low, the bottleneck is the previous action's duration.
  • histogram_quantile(0.99, schedule_action_e2e_delay) shows worst-case total delay from schedule time to actual start, as experienced by users.

Emitted in v1 (startWorkflow) and v2 (InvokerExecuteTaskHandler.startWorkflow).

Detecting missed/delayed actions due to system instability

These metrics form a pipeline from early warning to confirmed impact. The available signals differ between v1 and v2 due to their architectural differences.

V1 (workflow-based scheduler)

The v1 scheduler runs as a single workflow. Generation and buffer processing happen synchronously in the same workflow task. The scheduler discovers action completions via a synchronous PollMutableState refresh on each tick.

Available metrics: schedule_generate_latency, schedule_action_delay, schedule_action_e2e_delay, schedule_missed_catchup_window{reason="not_buffered"}, schedule_overlap_skipped

Diagnosis flow:

  1. Early warning: schedule_generate_latency
    The first metric to move. Measures how late the scheduler is in processing due actions. Under normal conditions this should be near-zero. A rising trend means the scheduler is delayed in processing -- the worker is under load, matching is slow, or the task queue is backed up.

    histogram_quantile(0.99, rate(schedule_generate_latency_bucket{scheduler_backend="legacy"}[5m]))
    
  2. Confirmed drops: schedule_missed_catchup_window{reason="not_buffered"}
    Once generate latency exceeds the catchup window, actions are permanently lost. In v1 this is always reason=not_buffered -- generation and buffer processing are synchronous, so there is no gap where a buffered start can expire independently.

    rate(schedule_missed_catchup_window{scheduler_backend="legacy", reason="not_buffered"}[5m])
    
  3. User-facing impact: schedule_action_delay and schedule_action_e2e_delay
    schedule_action_delay measures the delay from when the action became eligible (after any overlap wait) to when it started. Absent high schedule_generate_latency, a high schedule_action_delay points to buffer processing -- rate limiting or the start RPC itself.

    schedule_action_e2e_delay measures total delay from the original schedule time, including overlap policy wait. Absent high schedule_action_delay, a high schedule_action_e2e_delay indicates the bottleneck is previous action duration, not system instability.

  4. Overlap context: schedule_overlap_skipped
    Absent high schedule_callback_latency (v2), a high skip rate suggests actions are taking longer than the schedule interval (application-level issue). In v2, delayed callback delivery can also cause skips -- the scheduler still sees the previous action as running even though it has completed, because the completion callback hasn't arrived yet.

V2 (CHASM scheduler)

The v2 scheduler splits work across independent CHASM tasks: a generator task buffers actions, and an invoker process-buffer task resolves and executes them. The scheduler learns about action completions via Nexus callbacks, not polling. This creates additional failure modes between the generator and invoker, and in the callback delivery pipeline.

Available metrics: schedule_generate_latency, schedule_action_delay, schedule_action_e2e_delay, schedule_missed_catchup_window{reason="not_buffered"|"buffer_expired"}, schedule_callback_latency, schedule_overlap_skipped

Diagnosis flow:

  1. Early warning: schedule_generate_latency
    Same as v1 -- measures how late the generator task is in processing due actions.

    histogram_quantile(0.99, rate(schedule_generate_latency_bucket{scheduler_backend="chasm"}[5m]))
    
  2. Confirmed drops: schedule_missed_catchup_window
    The reason tag identifies where the drop occurred:

    • reason=not_buffered: The generator was too late to buffer the action. Same as v1 -- the clearest signal that the system is to blame.
    • reason=buffer_expired: The action was buffered on time but expired waiting for execution. The action_running tag distinguishes cause:
      • action_running=true: A previous action was still running, or had deferred the buffered start long enough that it expired before execution. Absent high schedule_callback_latency, this points to action duration exceeding the catchup window (application-level issue). If schedule_callback_latency is also high, delayed callback delivery may be the actual cause -- the previous action completed but the scheduler hasn't learned about it yet.
      • action_running=false: No action was blocking it. The delay was in CHASM task processing between the generator and invoker -- system instability.
    # system-caused drops at the generator (strongest signal)
    rate(schedule_missed_catchup_window{scheduler_backend="chasm", reason="not_buffered"}[5m])
    
    # system-caused drops at the invoker (no action blocking)
    rate(schedule_missed_catchup_window{scheduler_backend="chasm", reason="buffer_expired", action_running="false"}[5m])
    
    # drops caused by long-running actions
    rate(schedule_missed_catchup_window{scheduler_backend="chasm", reason="buffer_expired", action_running="true"}[5m])
    
  3. Callback pipeline health: schedule_callback_latency
    For schedules using BUFFER_ONE or BUFFER_ALL, the callback pipeline is on the critical path between one action completing and the next starting. Absent high schedule_generate_latency, a high schedule_callback_latency indicates the callback delivery pipeline is the bottleneck. A slow callback pipeline means the scheduler doesn't learn about completions promptly, delaying the next buffered action and potentially causing overlap skips or buffer expirations.

    histogram_quantile(0.99, rate(schedule_callback_latency_bucket[5m]))
    
  4. User-facing impact: schedule_action_delay and schedule_action_e2e_delay
    schedule_action_delay measures delay from when the action became eligible to when it started. schedule_action_e2e_delay measures total delay from the original schedule time. If the latter is high, work backwards:

    • Is schedule_generate_latency high? -> Generator task is delayed (system instability).
    • Is schedule_callback_latency high? -> Completion callbacks are slow (system instability in the callback pipeline).
    • Is schedule_missed_catchup_window{reason="buffer_expired", action_running="false"} firing? -> Invoker task is delayed (system instability between CHASM tasks).
    • Is schedule_action_e2e_delay high but schedule_action_delay low? -> Previous action duration is the bottleneck (application-level issue).
    • Is schedule_overlap_skipped spiking? -> Overlap policy is dropping actions because previous actions run too long (application-level issue).
  5. Overlap context: schedule_overlap_skipped
    Same as v1. Absent high schedule_callback_latency, a high skip rate suggests action duration exceeds the schedule interval (application-level issue). If schedule_callback_latency is also high, the skips may be caused by delayed completion callbacks rather than actual action duration.

Detecting missed/delayed actions due to schedule configuration

Actions can be missed or delayed due to the schedule's own configuration -- for example, when action duration exceeds the schedule interval. These are not caused by system instability but may still require attention.

Symptoms:

  • schedule_overlap_skipped is elevated, absent high schedule_callback_latency (v2). Actions are being dropped because the previous action hasn't completed by the time the next one is due.
  • schedule_action_e2e_delay is high but schedule_action_delay is low. The total delay from schedule time to start is large, but the scheduler started the action promptly once the slot opened. The gap is overlap policy wait.
  • schedule_missed_catchup_window{reason="buffer_expired", action_running="true"} is firing (v2 only), absent high schedule_callback_latency. Buffered actions are expiring because they spent too long waiting behind a previous action.

Diagnosis:

  1. Compare schedule_action_e2e_delay with schedule_action_delay. A large gap between them indicates time spent waiting for overlap policy to allow execution.
  2. Check schedule_overlap_skipped grouped by schedule_overlap_policy. A high rate on SKIP or BUFFER_ONE indicates the schedule interval is shorter than the action duration.
  3. For BUFFER_ONE/BUFFER_ALL schedules, check whether schedule_missed_catchup_window{reason="buffer_expired", action_running="true"} is firing. This means buffered actions are waiting so long for the slot that they exceed the catchup window.

Resolution: These issues are addressed by changing the schedule configuration -- increasing the interval, switching to a more permissive overlap policy, or reducing action duration. They do not indicate a problem with the scheduler or the platform.

schedule_buffer_overruns (counter)

Tags: namespace, scheduler_backend

Counts actions dropped because the internal action buffer is full (bounded by MaxBufferSize).

Emitted in v1 (processTimeRange and signal handling) and v2 (GeneratorTaskHandler.Execute).

Backfills and manual triggers

Backfills and manual triggers (TriggerImmediately) generate actions with the Manual flag set. Each backfill request can specify its own overlap policy -- if unspecified, it resolves to the schedule's configured policy. Backfill requests can use ALLOW_ALL to run all backfilled actions concurrently.

Not emitted for manual/backfill actions:

  • schedule_generate_latency -- backfill actions are generated against a historical time range, so now - ActualTime is meaningless as a system delay indicator.
  • schedule_missed_catchup_window -- backfill actions skip the catchup window check entirely. The time range is intentionally in the past.
  • schedule_action_delay and schedule_action_e2e_delay -- the delay between a historical schedule time and now is not meaningful for backfills.

Emitted for manual/backfill actions:

  • schedule_overlap_skipped -- backfill actions go through ProcessBuffer and are subject to overlap policy. A backfill with SKIP policy against a running action will increment this counter. The schedule_overlap_policy tag reflects the backfill's own policy, not the schedule's configured policy.
  • schedule_buffer_overruns -- backfills can fill the action buffer, especially large backfills over wide time ranges. In v1, backfills are limited to half the buffer (MaxBufferSize/2) and processed incrementally (BackfillsPerIteration per tick). In v2, each backfill request creates a separate Backfiller component that generates actions independently.
  • schedule_callback_latency (v2) -- recorded when any action's completion callback arrives, regardless of whether it was manually triggered.

Interaction with regular actions: Whether backfills contend with regular actions depends on the backfill's overlap policy:

  • ALLOW_ALL (typical for backfills): Actions started with ALLOW_ALL are not tracked as running. This means they do not trigger overlap policy evaluation on regular actions -- a SKIP schedule will continue to start regular actions normally even while ALLOW_ALL backfill actions are running. Buffer contention is the only possible interference: in v1, backfills self-limit to half the buffer (MaxBufferSize/2) and add at most BackfillsPerIteration (default 10) per tick, with the generator running first each tick. Buffer contention is unlikely unless backfill-started actions from previous ticks accumulate in the buffer faster than they can be executed.

  • Non-ALLOW_ALL (SKIP, BUFFER_ONE, etc.): Actions started with these policies are tracked as running. A backfill-started action that is still running will cause regular actions to be skipped or deferred by overlap policy. This can show up as elevated schedule_overlap_skipped or increased schedule_action_e2e_delay on regular actions while the backfill is in progress. This is the primary contention scenario -- the backfill's running actions block the schedule's normal execution.

Glossary

  • Scheduled action: The operation the scheduler executes at each scheduled time. Currently this is always starting a workflow, but the scheduler is designed to support other primitives (e.g., activities) in the future.

  • Nominal time (NominalTime): The exact time that matches the schedule spec, before jitter is applied. Used to generate the workflow ID (ensuring deterministic naming) and the request ID. Corresponds to GetNextTimeResult.Nominal.

  • Actual time (ActualTime): The nominal time with jitter applied. This is the time the action is actually scheduled to execute. Corresponds to GetNextTimeResult.Next. Stored on BufferedStart.ActualTime.

  • Desired time (DesiredTime): The baseline time used to compute schedule_action_delay. Initially nil. When a previous action completes, DesiredTime on the next pending buffered start is set to the previous action's CloseTime. If nil at metric emission time, falls back to ActualTime. This reset means schedule_action_delay measures the gap from "slot opened" to "action started" for buffered starts, not the total wait from the original schedule time.

  • Catchup window: A configurable duration (default: 365 days, minimum: 10 seconds) that determines how old an action can be before it is permanently dropped. If now - ActualTime > catchupWindow, the action is considered too stale to execute. Checked in the generator (both v1 and v2) and again in the v2 invoker for buffered actions.

  • Overlap policy: Determines what happens when a new action is due but a previous action is still running.

    • SKIP: Drop the new action. No concurrent or queued actions. This is the default policy.
    • BUFFER_ONE: Queue at most one pending action. When the running action completes, the queued action starts. If another action comes due while one is already queued, the new one is dropped.
    • BUFFER_ALL: Queue all pending actions. Every due action is queued and will eventually execute in order after the running action completes.
    • CANCEL_OTHER: Cancel the running action and start the new one immediately.
    • TERMINATE_OTHER: Terminate the running action and start the new one immediately.
  • Generator: The phase that iterates over the schedule spec's time range, checks the catchup window, and buffers actions for execution. In v1 this is processTimeRange; in v2 this is the GeneratorTaskHandler.

  • Invoker / buffer processing: The phase that evaluates overlap policy on the buffer and executes actions. In v1 this is processBuffer; in v2 this is the InvokerProcessBufferTaskHandler followed by InvokerExecuteTaskHandler.

  • Deferred action: A buffered action that was evaluated by ProcessBuffer but could not be executed because a previous action is still running under BUFFER_ONE or BUFFER_ALL policy. The action stays in the buffer and is re-evaluated on the next processing pass. In v2, deferred actions are marked with Attempt = -1 to distinguish them from newly enqueued actions. In v1, deferred actions remain in the buffer with their original state and are re-evaluated on each tick.

Add metrics to detect missed actions caused by system instability, slow
callback delivery, and overlap policy behavior across v1 and v2 schedulers.

New metrics:
- schedule_overlap_skipped: counter with overlap policy tag
- schedule_callback_latency: v2 Nexus callback delivery latency

Modified metrics:
- schedule_generate_latency: added to v1, emit first action only, moved
  before catchup window check
- schedule_missed_catchup_window: added reason tag (not_buffered vs
  buffer_expired) and workflow_running tag (v2 buffer_expired only)
…w_running tag

- Add schedule_action_start_to_close_delay timer measuring total delay
  from original schedule time to action start, including overlap wait
- Rename workflow_running tag to action_running for consistency
- Update metric descriptions to use "action" instead of "workflow"
…ction

- Replace generatedCount int with recordedGenerateLatency bool
- Initialize OverlapSkippedByPolicy map upfront, remove nil checks
- Rename hadRunningWorkflow to actionRunning, set once in processBuffer
- Rename workflow_running tag to action_running
- Update comments to use action terminology
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands scheduler observability by adding new metrics and enriching existing ones to better detect and diagnose “missed” schedule actions, across both the legacy (workflow-based) scheduler and the CHASM (v2) scheduler.

Changes:

  • Add overlap-policy skip visibility via a new schedule_overlap_skipped counter (tagged by resolved overlap policy) in both v1 and v2.
  • Improve catchup-window miss attribution by tagging schedule_missed_catchup_window with reason (and action_running for v2 buffer-expiry cases), and ensure schedule_generate_latency is emitted earlier and only once per batch.
  • Add latency visibility for v2 completion callbacks via schedule_callback_latency, and record a new “schedule-time → actual start” timer for scheduled actions.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
service/worker/scheduler/workflow.go Adds v1 generate-latency emission, tags catchup-window misses with reason, emits overlap-skip counter by policy, and records the new total schedule-time → start delay timer.
service/worker/scheduler/buffer.go Extends buffer processing results to track overlap skips broken down by resolved overlap policy.
service/worker/scheduler/buffer_test.go Adds test assertions for the new per-policy overlap-skip tracking and a mixed-policy resolution case.
common/metrics/metric_defs.go Defines new scheduler metric/tag constants and registers new metric defs (schedule_overlap_skipped, schedule_callback_latency, and the new delay timer).
chasm/lib/scheduler/spec_processor.go Updates v2 generator to emit schedule_generate_latency once per batch and tags catchup-window misses with reason=not_buffered.
chasm/lib/scheduler/scheduler.go Emits schedule_callback_latency when processing Nexus completion callbacks.
chasm/lib/scheduler/invoker.go Extends invoker process-buffer result to track overlap skips by policy and whether an action was running during processing.
chasm/lib/scheduler/invoker_tasks.go Emits v2 overlap-skip metrics by policy and emits tagged catchup-window misses for buffered actions that expired (reason=buffer_expired, action_running=...), plus records the new total schedule-time → start delay timer.

Comment thread common/metrics/metric_defs.go Outdated
Comment thread common/metrics/metric_defs.go Outdated
Comment thread chasm/lib/scheduler/scheduler.go
- Fix indentation on ScheduleCallbackLatency metric definition
- Rename schedule_action_start_to_close_delay to schedule_action_e2e_delay
  to avoid confusion with Temporal's start-to-close timeout terminology
- Use ctx.Now(s) instead of time.Since for schedule_callback_latency to
  use a consistent time source, clamp negative durations from clock skew
- Compute action_running tag per-start using DesiredTime as evidence that
  a running action held the start past its catchup deadline, instead of a
  single boolean for the whole batch
- Update recordCompletedAction to set DesiredTime on the first deferred
  start specifically, not whichever pending start comes first
- Replace generatedCount int with recordedGenerateLatency bool
- Initialize OverlapSkippedByPolicy map upfront, remove nil checks
- Update comments to use action terminology
- Revert DesiredTime update to original pattern (first pending start
  after re-enabling deferred starts) with added comment about
  action_running attribution
- Remove unnecessary if-guard around overlapSkippedByPolicy iteration
  in both CHASM and legacy paths
- Replace strconv.FormatBool with fmt.Sprintf for action_running tag
- Add generate latency comment to legacy processTimeRange
- Expand legacy DesiredTime comment explaining difference from CHASM
- Add zero-time safety comment on actionRunning attribution
@chaptersix chaptersix marked this pull request as ready for review June 3, 2026 23:38
@chaptersix chaptersix requested review from a team as code owners June 3, 2026 23:38
// Update Scheduler metadata.
var totalMissedCatchup int64
for _, count := range result.missedCatchupByActionRunning {
totalMissedCatchup += count
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, isn't this just?

totalMissedCatchup := len(result.missedCatchupByActionRunning)

Comment thread service/worker/scheduler/workflow.go
Copy link
Copy Markdown
Contributor

@davidporter-id-au davidporter-id-au left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to the extent my limited knowledge is useful here, lgtm

Resolve conflict in invoker_tasks.go: keep upstream's reordering
(deadline check before useScheduledAction) with our metric additions.
…mments

- Gate buffer_expired metric emission on useScheduledAction(false) so
  paused and action-exhausted schedules don't emit misleading metrics
- Fix V1 not_buffered comment: action_running is omitted because
  running state is irrelevant (not because it's stale)
@chaptersix chaptersix enabled auto-merge (squash) June 5, 2026 21:08
@chaptersix chaptersix merged commit 71dd8ed into temporalio:main Jun 5, 2026
56 of 58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants