[Dataflow Streaming] [Multi Key] MultiKey failure handling + Integration #38919

Open
arunpandianp wants to merge 26 commits into apache:master from arunpandianp:multikey_failure

@arunpandianp

Contributor

This change adds failure handling for multi-key commits.
Integrates StreamingWorkScheduler with the multi-key commit methods.
Updates StreamingModeExecutionContext::advance to pull in more items from BoundedWorkQueue.

All changes are behind the experiment unstable_enable_multi_key_bundle and do not affect the default logic.
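
For readers unfamiliar with experiment gating, here is a minimal sketch of how a flag like unstable_enable_multi_key_bundle can select a code path. It is not the code in this PR: ExperimentGate, isEnabled, and selectCommitPath are hypothetical names, and the experiments are modeled as a plain list of strings so the example stays self-contained. In Beam itself the lookup would normally go through ExperimentalOptions.hasExperiment.

```java
import java.util.List;

/** Hypothetical stand-in for an experiment lookup; not the Beam API and not this PR's code. */
final class ExperimentGate {
  static final String MULTI_KEY_EXPERIMENT = "unstable_enable_multi_key_bundle";

  /** Returns true only if the flag is present verbatim in the experiment list. */
  static boolean isEnabled(List<String> experiments, String flag) {
    if (experiments == null) {
      return false;
    }
    for (String experiment : experiments) {
      if (flag.equals(experiment)) {
        return true;
      }
    }
    return false;
  }

  /** The default (single-key) path is taken unless the experiment is explicitly set. */
  static String selectCommitPath(List<String> experiments) {
    return isEnabled(experiments, MULTI_KEY_EXPERIMENT) ? "multi-key" : "single-key";
  }
}
```

With no experiments set, selectCommitPath returns "single-key", which is the sense in which the default logic stays untouched.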

…lients

- Add MultiKeyWorkItemCommitRequest to windmill.proto.
- Support MultiKey commits in Commit model and StreamingEngineWorkCommitter.
- Update GrpcCommitWorkStream to batch and stream MultiKey commit requests.
# Conflicts:
#	runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingModeExecutionContext.java
Resolved conflicts in StreamingModeExecutionContext.java and StreamingModeExecutionContextTest.java.
Fixed compilation error in Work.java by removing duplicate getComputationId() method.

TAG=agy
CONV=143daaa5-e902-4d26-820d-cf1af2babb84
@arunpandianp

Contributor Author

R: @scwhittle This change is on top of #38814 and #38768 PTAL

@github-actions

Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment "assign set of reviewers".

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces multi-key commit support for Dataflow streaming, allowing multiple work items to be committed in a single transaction. It enhances failure handling by enabling re-execution of specific work items within a batch if a retryable failure occurs. Additionally, it optimizes the execution context to pull more work items from the BoundedWorkQueue, thereby increasing processing efficiency for multi-key bundles.

Highlights

  • Multi-Key Commit Failure Handling: Introduced robust failure handling for multi-key commits, allowing for partial re-execution of work items upon retryable failures.
  • Streaming Scheduler Integration: Integrated StreamingWorkScheduler with multi-key commit methods to enable efficient batching of work items.
  • StreamingModeExecutionContext Updates: Updated StreamingModeExecutionContext::advance to pull more items from the BoundedWorkQueue, improving throughput for multi-key bundles.
  • Experimental Feature Flag: All changes are guarded by the 'unstable_enable_multi_key_bundle' experiment flag, ensuring no impact on default logic.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces multi-key bundle support for streaming in the Google Cloud Dataflow Java worker, enabling batching and advancing through multiple keys within a key group, as well as transactionally committing multi-key work batches. Feedback suggests adding robust error handling when parsing experimental options to prevent worker crashes, and implementing defensive null and bounds checks when accessing read operation receivers to avoid potential runtime exceptions.
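
To make the batching idea concrete, here is a rough sketch of draining a queue until either a size cap or a time budget is reached. It is an illustration only, not the implementation in this PR: KeyGroupBatcher and nextBatch are hypothetical names, the queue is a plain java.util.Queue rather than BoundedWorkQueue, and the clock is injected so the behavior can be tested. The caps of 100 items and 100 ms used in the usage note mirror the defaults visible in the constructor snippet under review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch of size- and time-bounded batching; not the code in this PR. */
final class KeyGroupBatcher {
  private final int maxBatchSize;
  private final long maxBatchTimeNanos;
  private final LongSupplier nanoClock;

  KeyGroupBatcher(int maxBatchSize, long maxBatchTimeMs, LongSupplier nanoClock) {
    this.maxBatchSize = maxBatchSize;
    this.maxBatchTimeNanos = TimeUnit.MILLISECONDS.toNanos(maxBatchTimeMs);
    this.nanoClock = nanoClock;
  }

  /** Pulls items until the queue is empty, the size cap is hit, or the time budget is spent. */
  <T> List<T> nextBatch(Queue<T> queue) {
    List<T> batch = new ArrayList<>();
    long startNanos = nanoClock.getAsLong();
    while (batch.size() < maxBatchSize
        && nanoClock.getAsLong() - startNanos < maxBatchTimeNanos) {
      T item = queue.poll();
      if (item == null) {
        break; // Nothing left to pull; return what we have.
      }
      batch.add(item);
    }
    return batch;
  }
}
```

A caller might construct it as new KeyGroupBatcher(100, 100, System::nanoTime) and call nextBatch repeatedly; items that do not fit in one batch stay in the queue for the next call.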

Comment on lines +252 to +259
    String batchSizeStr =
        ExperimentalOptions.getExperimentValue(options, WINDMILL_MAX_KEY_GROUP_BATCH_SIZE);
    this.maxKeyGroupBatchSize = batchSizeStr != null ? Integer.parseInt(batchSizeStr) : 100;

    String batchTimeStr =
        ExperimentalOptions.getExperimentValue(options, WINDMILL_MAX_KEY_GROUP_BATCH_TIME_MS);
    this.maxKeyGroupBatchTimeNanos =
        TimeUnit.MILLISECONDS.toNanos(batchTimeStr != null ? Long.parseLong(batchTimeStr) : 100);

Contributor

medium

Parsing user-provided experimental options directly using Integer.parseInt and Long.parseLong without error handling can cause the worker to crash if the values are malformed. It is safer to wrap these in a try-catch block and fall back to the default values with a warning log.

    String batchSizeStr =
        ExperimentalOptions.getExperimentValue(options, WINDMILL_MAX_KEY_GROUP_BATCH_SIZE);
    int batchSize = 100;
    if (batchSizeStr != null) {
      try {
        batchSize = Integer.parseInt(batchSizeStr);
      } catch (NumberFormatException e) {
        LOG.warn("Failed to parse {} as integer, using default of 100", WINDMILL_MAX_KEY_GROUP_BATCH_SIZE, e);
      }
    }
    this.maxKeyGroupBatchSize = batchSize;

    String batchTimeStr =
        ExperimentalOptions.getExperimentValue(options, WINDMILL_MAX_KEY_GROUP_BATCH_TIME_MS);
    long batchTimeMs = 100;
    if (batchTimeStr != null) {
      try {
        batchTimeMs = Long.parseLong(batchTimeStr);
      } catch (NumberFormatException e) {
        LOG.warn("Failed to parse {} as long, using default of 100", WINDMILL_MAX_KEY_GROUP_BATCH_TIME_MS, e);
      }
    }
    this.maxKeyGroupBatchTimeNanos = TimeUnit.MILLISECONDS.toNanos(batchTimeMs);
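
The two try-catch blocks in the suggestion have the same shape, so one possible follow-up is to factor them into a small helper. The sketch below is only an illustration of that idea: ExperimentParsing and parseOrDefault are hypothetical names that do not exist in Beam, and the warning log is left as a comment so the example has no logging dependency.

```java
import java.util.function.Function;

/** Hypothetical helper; not part of Beam or of this PR. */
final class ExperimentParsing {
  /**
   * Parses raw with the given parser, returning defaultValue when raw is null
   * or when the parser throws NumberFormatException.
   */
  static <T> T parseOrDefault(String raw, T defaultValue, Function<String, T> parser) {
    if (raw == null) {
      return defaultValue;
    }
    try {
      return parser.apply(raw);
    } catch (NumberFormatException e) {
      // A real worker would log a warning here, naming the experiment and the default.
      return defaultValue;
    }
  }
}
```

Under that assumption the constructor code would shrink to calls such as ExperimentParsing.parseOrDefault(batchSizeStr, 100, Integer::parseInt) and ExperimentParsing.parseOrDefault(batchTimeStr, 100L, Long::parseLong).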

Comment on lines +714 to +718
    HashMap<String, ElementCounter> counters =
        ((DataflowMapTaskExecutor) workExecutor)
            .getReadOperation()
            .receivers[0]
            .getOutputCounters();

Contributor

medium

Defensive programming: Accessing receivers[0] directly without checking if getReadOperation() is null, or if receivers is null or empty, can lead to NullPointerException or ArrayIndexOutOfBoundsException. Adding appropriate guards ensures robust execution.

    DataflowMapTaskExecutor mapTaskExecutor = (DataflowMapTaskExecutor) workExecutor;
    if (mapTaskExecutor.getReadOperation() == null
        || mapTaskExecutor.getReadOperation().receivers == null
        || mapTaskExecutor.getReadOperation().receivers.length == 0) {
      return 0L;
    }
    HashMap<String, ElementCounter> counters =
        mapTaskExecutor.getReadOperation().receivers[0].getOutputCounters();
    if (counters == null) {
      return 0L;
    }

@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 88.44884% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.39%. Comparing base (1cf3545) to head (0df9f54).
⚠️ Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/beam/runners/dataflow/worker/streaming/Work.java | 69.44% | 10 Missing and 1 partial ⚠️ |
| ...dataflow/worker/StreamingModeExecutionContext.java | 94.59% | 8 Missing and 2 partials ⚠️ |
| ...nners/dataflow/worker/WindowingWindmillReader.java | 81.08% | 3 Missing and 4 partials ⚠️ |
| ...ers/dataflow/worker/util/BoundedQueueExecutor.java | 75.00% | 0 Missing and 3 partials ⚠️ |
| ...nners/dataflow/worker/StreamingDataflowWorker.java | 60.00% | 1 Missing and 1 partial ⚠️ |
| ...unners/dataflow/worker/WorkCancelingException.java | 71.42% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             master   #38919       +/-   ##
=============================================
- Coverage     59.33%   55.39%    -3.95%     
+ Complexity    16593     2243    -14350     
=============================================
  Files          2845     1104     -1741     
  Lines        291334   171355   -119979     
  Branches      14421     1437    -12984     
=============================================
- Hits         172859    94915    -77944     
+ Misses       111065    74000    -37065     
+ Partials       7410     2440     -4970     
| Flag | Coverage Δ |
|---|---|
| java | 75.76% <88.44%> (+9.65%) ⬆️ |
