Skip to content

[compiler] eliminate hasStreamParam branching by reusing produceIterator#15372

Draft
patrick-schultz wants to merge 3 commits into
hail-is:mainfrom
patrick-schultz:ps/push-mspslttpsvkk
Draft

[compiler] eliminate hasStreamParam branching by reusing produceIterator#15372
patrick-schultz wants to merge 3 commits into
hail-is:mainfrom
patrick-schultz:ps/push-mspslttpsvkk

Conversation

@patrick-schultz
Copy link
Copy Markdown
Member

Change Description

Lift produceIterator from a nested function inside EmitStream.produce to a private[ir] method on object EmitStream, and rewrite the stream compilation path in Compile to use it instead of the bespoke compileStepper machinery. This removes the hasStreamParam branching that selected between TMPStepFunction and TableStageToRVDStepFunction, making stream compilation fully generic.

Key changes:

  • EmitStream.produceIterator is now a top-level method that takes an EmitContext and derives elementPType internally. Stream production is emitted in the constructor with mb=next, so the producer's labels are bound to the next() method while setup runs once in the ctor.
  • To avoid a complication with let-bound streams (details below), Compile.Impl now always runs one last ForwardLets pass. When I'm done reworking our LoweringPass infrastructure, this should be handled there instead.
  • Compile.Impl handles TStream by calling produceIterator and wrapping the result in a NoBoxLongIteratorAdapter.
  • CompileIterator object deleted entirely.

Let-bound stream issue

This change exposed a pre-existing bug involving streams bound in a Block. For background, EmitStream.produce takes both a CodeBuilder cb and a MethodBuilder mb. cb is where setup code is emitted (e.g. the computation of i in StreamIota(i)), while mb is the method which will contain the stream, i.e. where the stream's labels will be defined, and therefore where the stream must be consumed.

The problem is that when encountering a bound stream in emitBlock, we assumed mb = cb.emb, i.e. the stream will be consumed in the current method. This isn't a safe assumption, and in fact it's hard to know which method the stream will be consumed in, especially as emitBlock also does method splitting.

Rather than introducing significant complexity to emitBlock, I think the right fix is to enforce the invariant that when we call Emit, there can be no let-bound streams. This is easily enforced by ForwardLets, because a bound stream is always forwardable.

Security Assessment

  • This change cannot impact the Hail Batch instance as deployed by Broad Institute in GCP

Lift `produceIterator` from a nested function inside `EmitStream.produce`
to a `private[ir]` method on `object EmitStream`, and rewrite the stream
compilation path in `Compile` to use it instead of the bespoke
`compileStepper` machinery. This removes the `hasStreamParam` branching
that selected between `TMPStepFunction` and `TableStageToRVDStepFunction`,
making stream compilation fully generic.

Key changes:
- `EmitStream.produceIterator` is now a top-level method that takes an
  `EmitContext` and derives `elementPType` internally. Stream production
  is emitted in the constructor with `mb=next`, so the producer's labels
  are bound to the `next()` method while setup runs once in the ctor.
- To make this safe, `emitBlock` gains an overload that accepts an
  explicit `streamMb: EmitMethodBuilder[_]`, and `EmitStream.produce`
  threads the outer `mb` through both the `Block` case and the `emit`
  helper. This ensures stream-typed `Let` bindings get their producers
  bound to the correct method (`next`), not `cb.emb` (which is `ctor`),
  preventing `SStreamControlFlow` method-mismatch assertions.
- `Compile.Impl` handles `TStream` by calling `produceIterator` and
  wrapping the result in a `NoBoxLongIteratorAdapter`.
- `CompileIterator` object deleted entirely (~280 lines), including
  `compileStepper`, `compileStream`, `LongIteratorWrapper`, and the
  deprecated `forTableStageToRVD`/`forTableMapPartitions` methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cjllanwarne
Copy link
Copy Markdown
Collaborator

If #15395 merges, the test batch for this PR would only have run the following steps:

[
    'check_hail',
    'compile_hail_213',
    'merge_code',
    'test_batch_invariants',
    'test_hail_java',
    'test_hail_python',
    'test_hail_python_local_backend',
    'test_hail_python_service_backend_gcp',
    'test_hail_python_unchecked_allocator',
    'test_hail_scala_fs',
    'test_hail_services_java',
    'test_hail_spark_conf_requester_pays_parsing',
    'test_hailgenetics_hail_image',
    'test_hailgenetics_hail_image_python_3_10',
    'test_hailgenetics_hail_image_python_3_12',
    'test_hailgenetics_hail_image_python_3_13',
    'test_python_docs',
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants