From d459867e6a63d4d50804d03b54d4c483e9c5ed26 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Sat, 13 Jun 2026 08:17:40 -0600 Subject: [PATCH] docs: adopt issue #4419 terminology in Understanding Comet Plans guide Replace bare uses of "native" with precise terms from the terminology framework: "Rust-implemented" / "implemented in Rust" for the implementation language, "Comet pipeline" for pipeline membership, and "runs in Comet" / "falls back to Spark" for fallback. Keep the compound forms the issue allows (native Rust, native shuffle, and operator names such as CometNativeScan). --- .../latest/understanding-comet-plans.md | 69 ++++++++++--------- 1 file changed, 35 insertions(+), 34 deletions(-) diff --git a/docs/source/user-guide/latest/understanding-comet-plans.md b/docs/source/user-guide/latest/understanding-comet-plans.md index 8adccc0929..07c1086624 100644 --- a/docs/source/user-guide/latest/understanding-comet-plans.md +++ b/docs/source/user-guide/latest/understanding-comet-plans.md @@ -27,13 +27,14 @@ inspect that behavior. When Comet is enabled, the `CometSparkSessionExtensions` rules walk the physical plan bottom-up and replace Spark operators with Comet equivalents -where possible. Consecutive native operators are combined into a single block -that is serialized as protobuf and executed by DataFusion on the executor. +where possible. Consecutive Rust-implemented operators are combined into a +single block that is serialized as protobuf and executed by DataFusion on the +executor. Operators that Comet does not support remain as their original Spark form. As a result, a plan can mix three kinds of nodes: -- **`Comet*` nodes** that run natively in Rust (for example `CometProject`, +- **`Comet*` nodes** that are implemented in Rust (for example `CometProject`, `CometHashAggregate`). - **`Comet*` nodes that run on the JVM** but are still part of the Comet pipeline (for example `CometBroadcastExchange`, `CometColumnarExchange`). @@ -62,9 +63,9 @@ the same plan is shown in the Spark SQL UI. When reading a plan, look for: ## Fallback -A "fallback" happens when Comet cannot translate part of a plan into native -execution. Fallback can be partial (a subtree falls back while the rest stays -native) or full (no Comet nodes appear). +A "fallback" happens when Comet cannot run part of a plan in the Comet +pipeline. Fallback can be partial (a subtree falls back to Spark while the rest +stays in the Comet pipeline) or full (no Comet nodes appear). Common reasons: @@ -88,15 +89,15 @@ They serve different purposes and produce output in different places. | Config | Output destination | What you see | | ---------------------------------------- | ---------------------------------- | --------------------------------------------------------------------------------------------- | -| `spark.comet.explainFallback.enabled` | Driver log (only when fallback) | A WARN with the list of reasons each query stage could not run natively. | +| `spark.comet.explainFallback.enabled` | Driver log (only when fallback) | A WARN with the list of reasons each query stage could not run in Comet. | | `spark.comet.logFallbackReasons.enabled` | Driver log | One WARN per fallback reason as it is encountered, without surrounding plan context. | | `spark.comet.explain.format` | Spark SQL UI (Spark 4.0 and newer) | Annotated plan or fallback-reason list, depending on `verbose` (default) or `fallback` value. | -| `spark.comet.explain.native.enabled` | Executor logs, per task | The DataFusion native plan with metrics, useful for inspecting native execution. | +| `spark.comet.explain.native.enabled` | Executor logs, per task | The DataFusion plan with metrics, useful for inspecting Rust execution. | ### `spark.comet.explainFallback.enabled` Logs a single WARN listing the reasons each query stage could not be executed -natively. Nothing is logged when the entire stage runs in Comet. Useful as a +in Comet. Nothing is logged when the entire stage runs in Comet. Useful as a low-noise check that fallback is or is not happening. ### `spark.comet.logFallbackReasons.enabled` @@ -131,12 +132,12 @@ the config has no effect there. ### `spark.comet.explain.native.enabled` -When enabled, each executor task logs the DataFusion native plan it executes, +When enabled, each executor task logs the DataFusion plan it executes, along with metrics. This is verbose because there is one plan per task, but it -is the only way to see the native plan as DataFusion sees it (including how +is the only way to see the plan as DataFusion sees it (including how operators were arranged after Comet's serialization). See the -[Metrics Guide](metrics.md) for details on the native metrics that appear in -this output. +[Metrics Guide](metrics.md) for details on the DataFusion metrics that appear +in this output. ## Programmatic Access to Fallback Reasons @@ -193,14 +194,14 @@ by role. Names match what is shown in the plan output. | Node | Description | | ------------------------ | --------------------------------------------------------------------------------------------- | | `CometBatchScan` | DataSource V2 scan, including Iceberg Parquet, that produces Arrow batches consumed by Comet. | -| `CometNativeScan` | Fully native Parquet scan that runs entirely in DataFusion. | -| `CometIcebergNativeScan` | Fully native Iceberg Parquet scan. | -| `CometCsvNativeScan` | Fully native CSV scan (experimental). | +| `CometNativeScan` | Parquet scan that runs entirely in Rust via DataFusion. | +| `CometIcebergNativeScan` | Iceberg Parquet scan that runs entirely in Rust via DataFusion. | +| `CometCsvNativeScan` | CSV scan that runs entirely in Rust via DataFusion (experimental). | -### Native Execution Operators +### Native Rust Operators -These run natively in DataFusion. When several appear consecutively in a plan, -they execute as a single fused native block. +These are implemented in Rust and run in DataFusion. When several appear +consecutively in a plan, they execute as a single fused block. | Node | Spark equivalent | | ------------------------------ | ----------------------------------------------- | @@ -223,13 +224,13 @@ they execute as a single fused native block. These keep their data on the JVM but participate in the Comet pipeline. -| Node | Notes | -| ------------------------ | ------------------------------------------------------------------------------------- | -| `CometUnion` | JVM-side union of Comet inputs. The native side reads each branch as a separate scan. | -| `CometCoalesce` | JVM-side partition coalesce. | -| `CometCollectLimit` | JVM-side collect limit, equivalent to `CollectLimitExec`. | -| `CometBroadcastExchange` | Broadcast exchange producing serialized Arrow batches that the consumer can decode. | -| `CometSubqueryBroadcast` | Companion to `CometBroadcastExchange` for dynamic partition pruning subqueries. | +| Node | Notes | +| ------------------------ | ----------------------------------------------------------------------------------- | +| `CometUnion` | JVM-side union of Comet inputs. The Rust side reads each branch as a separate scan. | +| `CometCoalesce` | JVM-side partition coalesce. | +| `CometCollectLimit` | JVM-side collect limit, equivalent to `CollectLimitExec`. | +| `CometBroadcastExchange` | Broadcast exchange producing serialized Arrow batches that the consumer can decode. | +| `CometSubqueryBroadcast` | Companion to `CometBroadcastExchange` for dynamic partition pruning subqueries. | ### Shuffle Operators @@ -239,8 +240,8 @@ use: - **`CometExchange`** is the **native shuffle** path. The child must already be a Comet operator producing columnar Arrow batches; the node calls `executeColumnar()` on its child and the partition, encode, and compress - steps run in native code. Hash and range partitioning **keys** must be - primitive types because native hashing and ordering do not support complex + steps run in Rust. Hash and range partitioning **keys** must be + primitive types because the Rust hashing and ordering do not support complex types, but the data columns themselves can include `StructType`, `ArrayType`, and `MapType` since batches are serialized via the Arrow IPC writer. @@ -265,12 +266,12 @@ Comet inserts these nodes wherever data has to cross the columnar/row boundary. Multiple implementations exist because the optimal strategy depends on what produced the columnar data. -| Node | Direction | Notes | -| ------------------------------ | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `CometColumnarToRow` | columnar → row | JVM-based row conversion. A fork of Spark's `ColumnarToRowExec` that includes the SPARK-50235 fix. | -| `CometNativeColumnarToRow` | columnar → row | Native row conversion that decodes broadcast Arrow batches via `NativeColumnarToRowConverter`. Used downstream of `CometBroadcastExchange`. Zero-copy for variable-length types and avoids an extra JVM materialization step. | -| `CometSparkColumnarToColumnar` | columnar → columnar | Converts a Spark columnar input (a non-Comet `ColumnarBatch`) into Comet's Arrow batches. | -| `CometSparkRowToColumnar` | row → columnar | Converts a Spark row input into Comet's Arrow batches. | +| Node | Direction | Notes | +| ------------------------------ | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `CometColumnarToRow` | columnar → row | JVM-based row conversion. A fork of Spark's `ColumnarToRowExec` that includes the SPARK-50235 fix. | +| `CometNativeColumnarToRow` | columnar → row | Rust-based row conversion that decodes broadcast Arrow batches via `NativeColumnarToRowConverter`. Used downstream of `CometBroadcastExchange`. Zero-copy for variable-length types and avoids an extra JVM materialization step. | +| `CometSparkColumnarToColumnar` | columnar → columnar | Converts a Spark columnar input (a non-Comet `ColumnarBatch`) into Comet's Arrow batches. | +| `CometSparkRowToColumnar` | row → columnar | Converts a Spark row input into Comet's Arrow batches. | The two `CometSpark*` names come from a single `CometSparkToColumnarExec` operator that picks the node name based on whether its child supports