Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions docs/contributor-guide/flownode/batching_mode.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
keywords: [streaming process, flow management, Flownode components, Flownode limitations, batching mode]
description: Overview of Flownode's batching mode, a component providing continuous data aggregation capabilities to the database, including its architecture and query execution flow.
keywords: [batching mode, flow management, Flownode components, Flownode limitations, continuous aggregation]
description: Overview of Flownode's batching mode, the active execution mode for continuous data aggregation, including its architecture and query execution flow.
---

# Flownode Batching Mode Developer Guide
Expand All @@ -9,7 +9,7 @@ This guide provides a brief overview of the batching mode in `flownode`. It's in

## Overview

The batching mode in `flownode` is designed for continuous data aggregation. It periodically executes a user-defined SQL query over small, discrete time windows. This is in contrast to a streaming mode where data is processed as it arrives.
The batching mode in `flownode` is designed for continuous data aggregation. It periodically executes a user-defined SQL query over small, discrete time windows. This is in contrast to the original streaming mode, now deprecated, where data was processed as it arrived.

The core idea is to:
1. Define a `flow` with a SQL query that aggregates data from a source table into a sink table.
Expand Down Expand Up @@ -71,4 +71,4 @@ Here's a simplified step-by-step walkthrough of how a query is executed in batch
- Rewrites the original SQL query to include this new filter, ensuring that only the necessary data is processed.
5. **Execution**: The modified query plan is sent to the `Frontend` for execution. The database processes the aggregation on the filtered data.
6. **Upsert**: The results are inserted into the sink table. The sink table is typically defined with a primary key that includes the time window column, so new results for an existing window will overwrite (upsert) the old ones.
7. **State Update**: The `DirtyTimeWindows` set is cleared of the windows that were just processed. The task then goes back to sleep until the next interval.
7. **State Update**: The `DirtyTimeWindows` set is cleared of the windows that were just processed. The task then goes back to sleep until the next interval.
16 changes: 8 additions & 8 deletions docs/contributor-guide/flownode/overview.md
Original file line number Diff line number Diff line change
@@ -1,26 +1,26 @@
---
keywords: [streaming process, flow management, standalone mode, Flownode components, Flownode limitations]
description: Overview of Flownode, a component providing streaming process capabilities to the database, including its components and current limitations.
keywords: [continuous aggregation, flow management, standalone mode, Flownode components, Flownode limitations]
description: Overview of Flownode, a component providing Flow computation capabilities to the database, including batching mode, deprecated streaming mode, and core components.
---

# Flownode

## Introduction


`Flownode` provides a simple streaming process (known as `flow`) ability to the database.
`Flownode` provides Flow computation capabilities to the database.
`Flownode` manages `flows` which are tasks that receive data from the `source` and send data to the `sink`.

`Flownode` support both `standalone` and `distributed` mode. In `standalone` mode, `Flownode` runs in the same process as the database. In `distributed` mode, `Flownode` runs in a separate process and communicates with the database through the network.

There are two execution modes for a flow:
- **Streaming Mode**: The original mode where data is processed as it arrives.
- **Batching Mode**: A newer mode designed for continuous data aggregation. It periodically executes a user-defined SQL query over small, discrete time windows. All aggregation queries now use this mode. For more details, see the [Batching Mode Developer Guide](./batching_mode.md).
- **Batching Mode**: The active mode for continuous data aggregation. It periodically executes a user-defined SQL query over small, discrete time windows. Aggregation and TQL queries use this mode. For more details, see the [Batching Mode Developer Guide](./batching_mode.md).
- **Streaming Mode (deprecated)**: The original mode where data is processed as it arrives. It is kept for legacy compatibility and is not recommended for new workloads.

## Components

A `Flownode` contains all the components needed to execute a flow. The specific components involved depend on the execution mode (Streaming vs. Batching). At a high level, the key parts are:
A `Flownode` contains all the components needed to execute a flow. The specific components involved depend on the execution mode. At a high level, the key parts are:

- **Flow Manager**: A central component responsible for managing the lifecycle of all flows.
- **Task Executor**: The runtime environment where the flow logic is executed. In streaming mode, this is typically a `FlowWorker`; in batching mode, it's a `BatchingTask`.
- **Flow Task**: Represents a single, independent data flow, containing the logic for transforming data from a source to a sink.
- **Task Executor**: The runtime environment where the flow logic is executed. In batching mode, this is a `BatchingTask`; in the deprecated streaming mode, this is typically a `FlowWorker`.
- **Flow Task**: Represents a single, independent data flow, containing the logic for transforming data from a source to a sink.
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ The GreptimeDB Enterprise Edition deployment includes the following components:
- Meta:Database cluster metadata management component
- Datanode:Data node
- Frontend:Entry point and protocol parsing node
- Flownode(optional): Stream computing node
- Flownode(optional): Flow computation node for continuous aggregation workloads
- Vector Sidecar:Metrics collection agent
- GreptimeDB Standalone: Cluster self-monitoring storage node

Expand Down
4 changes: 2 additions & 2 deletions docs/reference/glossary.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ The query processing layer in GreptimeDB's distributed architecture that serves


### Flow Engine
GreptimeDB's real-time stream processing system that enables continuous, incremental computation on streaming data. Flow Engine works like an intelligent materialized view that automatically updates result tables as new data arrives in source tables. It processes data at configurable intervals (default: one second) with minimal computational overhead, making it ideal for ETL processes, downsampling, real-time analytics, and continuous aggregation scenarios.
GreptimeDB's continuous aggregation engine that enables incremental computation on incoming time-series data. Flow Engine works like an intelligent materialized view that automatically updates result tables as new data arrives in source tables. Aggregation and TQL workloads use batching mode; the original streaming mode is deprecated and is not recommended for new workloads.

---

Expand Down Expand Up @@ -183,7 +183,7 @@ The capability of a database system to handle growing volumes of data and increa
A standardized programming language used for managing and manipulating relational databases. GreptimeDB supports SQL, allowing users to query metrics, logs, and events efficiently.

### Stream Processing
The continuous, real-time processing of data streams as they arrive. In GreptimeDB, stream processing is implemented through the Flow Engine, which performs incremental computation on streaming time-series data. This enables instant filtering, computing, and aggregation of metrics, logs, and events, providing actionable insights with minimal latency.
The continuous, real-time processing of data streams as they arrive. GreptimeDB's Flow Engine provides continuous aggregation through batching mode; its original streaming mode is deprecated and is not recommended for new workloads.

---

Expand Down
2 changes: 1 addition & 1 deletion docs/reference/sql/create.md
Original file line number Diff line number Diff line change
Expand Up @@ -177,7 +177,7 @@ The `ttl` value can be one of the following:

- A [duration](/reference/time-durations.md) like `1hour 12min 5s`.
- `forever`, `NULL`, an empty string `''` and `0s` (or any zero length duration, like `0d`), means the data will never be deleted.
- `instant`, note that database's TTL can't be set to `instant`. `instant` means the data will be deleted instantly when inserted, useful if you want to send input to a flow task without saving it, see more details in [flow management documents](/user-guide/flow-computation/manage-flow.md#manage-flows).
- `instant`, note that database's TTL can't be set to `instant`. `instant` means the data will be deleted instantly when inserted. Avoid using `instant` TTL source tables for new Flow workloads because they fall back to the deprecated streaming mode; see the [flow management documents](/user-guide/flow-computation/manage-flow.md#create-a-source-table).
- Unset, `ttl` can be unset by using `ALTER TABLE <table-name> UNSET 'ttl'`, which means the table will inherit the database's ttl policy (if any).

If a table has its own TTL policy, it will take precedence over the database TTL policy.
Expand Down
2 changes: 1 addition & 1 deletion docs/user-guide/concepts/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ GreptimeDB has three core components in distributed mode, and one optional compo
- [**Metasrv**](/contributor-guide/metasrv/overview.md): Metadata and routing control plane. It manages catalogs/schemas/tables/regions, coordinates scheduling, and serves routing data to other nodes.
- [**Frontend**](/contributor-guide/frontend/overview.md): Stateless access layer. It accepts client protocols, authenticates requests, plans/distributes queries, and routes writes/reads using metadata from Metasrv.
- [**Datanode**](/contributor-guide/datanode/overview.md): Storage and execution layer. It stores table regions, handles reads/writes, persists WAL, and flushes data files to object storage.
- [**Flownode (optional)**](/contributor-guide/flownode/overview.md): Streaming/continuous computation runtime for [Flow Computation](/user-guide/flow-computation/overview.md). It is used when flow workloads run as a separate service in distributed deployments.
- [**Flownode (optional)**](/contributor-guide/flownode/overview.md): Continuous aggregation runtime for [Flow Computation](/user-guide/flow-computation/overview.md). Flow aggregation workloads use batching mode; the original streaming mode is deprecated.

In standalone mode, you run one GreptimeDB process instead of managing these services separately.

Expand Down
2 changes: 1 addition & 1 deletion docs/user-guide/concepts/why-greptimedb.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ GreptimeDB takes a different approach: one database engine for all three signal
GreptimeDB unifies the processing of metrics, logs, and traces through:
- A consistent [data model](./data-model.md) that treats all observability data as timestamped wide events with context
- Native support for both [SQL](/user-guide/query-data/sql.md) and [PromQL](/user-guide/query-data/promql.md) queries
- Built-in stream processing capabilities ([Flow](/user-guide/flow-computation/overview.md)) for real-time aggregation and analytics
- Built-in continuous aggregation capabilities ([Flow](/user-guide/flow-computation/overview.md)) for real-time aggregation and analytics
- Seamless correlation analysis across different types of observability data (read the [SQL example](/getting-started/quick-start.md#correlate-metrics-logs-and-traces) for detailed info)

It replaces complex legacy data stacks with a high-performance single solution.
Expand Down
12 changes: 9 additions & 3 deletions docs/user-guide/flow-computation/manage-flow.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,10 @@ Each `flow` is a continuous aggregation query in GreptimeDB.
It continuously updates the aggregated data based on the incoming data.
This document describes how to create, and delete a flow.

:::note
Flow uses batching mode for aggregation and TQL workloads. Simple non-aggregation Flow queries currently use the deprecated streaming mode and are not recommended for new workloads.
:::

## Create a Source Table

Before creating a flow, you need to create a source table to store the raw data. Like this:
Expand All @@ -22,7 +26,9 @@ CREATE TABLE temp_sensor_data (
PRIMARY KEY(sensor_id, loc)
);
```
However, if you don't want to store the raw data, you can use a temporary table as the source table by creating table using `WITH ('ttl' = 'instant')` table option:
Avoid using `WITH ('ttl' = 'instant')` for new Flow source tables. Source tables with `ttl='instant'` fall back to the deprecated streaming mode. Keep the source data with an appropriate TTL instead, so aggregation and TQL Flow workloads can run in batching mode.

For existing legacy streaming-mode deployments, a source table may use `WITH ('ttl' = 'instant')`:

```sql
CREATE TABLE temp_sensor_data (
Expand All @@ -34,7 +40,7 @@ CREATE TABLE temp_sensor_data (
) WITH ('ttl' = 'instant');
```

Setting `'ttl'` to `'instant'` will make the table a temporary table, which means it will automatically discard all inserted data and the table will always be empty, only sending them to flow task for computation.
Setting `'ttl'` to `'instant'` makes the table discard inserted data immediately and only sends rows to a legacy streaming-mode flow task. This pattern is deprecated for new workloads.

## Create a Sink Table

Expand Down Expand Up @@ -152,7 +158,7 @@ It is important to note that the `EXPIRE AFTER` clause does not delete data from
It only controls how the flow engine processes the data.
If you want to delete data from the source or sink table, please [set the `TTL` option](/user-guide/manage-data/overview.md#manage-data-retention-with-ttl-policies) when creating tables.

Setting a reasonable time interval for `EXPIRE AFTER` is helpful to limit state size and avoid memory overflow. This is somewhere similar to the ["Watermarks"](https://docs.risingwave.com/processing/watermarks) concept in streaming processing.
Setting a reasonable time interval for `EXPIRE AFTER` is helpful to limit how far back the batching engine needs to recompute results and to avoid excessive resource usage. It serves a similar purpose to bounding lateness in stream processing systems, but new Flow workloads should use batching mode.

For example, if the flow engine processes the aggregation at 10:00:00 and the `'1 hour'::INTERVAL` is set,
any input data that arrive now with a time index older than 1 hour (before 09:00:00) will expire and be ignore.
Expand Down
15 changes: 9 additions & 6 deletions docs/user-guide/flow-computation/overview.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
---
keywords: [Flow engine, real-time computation, ETL, data streams, user agent statistics, nginx logs]
description: Discover how GreptimeDB's Flow engine enables real-time computation of data streams for ETL processes and on-the-fly calculations. Learn about its programming model, use cases, and a quick start example for calculating user agent statistics from nginx logs.
keywords: [Flow engine, real-time computation, ETL, continuous aggregation, user agent statistics, nginx logs]
description: Discover how GreptimeDB's Flow engine enables real-time continuous aggregations on incoming data for ETL processes and analytics. Learn about its batching execution model, use cases, and a quick start example for calculating user agent statistics from nginx logs.
---

# Flow Computation

GreptimeDB's Flow engine enables real-time computation of data streams.
It is particularly beneficial for Extract-Transform-Load (ETL) processes or for performing on-the-fly filtering, calculations and queries such as sum, average, and other aggregations.
GreptimeDB's Flow engine enables real-time computation on incoming data.
It is particularly beneficial for Extract-Transform-Load (ETL) processes or for performing continuous aggregations such as sum, average, and other time-window calculations.
The Flow engine ensures that data is processed incrementally and continuously,
updating the final results as new streaming data arrives.
updating the final results as new data arrives.
You can think of it as a clever materialized views that know when to update result view table and how to update it with minimal effort.

Use cases include:
Expand All @@ -18,6 +18,10 @@ Use cases include:

## Programming Model

:::note
Flow uses batching mode for aggregation and TQL workloads. Simple non-aggregation Flow queries currently use the deprecated streaming mode and are not recommended for new workloads.
:::

Upon data insertion into the source table,
the data is concurrently ingested to the Flow engine.
At each trigger interval (one second),
Expand Down Expand Up @@ -122,4 +126,3 @@ The query results will display the total count of each user agent in the `user_a
- [Continuous Aggregation](./continuous-aggregation.md): Explore the primary scenario in time-series data processing, with three common use cases for continuous aggregation.
- [Manage Flow](manage-flow.md): Gain insights into the mechanisms of the Flow engine and the SQL syntax for defining a Flow.
- [Expressions](expressions.md): Learn about the expressions supported by the Flow engine for data transformation.

3 changes: 1 addition & 2 deletions docs/user-guide/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ GreptimeDB supports PromQL for querying data. Refer to the [PromQL Documentation

### Flow Computation

For real-time data processing and analysis, GreptimeDB provides [Flow Computation](./flow-computation/overview.md), enabling complex computations on data streams.
For real-time data processing and analysis, GreptimeDB provides [Flow Computation](./flow-computation/overview.md), enabling continuous aggregations on incoming time-series data.

## Accelerating Queries with Indexes

Expand All @@ -70,4 +70,3 @@ Follow the step-by-step instructions in the [Migration Documentation](./migrate-
## Administering and Deploying GreptimeDB

When you're ready to deploy GreptimeDB, consult the [Deployment & Administration Documentation](/user-guide/deployments-administration/overview.md) for detailed guidance on deployment and management.

Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
keywords: [流处理, flow 管理, Flownode 组件, Flownode 限制, 批处理模式]
description: Flownode 批处理模式概述,一个为数据库提供持续数据聚合能力的组件,包括其架构和查询执行流程。
keywords: [批处理模式, flow 管理, Flownode 组件, Flownode 限制, 持续聚合]
description: Flownode 批处理模式概述,这是持续数据聚合当前使用的执行模式,包括其架构和查询执行流程。
---

# Flownode 批处理模式开发者指南
Expand All @@ -9,7 +9,7 @@ description: Flownode 批处理模式概述,一个为数据库提供持续数

## 概述

`flownode` 中的批处理模式专为持续数据聚合而设计。它在离散的、微小的时间窗口上周期性地执行用户定义的 SQL 查询。这与数据在到达时即被处理的流处理模式形成对比
`flownode` 中的批处理模式专为持续数据聚合而设计。它在离散的、微小的时间窗口上周期性地执行用户定义的 SQL 查询。这与原始的流处理模式形成对比;流处理模式现在已经废弃,在该模式下数据会在到达时即被处理

其核心思想是:
1. 定义一个带有 SQL 查询的 `flow`,该查询将数据从源表聚合到目标表。
Expand Down Expand Up @@ -71,4 +71,4 @@ description: Flownode 批处理模式概述,一个为数据库提供持续数
- 重写原始 SQL 查询以包含此新过滤器,确保只处理必要的数据。
5. **执行**: 修改后的查询计划被发送到 `Frontend` 执行。数据库处理已过滤数据的聚合。
6. **Upsert**: 结果被插入到目标表中。目标表通常定义了一个包含时间窗口列的主键,因此现有窗口的新结果将覆盖(upsert)旧结果。
7. **状态更新**: `DirtyTimeWindows` 集合中刚刚处理过的窗口被清除。然后任务返回睡眠状态,直到下一个时间间隔。
7. **状态更新**: `DirtyTimeWindows` 集合中刚刚处理过的窗口被清除。然后任务返回睡眠状态,直到下一个时间间隔。
Loading
Loading