VectraFlow: Long-Horizon Semantic Processing over Data and Event Streams with LLMs

Deepti Raghavan; Junhan Liu; Shu Chen; Ugur Cetintemel

arxiv: 2604.03855 · v1 · submitted 2026-04-04 · 💻 cs.DB

Pith reviewed 2026-05-13 16:41 UTC · model grok-4.3

keywords: semantic streaming · LLM operators · complex event processing · unstructured text streams · continuous queries · temporal pattern matching · dataflow engine

The pith

VectraFlow extends relational streaming operators to free-text data using LLMs for continuous semantic processing and event pattern detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VectraFlow, a streaming engine that applies LLMs to continuous flows of unstructured text. It adds filter, map, aggregate, join, group-by, and window operators that run on free text, each available in an LLM-based, embedding-based, or hybrid implementation so that users can trade speed against precision. These operators support stateful processing over long sequences, and a dedicated pattern operator extracts events from documents and then matches temporal rules against them using nondeterministic finite automata (NFAs). This setup lets users detect meaningful signals in raw document streams such as clinical notes, which traditional CEP systems cannot do because they require structured, typed input.

Core claim

VectraFlow extends traditional relational operators with LLM-powered execution over free-text streams, offering a suite of continuous semantic operators -- filter, map, aggregate, join, group-by, and window -- each with configurable throughput-accuracy tradeoffs across LLM-based, embedding-based, and hybrid implementations. Building on this, a semantic event pattern operator lifts complex event processing to unstructured document streams, combining LLM-based event extraction with NFA-based temporal rule matching for stateful reasoning over sequences of semantic events.
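
To make the hybrid implementation idea concrete, here is a minimal sketch of one way a hybrid semantic filter could be structured: a cheap similarity score decides the clear cases, and only uncertain documents are escalated to an LLM. Everything in it is an assumption for illustration, not VectraFlow's API: the names `embed`, `cosine`, `hybrid_filter`, and `fake_llm_judge`, the `low`/`high` thresholds, and the bag-of-words vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable, Iterable, Iterator


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_filter(
    docs: Iterable[str],
    predicate: str,
    llm_judge: Callable[[str, str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> Iterator[str]:
    """Yield documents that satisfy `predicate`.

    Scores at or above `high` are accepted and scores at or below `low`
    are rejected without an LLM call; only the band in between is
    escalated to `llm_judge` (hypothetical thresholds).
    """
    query = embed(predicate)
    for doc in docs:
        score = cosine(embed(doc), query)
        if score >= high:
            yield doc
        elif score > low and llm_judge(doc, predicate):
            yield doc


llm_calls = []


def fake_llm_judge(doc: str, predicate: str) -> bool:
    """Hypothetical LLM stub: a keyword rule that also records each call."""
    llm_calls.append(doc)
    text = doc.lower()
    return "pain" in text and "denies" not in text


stream = [
    "Patient reports chest pain",
    "Discharge summary: wound healing well",
    "Patient denies chest pain today",
    "Pain in the chest since this morning",
]
kept = list(hybrid_filter(stream, "patient reports chest pain", fake_llm_judge))
```

Widening the gap between `low` and `high` sends more documents to the LLM (slower, presumably more accurate); narrowing it leans on the similarity score alone. This is one plausible reading of a throughput-accuracy knob, not a description of how the paper implements it.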

What carries the argument

Continuous semantic operators executed through LLM, embedding, or hybrid paths, together with LLM event extraction followed by NFA-based temporal rule matching on the resulting semantic events.

If this is right

Users can compile natural-language intents into executable graphs of semantic operators over live text streams.
Stateful temporal patterns become detectable directly on sequences of events extracted from unstructured documents.
Each operator can be tuned independently for higher throughput or higher accuracy depending on workload needs.
End-to-end processing moves from raw text input to matched event cohorts without requiring prior data structuring.
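
The second point can be illustrated with a small sketch of sequence matching over already-extracted events. It simulates an NFA by tracking a set of partial runs per key, advancing a run when the next expected event kind arrives and discarding runs that fall outside a time window. The `Event` type, `match_sequence`, and the hard-coded events are assumptions for illustration; in the system described, the events would come from LLM-based extraction, and the paper's actual matching semantics are not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    key: str    # partition key, e.g. a patient identifier
    kind: str   # semantic event type produced by extraction
    ts: float   # event timestamp


def match_sequence(
    events: Iterable[Event], pattern: Sequence[str], within: float
) -> Iterator[Tuple[Event, ...]]:
    """Yield every run of events matching `pattern` in order for one key.

    Events are assumed to arrive in timestamp order. A run expires once
    an incoming event is more than `within` time units past the run's
    first event. Events that do not extend a run are skipped.
    """
    runs: Dict[str, List[List[Event]]] = {}
    for ev in events:
        active = [r for r in runs.get(ev.key, []) if ev.ts - r[0].ts <= within]
        survivors: List[List[Event]] = []
        # The empty run plays the role of the NFA start state.
        for run in active + [[]]:
            if pattern[len(run)] == ev.kind:
                extended = run + [ev]
                if len(extended) == len(pattern):
                    yield tuple(extended)
                else:
                    survivors.append(extended)
            elif run:
                survivors.append(run)
        runs[ev.key] = survivors


extracted = [
    Event("p1", "fever", 0),
    Event("p1", "antibiotic", 2),
    Event("p2", "fever", 3),
    Event("p2", "rash", 4),
    Event("p1", "rash", 5),
]
matches = list(match_sequence(extracted, ["fever", "antibiotic", "rash"], within=10))
```

Here only `p1` completes the pattern; `p2` has a fever followed by a rash with no antibiotic in between, so its partial run never advances past the first state.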

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be layered onto existing stream engines to add semantic capabilities to mixed structured and unstructured workloads.
Real-time monitoring of logs, news, or sensor text might become feasible at scale if the tradeoff knobs prove stable under bursty input.
Extending the NFA matching to include learned temporal models rather than hand-written rules would be a natural next test.

Load-bearing premise

LLM-based, embedding-based, and hybrid operator implementations can deliver practical throughput-accuracy tradeoffs on real unstructured streams without prohibitive latency or accuracy collapse under load.

What would settle it

A sustained high-volume run of clinical documents through the full operator graph where measured per-operator latency exceeds a target bound or accuracy on event extraction and pattern matching falls below a usable threshold.
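
A check of this kind could be scripted roughly as follows. The sketch times a single operator per document, computes F1 against labels, and compares both with target bounds. The function names, the report fields, and the keyword operator used in the example are all hypothetical; no numbers from the paper are implied.

```python
import time
from typing import Callable, Dict, Sequence


def f1_score(predicted: Sequence[bool], expected: Sequence[bool]) -> float:
    """F1 over boolean decisions; returns 0.0 when there are no true positives."""
    tp = sum(1 for p, e in zip(predicted, expected) if p and e)
    fp = sum(1 for p, e in zip(predicted, expected) if p and not e)
    fn = sum(1 for p, e in zip(predicted, expected) if e and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def evaluate_operator(
    operator: Callable[[str], bool],
    docs: Sequence[str],
    expected: Sequence[bool],
    max_latency_s: float,
    min_f1: float,
) -> Dict[str, object]:
    """Run `operator` over `docs`, measuring per-document latency and F1."""
    latencies, predicted = [], []
    for doc in docs:
        start = time.perf_counter()
        predicted.append(bool(operator(doc)))
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    report: Dict[str, object] = {
        "docs_per_sec": len(docs) / total if total > 0 else float("inf"),
        "max_latency_s": max(latencies, default=0.0),
        "f1": f1_score(predicted, expected),
    }
    # The run passes only if both the latency bound and the accuracy floor hold.
    report["passed"] = (
        report["max_latency_s"] <= max_latency_s and report["f1"] >= min_f1
    )
    return report


# Hypothetical example: a keyword rule standing in for a semantic filter.
report = evaluate_operator(
    operator=lambda doc: "pain" in doc.lower(),
    docs=["Chest pain on exertion", "No issues reported"],
    expected=[True, False],
    max_latency_s=1.0,
    min_f1=0.9,
)
```

Running the same harness under sustained or bursty load, and once per implementation mode, would show whether the bounds hold in the regime the premise depends on.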

Figures

Figures reproduced from arXiv: 2604.03855 by Deepti Raghavan, Junhan Liu, Shu Chen, Ugur Cetintemel.

**Figure 2.** Semantic group-by implementations on the MiDe22

**Figure 3.** VectraFlow interactive interface. (a) NL & Config View: Natural language to executable pipeline compilation.

Original abstract

Monitoring continuous data for meaningful signals increasingly demands long-horizon, stateful reasoning over unstructured streams. However, today's LLM frameworks remain stateless and one-shot, and traditional Complex Event Processing (CEP) systems, while capable of temporal pattern detection, assume structured, typed event streams that leave unstructured text out of reach. We demonstrate VectraFlow, a semantic streaming dataflow engine, to address both gaps. VectraFlow extends traditional relational operators with LLM-powered execution over free-text streams, offering a suite of continuous semantic operators -- filter, map, aggregate, join, group-by, and window -- each with configurable throughput-accuracy tradeoffs across LLM-based, embedding-based, and hybrid implementations. Building on this, a semantic event pattern operator lifts complex event processing to unstructured document streams, combining LLM-based event extraction with NFA-based temporal rule matching for stateful reasoning over sequences of semantic events. In this demonstration, users will interact with VectraFlow's live query interface to compose semantic pipelines over clinical document streams. Attendees will compile natural language intents into executable operator graphs, inspect intermediate stateful outputs, and observe end-to-end temporal pattern detection, from raw text to matched event cohorts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VectraFlow sketches a workable architecture for LLM-augmented streaming operators on text but supplies no measurements to show the claimed tradeoffs actually work.

VectraFlow describes a streaming engine that adds LLM and embedding calls to standard relational operators so they can run continuously over free-text streams, then uses an NFA to match temporal patterns on the extracted semantic events. The main new piece is the explicit set of continuous semantic operators (filter, map, aggregate, join, group-by, window), each offered in full-LLM, embedding-only, or hybrid form, plus the direct lift of CEP-style pattern detection to unstructured documents via LLM extraction followed by NFA matching. The paper lays out how natural-language intents compile into an operator graph and how the live interface lets users inspect state and see end-to-end matches on clinical streams. That design is coherent and addresses a real gap between stateless LLM calls and rigid structured CEP systems. The description of the graph and the NFA integration is clear enough to follow without extra background.

The obvious limitation is the total lack of numbers. No latency, throughput, accuracy, or scaling results appear for any operator or for the three implementation modes, so the configurable tradeoff claim stays at the level of an architectural promise. The stress-test note is accurate on this point.

The work is aimed at database and streaming researchers who are already thinking about adding semantic layers to dataflow engines. A reader who wants design ideas or a concrete example of how the pieces could fit together will get value; anyone looking for validated performance or a ready-to-use system will not. I would bring it to a reading group to talk through the operator choices and where the latency bottlenecks are likely to sit. It deserves peer review as a system demo because the idea is well-motivated and internally consistent, even though any serious referee will ask for empirical support.

Referee Report

2 major / 2 minor

Summary. VectraFlow is presented as a semantic streaming dataflow engine that extends relational operators with LLM-powered execution over free-text streams. It defines continuous semantic operators (filter, map, aggregate, join, group-by, window) each supporting LLM-based, embedding-based, and hybrid implementations with claimed configurable throughput-accuracy tradeoffs. A semantic event pattern operator combines LLM-based event extraction with NFA-based temporal rule matching to enable complex event processing on unstructured document streams. The work is demonstrated via a live query interface allowing natural-language compilation of pipelines over clinical document streams, with inspection of intermediate state and end-to-end pattern detection.

Significance. If the architectural claims hold and the operators deliver practical tradeoffs, VectraFlow would provide a concrete bridge between stateless LLM frameworks and stateful streaming systems, extending CEP to unstructured text. The demonstration of natural-language intent compilation and live state inspection could lower barriers for semantic stream processing in domains such as clinical monitoring. The absence of any quantitative results, however, leaves the practical significance as a hypothesis rather than a demonstrated advance.

major comments (2)

[Abstract / operator suite description] The central claim that each continuous semantic operator offers 'configurable throughput-accuracy tradeoffs across LLM-based, embedding-based, and hybrid implementations' is unsupported by any latency, throughput, accuracy, or scaling measurements. No tables, figures, or experimental sections report these quantities, so the configurability assertion remains an unverified architectural promise.
[Semantic event pattern operator] The combination of LLM-based event extraction with NFA-based temporal rule matching is described at a high level, but no details are given on state management for long-horizon streams, memory bounds, or how extraction errors propagate through the NFA. This is load-bearing for the claim of lifting CEP to unstructured streams.

minor comments (2)

The manuscript would benefit from a short related-work paragraph contrasting VectraFlow with existing LLM streaming frameworks (e.g., LangChain streaming chains) and traditional CEP engines (e.g., Apache Flink CEP) to clarify the precise novelty of the hybrid operator implementations.
Figure captions and the live-demo interface description should explicitly label which implementation mode (LLM, embedding, or hybrid) is active in each illustrated pipeline to make the configurability concrete for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review. As this is a demonstration paper focused on the live query interface and operator architecture, we address the concerns about empirical support and implementation details by clarifying the demo scope and committing to targeted expansions in the revision.

Referee: Abstract and operator description: the central claim that each continuous semantic operator offers 'configurable throughput-accuracy tradeoffs across LLM-based, embedding-based, and hybrid implementations' is unsupported by any latency, throughput, accuracy, or scaling measurements. No tables, figures, or experimental sections report these quantities.

Authors: We acknowledge the absence of quantitative measurements in the current demonstration manuscript. The configurability is shown qualitatively via the live interface, where attendees select LLM/embedding/hybrid modes and observe differences in output quality and latency on clinical streams. To address the gap, the revised version will include a new 'Preliminary Evaluation' subsection with micro-benchmark results (throughput in docs/sec, accuracy via F1 on labeled subsets) for the core operators under varying configurations.
Revision: yes
Referee: Semantic event pattern operator: the combination of LLM-based event extraction with NFA-based temporal rule matching is described at a high level, but no details are given on state management for long-horizon streams, memory bounds, or how extraction errors propagate through the NFA.

Authors: We agree more detail is required for the long-horizon claim. The revision will expand this section with: (1) state management via bounded NFA with windowed history and confidence-based path pruning; (2) explicit memory bounds enforced by configurable max-state size and eviction policies; (3) error propagation modeled as weighted transitions where LLM extraction confidence modulates transition probabilities, allowing tolerance of noisy extractions without state explosion. We will add pseudocode and a state-transition diagram.
Revision: yes
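
The three mechanisms named in this response can be sketched together in a few lines. The version below is one possible interpretation under stated assumptions, not the authors' design: a run's score is taken to be the product of the extraction confidences of its events, runs whose score would drop below `min_score` are not extended, runs older than `within` are dropped, and at most `max_runs` partial runs are retained, evicting the lowest-scoring first. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ScoredEvent:
    kind: str
    ts: float
    confidence: float  # assumed extraction confidence in [0, 1]


@dataclass
class Run:
    events: List[ScoredEvent]
    score: float  # product of the confidences of the events in the run


class BoundedMatcher:
    """Sequence matcher with windowed history, pruning, and a state cap."""

    def __init__(self, pattern: Sequence[str], within: float,
                 min_score: float = 0.5, max_runs: int = 100) -> None:
        self.pattern = list(pattern)
        self.within = within
        self.min_score = min_score
        self.max_runs = max_runs
        self.runs: List[Run] = []

    def push(self, ev: ScoredEvent) -> List[Tuple[Tuple[ScoredEvent, ...], float]]:
        """Feed one event (in timestamp order); return completed matches."""
        matches = []
        # (1) Windowed history: drop runs that started too long ago.
        live = [r for r in self.runs if ev.ts - r.events[0].ts <= self.within]
        survivors: List[Run] = []
        for run in live + [Run([], 1.0)]:  # the empty run is the start state
            if self.pattern[len(run.events)] == ev.kind:
                score = run.score * ev.confidence
                if score < self.min_score:
                    # (3) Confidence-based pruning: do not take this
                    # transition, but keep the run for a later event.
                    if run.events:
                        survivors.append(run)
                    continue
                extended = Run(run.events + [ev], score)
                if len(extended.events) == len(self.pattern):
                    matches.append((tuple(extended.events), score))
                else:
                    survivors.append(extended)
            elif run.events:
                survivors.append(run)
        # (2) Memory bound: keep only the highest-scoring partial runs.
        survivors.sort(key=lambda r: r.score, reverse=True)
        self.runs = survivors[: self.max_runs]
        return matches


matcher = BoundedMatcher(["fever", "rash"], within=10, min_score=0.5, max_runs=2)
first = matcher.push(ScoredEvent("fever", 0, 0.9))
noisy = matcher.push(ScoredEvent("rash", 1, 0.4))   # 0.9 * 0.4 falls below 0.5
found = matcher.push(ScoredEvent("rash", 2, 0.8))   # 0.9 * 0.8 clears the bar
```

With these settings the low-confidence rash at time 1 is ignored and the run completes on the more confident rash at time 2. Because the number of retained runs never exceeds `max_runs`, memory stays bounded regardless of stream length, at the cost of possibly evicting a run that would later have matched.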

Circularity Check

0 steps flagged

No circularity: purely descriptive system architecture without derivations or self-referential claims

The manuscript describes VectraFlow's operator suite and event processing capabilities but provides no equations, parameter fittings, or derivation steps. Claims regarding throughput-accuracy tradeoffs are presented as design features rather than results derived from prior steps within the paper. No self-citations or uniqueness theorems are used to justify core components. The system is self-contained as an engineering demonstration, with no reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: VectraFlow extends traditional relational operators with LLM-powered execution over free-text streams, offering a suite of continuous semantic operators -- filter, map, aggregate, join, group-by, and window -- each with configurable throughput-accuracy tradeoffs across LLM-based, embedding-based, and hybrid implementations.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: sem_pattern ... combining LLM-based event extraction with NFA-based temporal rule matching for stateful reasoning over sequences of semantic events.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] Daniel J. Abadi, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003. Aurora: A New Model and Architecture for Data Stream Management. The VLDB Journal 12, 2 (2003), 120–139.

[2] Josh Achiam, Scott Adler, Sandhini Agarwal, Liane Ahmad, Ilge Akkaya, Felipe L. Aleman, Daniel Almeida, Johannes Altenschmidt, Sam Altman, Shantanu Anadkat, and Rafael Avila. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).

[3] Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil Immerman. 2008. Efficient Pattern Matching over Event Streams. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 147–160.

[4] Mert Akdere, Uğur Çetintemel, and Nesime Tatbul. 2008. Plan-Based Complex Event Detection across Distributed Sources. PVLDB 1, 1 (2008), 66–77.

[5] Apache Flink. 2024. FlinkCEP - Complex Event Processing. https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/libs/cep/. Accessed: 2024.

[6] Shu Chen, Deepti Raghavan, and Uğur Çetintemel. 2025. Continuous Prompts: LLM-Augmented Pipeline Processing over Unstructured Streams. arXiv preprint arXiv:2512.03389 (2025). https://arxiv.org/abs/2512.03389

[7] EsperTech. 2023. Esper Reference Documentation - Event Pattern Operators. http://esper.espertech.com/release-9.0.0/reference-esper/html/event_patterns.html. Accessed: 2024.

[8] Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. 2020. MIMIC-IV. https://physionet.org/content/mimiciv/1.0/. Accessed: 2021-08-23.

[9] Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, and Gerardo Vitagliano. 2025. Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. In Proceedings of the 15th Conference on Innovative Data Systems Research (CIDR).

[10] Duo Lu, Siming Feng, Jonathan Zhou, Franco Solleza, Malte Schwarzkopf, and Uğur Çetintemel. 2025. VectraFlow: Integrating Vectors into Stream Processing. In Proceedings of the 15th Annual Conference on Innovative Data Systems Research.

[11] Liana Patel, Siddharth Jha, Melissa Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, and Matei Zaharia. 2025. Semantic operators and their optimization: Enabling LLM-powered analytics. Proceedings of the VLDB Endowment (PVLDB) 18, 3 (2025), 4171–4184. https://doi.org/10.14778/3749646.3749685

[12] Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, and Eugene Wu. 2025. DocETL: Agentic query rewriting and evaluation for complex document processing. Proceedings of the VLDB Endowment 18, 9 (2025). https://doi.org/10.14778/3746405.3746426

[13] Snowflake Inc. 2024. MATCH_RECOGNIZE: Snowflake Documentation. https://docs.snowflake.com/en/sql-reference/constructs/match_recognize. Accessed: 2024.

[14] Cagri Toraman, Oguzhan Ozcelik, Furkan Sahinuç, and Fazli Can. 2024. MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, 11283–11295. https://aclanthology.org/2024.lrec-main.986

[15] Eugene Wu, Yanlei Diao, and Shariq Rizvi. 2006. High-Performance Complex Event Processing over Streams. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 407–418.

[16] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chao Gao, Cheng Huang, Chen Lv, Chen Zheng, and Zhenyu Qiu. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025).
