From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models
Pith reviewed 2026-05-15 16:12 UTC · model grok-4.3
The pith
Streaming LLMs gain a unified definition based on continuous data flow and dynamic interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By grounding the definition of streaming LLMs in data flow and dynamic interaction, the authors separate streaming generation, streaming inputs, and interactive streaming architectures, then organize current methods into a taxonomy that reveals their shared mechanisms and distinct trade-offs.
What carries the argument
The unified definition of streaming LLMs organized by data flow and dynamic interaction, which supplies the organizing principle for the taxonomy of methodologies.
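One way to picture the three categories is as a small classifier over data-flow properties. The sketch below is illustrative only: the names `StreamingMode`, `DataFlow`, and `classify`, and the rule that maps the flags to a category, are assumptions made for this example and are not the survey's formal definition.

```python
from dataclasses import dataclass
from enum import Enum


class StreamingMode(Enum):
    STATIC = "static inference"
    STREAMING_GENERATION = "streaming generation"
    STREAMING_INPUT = "streaming input"
    INTERACTIVE_STREAMING = "interactive streaming"


@dataclass(frozen=True)
class DataFlow:
    # Hypothetical flags describing how data moves through a system.
    incremental_input: bool    # input arrives piece by piece
    incremental_output: bool   # output is emitted piece by piece
    concurrent: bool = False   # reading and writing overlap in time


def classify(flow: DataFlow) -> StreamingMode:
    # Assumed rule: interaction requires both directions to stream concurrently.
    if flow.incremental_input and flow.incremental_output and flow.concurrent:
        return StreamingMode.INTERACTIVE_STREAMING
    if flow.incremental_input:
        return StreamingMode.STREAMING_INPUT
    if flow.incremental_output:
        return StreamingMode.STREAMING_GENERATION
    return StreamingMode.STATIC
```

Under this assumed rule, a chat model that only streams tokens out is labeled streaming generation, while a system that reads and writes at the same time is labeled interactive streaming. A real system may need finer-grained criteria than three boolean flags.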
If this is right
- Distinguishes streaming generation from interactive streaming so that new systems can be designed with the right data-flow properties in mind.
- Provides a shared vocabulary that lets researchers compare methods that previously appeared unrelated.
- Highlights gaps in coverage of real-world dynamic scenarios that future work can target.
- Supports more consistent evaluation protocols across studies of live LLM applications.
Where Pith is reading between the lines
- The taxonomy could be tested by applying it to recently released models not included in the survey to check whether new interaction patterns still fit.
- It may help identify which streaming techniques transfer most readily to multimodal settings such as video or audio streams.
- Standardized categories could accelerate the creation of benchmarks that measure responsiveness under continuous input rather than one-shot accuracy.
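The benchmark idea in the last bullet could start from a simple lag measurement. This is a minimal sketch under stated assumptions: the function name `mean_response_lag`, the timestamp format (seconds as floats), and the metric itself (delay between each input chunk's arrival and the first output emitted at or after it) are hypothetical and do not come from the survey.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def mean_response_lag(
    input_times: Sequence[float],
    output_times: Sequence[float],
) -> Optional[float]:
    """Average delay from each input arrival to the next emitted output.

    Inputs with no later output are skipped. Returns None when no input
    has a matching output.
    """
    outputs = sorted(output_times)
    lags = []
    for arrival in input_times:
        # Index of the first output emitted at or after this arrival.
        i = bisect_left(outputs, arrival)
        if i < len(outputs):
            lags.append(outputs[i] - arrival)
    return sum(lags) / len(lags) if lags else None
```

A call such as `mean_response_lag([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])` pairs each arrival with the output that follows it. A production benchmark would likely also score output quality, which this sketch ignores.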
Load-bearing premise
That existing definitions are fragmented enough to need unifying, and that a single taxonomy built on data flow and dynamic interaction can cover them without omitting important cases or creating new overlaps.
What would settle it
A published streaming LLM whose core operation cannot be placed in any category of the proposed taxonomy or whose behavior is better explained by a different organizing axis such as latency targets alone.
Original abstract
Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys streaming Large Language Models (LLMs) as an emerging paradigm for dynamic, real-time scenarios, contrasting them with static inference in standard LLMs. It identifies fragmentation in prior definitions that conflate streaming generation, inputs, and interactive architectures. The central contribution is a unified definition of streaming LLMs grounded in data flow and dynamic interaction, which underpins a proposed systematic taxonomy. The manuscript analyzes underlying methodologies in depth, explores real-world applications, outlines promising research directions, and maintains a continuously updated GitHub repository of relevant papers.
Significance. If the unified definition and taxonomy hold up under scrutiny, this survey would provide a valuable organizational structure for an emerging subfield, helping to resolve definitional ambiguities and guide coherent future work on dynamic LLM interactions. The combination of methodological discussion, application coverage, and the live repository strengthens its utility as a reference resource for researchers working on streaming intelligence.
minor comments (2)
- The abstract states that the taxonomy is 'systematic' and the discussion of methodologies is 'in-depth,' but without explicit criteria for inclusion/exclusion of papers or a count of surveyed works, readers cannot easily assess the taxonomy's completeness or potential gaps.
- The repository link is provided, but the manuscript does not describe its structure, update process, or how new papers will be categorized according to the proposed taxonomy; adding this would improve reproducibility and long-term value.
Simulated Author's Rebuttal
We thank the referee for their positive and constructive review, which accurately summarizes our contributions in providing a unified definition of streaming LLMs based on data flow and dynamic interaction, along with a systematic taxonomy, methodological analysis, applications, and the live GitHub repository. We appreciate the recommendation for minor revision.
Circularity Check
No significant circularity
The paper is a survey establishing a unified definition of streaming LLMs organized around data flow and dynamic interaction, followed by taxonomy and applications discussion. No mathematical derivations, equations, fitted parameters, predictions, or self-citation chains appear in the provided text or abstract. The central claim is organizational rather than deductive, with no load-bearing step that reduces by construction to its own inputs under any of the enumerated patterns. The derivation is self-contained as a literature review.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard Large Language Models are predominantly designed for static inference with pre-defined inputs.
Forward citations
Cited by 1 Pith paper
- Engagement Process: Rethinking the Temporal Interface of Action and Observation
  Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.