pith. machine review for the scientific record.

arxiv: 2605.10343 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords streaming video understanding · VideoLLM · self-evolution · RealStreamEval · interaction policy · data-efficient tuning · offline to streaming adaptation · response timing

The pith

Offline video models can become effective streaming assistants through self-evolution using just 1,000 self-generated samples and no architecture changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that strong offline video-language models already hold useful visual understanding but lack a policy for deciding when to respond during live streams. It introduces RealStreamEval, a frame-level multi-turn protocol that feeds models sequential observations and penalizes unnecessary replies. To supply the missing policy, EvoStreaming lets the base model generate its own streaming trajectories, annotate relevance, and simulate roll-outs without any external data or supervision. This self-evolution raises overall RealStreamEval scores by up to 10.8 points across five different VideoLLM backbones while keeping offline video performance largely intact, using 139 times less data than prior streaming tuning methods. A reader cares because the result points to a practical route for turning existing static models into responsive, real-time assistants.

Core claim

Strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Using the base model itself as data generator, relevance annotator, and roll-out policy, EvoStreaming synthesizes streaming trajectories without external supervision. With only 1,000 self-generated samples and no architectural changes, the approach improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones while largely preserving offline video performance.

What carries the argument

EvoStreaming, a self-evolved streaming adaptation framework in which the base model acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories.
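
To make the division of roles concrete, here is a minimal sketch of the kind of loop the paper describes, assuming a generic chat-style VideoLLM interface. Every helper name (generate_question, annotate_relevance, rollout, fine_tune) is an illustrative placeholder, not the paper's API, and the real pipeline adds the five data-generation stages shown in Figure 3.

```python
# Hypothetical sketch of EvoStreaming-style self-evolution. The same base
# model plays all three roles, so no external data or annotator is needed.
# All helper names are illustrative placeholders, not the paper's code.

def self_evolve(base_model, videos, fine_tune, budget=1000):
    trajectories = []
    for video in videos:
        # Role 1: data generator -- the model writes a streaming question.
        question = base_model.generate_question(video)
        turns = []
        for frame in video.frames():
            # Role 2: relevance annotator -- does this frame bear on the question?
            if base_model.annotate_relevance(frame, question):
                # Role 3: roll-out policy -- produce a timed response.
                turns.append((frame.timestamp, base_model.rollout(frame, question)))
            else:
                turns.append((frame.timestamp, "SILENT"))  # explicit non-response
        trajectories.append((question, turns))
        if len(trajectories) >= budget:  # the paper stops at 1,000 samples
            break
    # Supervised fine-tuning on the self-generated data; architecture unchanged.
    return fine_tune(base_model, trajectories)
```

The explicit SILENT turns are the crux: they turn "when not to speak" into ordinary supervised targets, which is how a timing policy can be learned without touching the architecture.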

If this is right

  • VideoLLMs can acquire effective streaming interaction policies through self-generated data rather than large external datasets.
  • The timing and verbosity policy can be improved separately from the core visual understanding already present in offline models.
  • The self-evolution method generalizes across multiple open VideoLLM architectures including Qwen2/2.5/3-VL, InternVL-3.5, and MiniCPM-V4.5.
  • The approach uses 139 times less data than the leading prior streaming instruction-tuning approach while maintaining offline capabilities.
  • Existing offline models can be turned into streaming assistants without any model architecture modifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Self-evolution loops of this kind might allow models to iteratively refine their own streaming policies across multiple rounds of data generation.
  • The same principle could apply to other real-time decision tasks in multimodal systems such as audio or image sequences.
  • Standard offline benchmarks may hide interaction weaknesses that only appear under frame-by-frame evaluation protocols.
  • This approach reduces dependence on human-annotated streaming data for adapting future video models.

Load-bearing premise

The base model's self-generated streaming trajectories, relevance annotations, and roll-out policies are sufficiently accurate and unbiased to train an improved interaction policy without external supervision or validation.

What would settle it

If human-verified response timings on new live video streams show no improvement or actual worsening in the adapted models' timing decisions compared with the original offline versions, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.10343 by Boxue Yang, Chenfei Liao, Jiajie Huang, Junlong Ke, Junxi Wang, Linfeng Zhang, Xuyang Liu, Zichen Wen.

Figure 1
Figure 1: EvoStreaming lets a VideoLLM teach itself when to speak in a video stream (top), evaluated by RealStreamEval, which scores both correctness and response timing (bottom). Streaming adaptation is costly in practice. Architecture-oriented methods add streaming-specific modules and re-train for online interaction (Zhang et al., 2024a; Qian et al., 2025; Chen et al., 2024a), which incurs substantial engineeri… view at source ↗
Figure 2
Figure 2: Overview of RealStreamEval. We first align questions to sampled frames, then run strict online inference where the model decides when to respond, and finally score responses with a verbosity penalty to discourage redundant outputs. EvoStreaming materializes this insight by preserving the base architecture and shifting the adaptation burden to a self-evolved timing policy: the base model itself generates 1,… view at source ↗
Figure 3
Figure 3: Multi-stage data generation pipeline. Raw videos are converted into streaming training data through five stages: taxonomy balancing, type-consistent question construction, segmentation with relevance annotation, sliding-window pruning under partial observations, and conversation standardization. Each stage reuses the base model M in a different role, so no external annotator is … view at source ↗
Figure 4
Figure 4: Streaming performance analysis. Left: EvoStreaming improves overall OVOBench performance across base models. Right: EvoStreaming improves Forward Active Responding, i.e., deciding when proactive responses are needed. view at source ↗
Figure 5
Figure 5: EvoStreaming reduces verbosity while improving accuracy. (a) Average tokens per turn for three Qwen-family backbones, with 6×–28× fewer tokens. (b) Accuracy gains per backbone from vanilla to EvoStreaming. Detailed numbers are in the Appendix. view at source ↗
Figure 6
Figure 6: Dataset distribution overview. Left: task distribution; Right: video duration. view at source ↗
Figure 7
Figure 7: A case of Forward Active Responding. The model remains silent during most… view at source ↗
Figure 8
Figure 8: A case of Backward Tracing. The model maintains silence while observing, then… view at source ↗
Figure 9
Figure 9: A case of Real-Time Visual Perception. The model provides immediate answers to… view at source ↗
Original abstract

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
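
Read literally, the protocol implies a per-frame scoring loop of roughly the following shape. This is a sketch under assumptions: the judge interface and the verbosity-penalty weight are invented for illustration, and the paper's exact scoring rule is only summarized in the Figure 2 caption.

```python
# Illustrative frame-level multi-turn evaluation in the spirit of RealStreamEval.
# `judge` and `lambda_verbosity` are assumed names, not the published protocol.

def evaluate_stream(model, frames, question, judge, lambda_verbosity=0.1):
    score, history = 0.0, []
    for frame in frames:
        reply = model.step(frame, question)     # the model decides when to speak
        if reply == "SILENT":
            continue                            # silence costs nothing by itself
        if judge.is_redundant(reply, history):  # repeated content draws the penalty
            score -= lambda_verbosity
        else:
            score += judge.score_answer(reply, question, frame.timestamp)
        history.append(reply)
    return score
```

The key difference from prior benchmarks is that the decision to respond lives inside `model.step`, not in the evaluator, which is exactly the timing policy the paper says offline models lack.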

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RealStreamEval, a frame-level multi-turn evaluation protocol for streaming video understanding that exposes models to sequential frame observations and penalizes unnecessary responses, and EvoStreaming, a self-evolved adaptation framework in which the base VideoLLM serves as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. The central claim is that fine-tuning on only 1,000 such self-generated samples (139× fewer than prior streaming instruction-tuning methods) and without architectural changes yields up to 10.8-point gains on RealStreamEval across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance.

Significance. If the gains prove robust, the work demonstrates a practical route for data-efficient interaction tuning of existing VideoLLMs into streaming assistants. Strengths include the small data regime, preservation of offline capabilities, evaluation on multiple diverse backbones, and the shift to an internalized timing decision in the benchmark. These elements could encourage further exploration of self-supervised policy learning for interactive multimodal models.

major comments (2)
  1. [EvoStreaming framework and data synthesis] The data-generation procedure has the base model acting simultaneously as generator, annotator, and policy. Because the abstract states that strong offline VideoLLMs lack an interaction policy for deciding when to respond, the 1,000 self-generated trajectories, relevance annotations, and roll-out policies are produced by a model without the very capability being learned. This is load-bearing for the 10.8-point claim; without external validation (human checks, held-out annotations, or comparison to supervised trajectories), measured improvements could reflect better self-imitation of the model's initial suboptimal timing rather than acquisition of a genuinely improved policy.
  2. [Experiments] The reported gains are presented as consistent across five backbones, yet no standard deviations, multiple random seeds, or statistical significance tests are mentioned for the RealStreamEval scores. Given that both the training signal and the new evaluation protocol are introduced in the same work, this omission makes it difficult to assess whether the maximum 10.8-point improvement is stable or sensitive to data-generation choices.
minor comments (2)
  1. [Abstract] The '139× less' comparison in the abstract would be clearer with an explicit citation or footnote stating the sample count of the leading streaming instruction-tuning baseline being referenced.
  2. [Results tables/figures] Figure or table captions that display per-backbone RealStreamEval scores should include the offline video performance numbers side-by-side to make the 'largely preserving' claim immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments raise important points about the self-evolution mechanism and experimental robustness. We address each major comment below with clarifications and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [EvoStreaming framework and data synthesis] The data-generation procedure has the base model acting simultaneously as generator, annotator, and policy. Because the abstract states that strong offline VideoLLMs lack an interaction policy for deciding when to respond, the 1,000 self-generated trajectories, relevance annotations, and roll-out policies are produced by a model without the very capability being learned. This is load-bearing for the 10.8-point claim; without external validation (human checks, held-out annotations, or comparison to supervised trajectories), measured improvements could reflect better self-imitation of the model's initial suboptimal timing rather than acquisition of a genuinely improved policy.

    Authors: We appreciate the referee's careful reading of the tension between the base model's limitations and its role in data synthesis. While the abstract notes that offline VideoLLMs lack a dedicated streaming interaction policy, these models still encode substantial visual-linguistic knowledge that can be elicited via prompting to produce candidate responses and timing decisions. EvoStreaming exploits this by having the model generate diverse trajectories, annotate relevance, and select roll-outs that improve the balance between responsiveness and verbosity. The resulting fine-tuning signal is therefore not pure self-imitation of the original policy; the 10.8-point gains on RealStreamEval, achieved while largely preserving offline performance across five distinct backbones, indicate that the synthesized data teaches a more effective timing strategy. To make this argument more transparent, we will add a dedicated paragraph in the method section explaining the prompting strategy used for policy roll-out and include qualitative examples contrasting base-model and EvoStreaming timing decisions on the same video streams. revision: partial

  2. Referee: [Experiments] The reported gains are presented as consistent across five backbones, yet no standard deviations, multiple random seeds, or statistical significance tests are mentioned for the RealStreamEval scores. Given that both the training signal and the new evaluation protocol are introduced in the same work, this omission makes it difficult to assess whether the maximum 10.8-point improvement is stable or sensitive to data-generation choices.

    Authors: We agree that the absence of variability measures and statistical tests limits the strength of the empirical claims. In the revised manuscript we will rerun the data-generation and fine-tuning pipelines with at least three random seeds for each backbone, report mean and standard deviation on RealStreamEval, and add paired statistical significance tests (with p-values) between base and EvoStreaming models. These additions will directly address concerns about stability and sensitivity to data-generation choices. revision: yes
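
If the authors follow through, the per-backbone analysis they promise reduces to a paired comparison over seeds. A minimal sketch with scipy, where the score arrays are placeholders to be filled with the rerun RealStreamEval results (no numbers are reported here):

```python
# Paired seed-level comparison of base vs. EvoStreaming RealStreamEval scores.
# One entry per random seed; the arrays are placeholders, not reported results.
import numpy as np
from scipy import stats

def paired_report(scores_base: np.ndarray, scores_evo: np.ndarray) -> dict:
    diff = scores_evo - scores_base
    t_stat, p_value = stats.ttest_rel(scores_evo, scores_base)  # paired t-test
    return {
        "mean_gain": float(diff.mean()),
        "std_gain": float(diff.std(ddof=1)),  # needs >= 2 seeds
        "p_value": float(p_value),
    }
```

With only three seeds per backbone a nonparametric alternative (e.g., a Wilcoxon signed-rank test pooled across backbones) may be more defensible than a t-test.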

Circularity Check

1 step flagged

Self-evolution derives training trajectories, annotations, and roll-out policies from the base model itself, so measured gains on RealStreamEval may reduce to better mimicry of the model's initial outputs.

specific steps
  1. self-definitional [Abstract]
    "we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only 1,000 self-generated samples (139× less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones"

    The interaction policy being improved is trained exclusively on trajectories, annotations, and roll-out decisions produced by the identical base model that the abstract states 'lack[s] an interaction policy for deciding when to respond.' The measured improvement is therefore generated from the model's own initial outputs rather than from any external source of correct timing behavior.

full rationale

The paper's central derivation is: offline VideoLLMs lack an interaction policy (observed under RealStreamEval) → EvoStreaming lets the same base model generate its own 1,000 streaming trajectories, relevance annotations, and roll-out policies without external supervision → fine-tune the model on this self-data → report up to 10.8-point RealStreamEval gains while preserving offline performance. This chain is partially circular because the data-generation step (which supplies the only training signal) is performed by the identical model whose policy deficiencies are being corrected; no independent oracle, human validation, or external dataset breaks the loop. RealStreamEval is an independent benchmark and therefore supplies partial grounding, but it does not falsify the possibility that the fine-tuned policy simply learns to reproduce the base model's own (suboptimal) timing decisions more consistently. The 139× data-efficiency claim and the 'no architectural changes' statement do not alter the self-referential nature of the training signal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The method rests on the assumption that the base model can reliably self-supervise its own improvement and on two newly introduced components whose validity is not independently verified outside the paper.

free parameters (1)
  • number of self-generated samples = 1000
    The choice of exactly 1000 samples is presented as a hyperparameter demonstrating efficiency but is not derived from first principles.
axioms (1)
  • domain assumption The base VideoLLM can generate high-quality streaming trajectories and accurate relevance annotations without external supervision.
    Invoked as the core mechanism enabling data-efficient adaptation.
invented entities (2)
  • RealStreamEval no independent evidence
    purpose: Frame-level multi-turn evaluation protocol that penalizes unnecessary responses
    New benchmark introduced to measure streaming behavior.
  • EvoStreaming no independent evidence
    purpose: Self-evolved streaming adaptation framework
    New training procedure proposed in the paper.

pith-pipeline@v0.9.0 · 5560 in / 1493 out tokens · 76777 ms · 2026-05-12T04:59:39.812269+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    This matches the consistent gains we observe across Qwen2/2.5/3-VL, InternVL-3.5, and MiniCPM-V4.5 in Table 2

    Strong-encoder regime. When the base model is a competent VideoLLM on the target domain, ϵ_V is small and the noise-inflation factor 1/(1 − 2ϵ_V)² stays close to 1, so a small budget n ≈ 10³ already activates the timing policy. This matches the consistent gains we observe across Qwen2/2.5/3-VL, InternVL-3.5, and MiniCPM-V4.5 in Table 2 (a numerical sketch of this factor follows after this list)

  2. [2]

    when to speak

    Weak-encoder regime. As ϵ_V → 1/2 (e.g., medical or industrial video that is far from the encoder’s pretraining distribution), the noise-inflation factor diverges and self-evolution must be paired with external supervision before becoming effective. This is consistent with the limitations we discuss in Section C.3, and is, to our knowledge, the first quanti...

  3. [3]

    Analyze the sample captions and identify which question has the most relevant YES evidence across segments

  4. [4]

    Prefer questions with concrete, observable information rather than ambiguous or N/A evidence

  5. [5]

    Output format. Return JSON with selected question idx, an initial task prompt for tracking that question, and a brief reasoning field

    For Temporal Aggregation, prefer repeated actions with count information; for Dynamic Event Description, prefer step-by-step processes; for Anticipatory Monitoring, prefer changing states or reveal chains. Output format. Return JSON with selected question idx, an initial task prompt for tracking that question, and a brief reasoning field. Table 14: Prompt ...

  6. [6]

    Focus only on annotation lines corresponding to the tracked question

  7. [7]

    Check the last response to avoid repeating information already reported

  8. [8]

    Decide whether there is enough new evidence to produce an update; otherwise return SILENT

  9. [9]

    Task-specific response rule

    Balance responsiveness and sparsity: avoid spamming, but do not stay silent if the user would miss a key count, step transition, or reveal by waiting. Task-specific response rule. • Temporal Aggregation (TA): respond when one or more new repetitions are completed, maintaining the running count from the last response. • Dynamic Event Description (DED): respo...

  10. [10]

    Carefully compare the model’s answer with the ground truth answer

  11. [11]

    Determine if the model’s answer is correct

  12. [12]

    For multiple choice questions, check if the model selected the correct option (either by letter or by content)

  13. [13]

    Respond with a JSON object in exactly this format: { "correct": true or false, "reasoning": "Brief explanation of your judgment" } Only output the JSON object, nothing else

    For open-ended questions, check if the model’s answer captures the same meaning as the ground truth. Respond with a JSON object in exactly this format: { "correct": true or false, "reasoning": "Brief explanation of your judgment" } Only output the JSON object, nothing else. F.2 Repetition Detection Judge for Penalty This prompt is used to implement the rep...

  14. [14]

    Analyze the Context provided above

  15. [15]

    Determine if the agent repeats the answer to the question multiple times unnecessarily within this context

  16. [16]

    I see red

    If the answer appears more than once (e.g., ’It is red. I see red. It is red’), mark it as repeated

  17. [17]

    Respond with a JSON object in exactly this format: { "is repeated": true or false, "reasoning": "Why you think it is repeated or not" } Only output the JSON object

    If the agent answers once and then stays silent or moves to the next topic, it is NOT repeated. Respond with a JSON object in exactly this format: { "is repeated": true or false, "reasoning": "Why you think it is repeated or not" } Only output the JSON object. Table 17: Prompt template used to determine whether a CRR output contains a substantive response ...

  18. [18]

    Evaluate how well the model’s response matches the expected answer

  19. [19]

    F.4.2 LLM Accuracy Judge Prompt for SSR Task. Table 20 shows the SSR stage consistency judge used for Forward Active Responding

    Respond with ONLY a score between 0.0 and 0.5, where: • 0.5 = Good match with minor differences • 0.3 = Related but somewhat different • 0.0 = Completely wrong or irrelevant. Only output the numerical score. F.4.2 LLM Accuracy Judge Prompt for SSR Task. Table 20 shows the SSR stage consistency judge used for Forward Active Responding. Table 20: Prompt t...

  20. [20]

    F.4.3 LLM Accuracy Judge Prompt for REC Task. Table 21 shows the REC count consistency judge used for Forward Active Responding

    Respond with ONLY a score between 0.0 and 0.5, where: • 0.5 = Perfect stage match, fully consistent • 0.2 = Wrong stage but somewhat related activity • 0.0 = Completely wrong stage or irrelevant. Only output the numerical score. F.4.3 LLM Accuracy Judge Prompt for REC Task. Table 21 shows the REC count consistency judge used for Forward Active Responding....

  21. [21]

    Determine if the model correctly reports that the activity has occurred {expected count} time(s)

  22. [22]

    Who did I talk to?

    Respond with ONLY a score between 0.0 and 0.5, where: • 0.5 = Count is explicitly correct • 0.3 = Count is approximately correct (off by 1) • 0.0 = No count mentioned or completely wrong. Only output the numerical score. determines whether a candidate response matches the expected answer within the protocol-defined temporal context, and whether repeated r...
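
The noise-inflation factor quoted in items 1 and 2 of this list is simple enough to check numerically. A minimal sketch, assuming the factor is exactly 1/(1 − 2ϵ_V)² as the excerpts state (the paper's derivation is not reproduced on this page):

```python
# Noise-inflation factor 1 / (1 - 2*eps_v)^2 from the excerpts above.
# eps_v is the base encoder's error rate on the target domain.

def noise_inflation(eps_v: float) -> float:
    if not 0.0 <= eps_v < 0.5:
        raise ValueError("factor is finite only for 0 <= eps_v < 0.5")
    return 1.0 / (1.0 - 2.0 * eps_v) ** 2

print(noise_inflation(0.05))  # strong-encoder regime: ~1.23, n ~ 10^3 suffices
print(noise_inflation(0.45))  # weak-encoder regime: 100.0, self-data alone fails
```

The divergence as ϵ_V → 1/2 is the quantitative form of the weak-encoder caveat: on out-of-distribution video, self-evolution would need external supervision before the small-budget claim applies.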