Flash-vstream: Memory-based real-time understanding for long video streams

Flash-vstream: Memory-based real-time understanding for long video streams · 2024 · arXiv 2406.08085

25 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · baseline: 1

citation-polarity summary

background: 2 · baseline: 1

representative citing papers

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

cs.CV · 2026-06-02 · accept · novelty 7.0

OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · conditional · novelty 7.0

Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.

Don't Pause! Every prediction matters in a streaming video

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

cs.CV · 2026-03-02 · unverdicted · novelty 7.0

MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

cs.CV · 2026-02-24 · unverdicted · novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

From Priors to Perception: Grounding Video-LLMs in Physical Reality

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

cs.CV · 2026-01-21 · unverdicted · novelty 6.0

HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM, trained end-to-end on the new Streamo-Instruct-465K dataset, that unifies multiple real-time video tasks and is claimed to show strong temporal reasoning and generalization.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

cs.CV · 2025-05-21 · unverdicted · novelty 6.0

LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

cs.CV · 2024-11-04 · unverdicted · novelty 6.0

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

cs.CV · 2024-08-19 · unverdicted · novelty 6.0

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.

Linear Scaling Video VLMs for Long Video Understanding

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

cs.MA · 2026-05-16 · unverdicted · novelty 5.0

PyraVid is a hierarchical multimodal memory system that structures long videos into pyramids to improve long-horizon reasoning and evidence aggregation.

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

LATERN reformulates video anomaly detection as temporal evidence aggregation via context-aware scoring (CEA) and recursive aggregation (REA) to improve accuracy and coherence for frozen VLMs on benchmarks like UCF-Crime.

Decouple and Cache: KV Cache Construction for Streaming Video Understanding

cs.CV · 2026-05-03 · unverdicted · novelty 5.0

DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.

StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

StreamMeCo compresses agent memory by 70% in streaming video understanding, yielding 1.87x faster retrieval and 1% higher average accuracy on benchmarks.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA 3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

citing papers explorer

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · conditional · none · ref 27

Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · none · ref 87

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
