SurveillanceVQA-589K: A benchmark for comprehensive surveillance video-language understanding with large models
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3
representative citing papers
CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.
citing papers explorer
- MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
  MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.
- CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
  CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
- ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
  ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.