QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection
Pith reviewed 2026-05-13 20:17 UTC · model grok-4.3
The pith
An agentic framework lets smaller vision-language models reach state-of-the-art video anomaly detection by iteratively refining queries through dynamic dialogue with a language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QVAD treats VLM-LLM interaction as an ongoing dialogue in which the LLM agent updates prompts on the basis of accumulating visual context. The resulting prompt-updating mechanism elicits high-fidelity captions and precise anomaly reasoning from lightweight VLMs without parameter changes, delivering state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal while using only a fraction of the parameters required by prior methods and maintaining high inference speed with small memory footprints.
What carries the argument
The LLM agent's iterative query-refinement loop, which updates prompts dynamically from visual feedback to steer the VLM.
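As a rough illustration of the loop described above, the following Python sketch alternates between a VLM answering a question about a clip and an LLM proposing the next question from the accumulated answers. It is a minimal sketch under stated assumptions, not the authors' implementation: the function `refine_and_score`, the callables `vlm`, `ask`, and `score`, the `max_rounds` cap, and the toy stand-ins are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   vlm(clip, question) -> answer text from the vision-language model
#   ask(history)        -> next clarifying question from the LLM agent
#   score(history)      -> anomaly score in [0, 1] from the LLM agent
Dialogue = List[Tuple[str, str]]
VLM = Callable[[str, str], str]
Ask = Callable[[Dialogue], str]
Score = Callable[[Dialogue], float]


def refine_and_score(clip: str, vlm: VLM, ask: Ask, score: Score,
                     first_question: str = "What is happening in this clip?",
                     max_rounds: int = 3) -> Tuple[float, Dialogue]:
    """Run a question/answer dialogue about one clip, then score it."""
    history: Dialogue = []
    question = first_question
    for round_idx in range(max_rounds):
        answer = vlm(clip, question)        # VLM answers the current question
        history.append((question, answer))  # visual context accumulates
        if round_idx + 1 < max_rounds:
            question = ask(history)         # LLM refines the next question
    return score(history), history


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_vlm(clip: str, question: str) -> str:
    return f"[{clip}] answer to: {question}"


def toy_ask(history: Dialogue) -> str:
    return f"follow-up question {len(history)}"


def toy_score(history: Dialogue) -> float:
    return min(1.0, 0.25 * len(history))


score_value, dialogue = refine_and_score(
    "clip_000", toy_vlm, toy_ask, toy_score, max_rounds=2)
print(score_value)
for q, a in dialogue:
    print(q, "->", a)
```

The fixed round cap is a choice made for this sketch only; the summary above does not say how the dialogue terminates.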
If this is right
- State-of-the-art accuracy on UCF-Crime, XD-Violence, and UBNormal with far fewer parameters than competing methods.
- High inference speed and minimal memory use, which together support deployment on edge devices.
- Strong transfer to the single-scene ComplexVAD dataset without retraining.
- No need for parameter updates or domain-specific fine-tuning.
Where Pith is reading between the lines
- The same dynamic-refinement pattern could raise accuracy in other open-set vision tasks that currently rely on static prompts.
- Prompt interaction may reduce reliance on ever-larger foundation models for narrow, high-stakes applications.
- Real-time surveillance systems could become practical on modest hardware once query refinement replaces model scaling.
Load-bearing premise
An LLM agent can consistently produce useful query refinements from visual context without adding errors or failing on ambiguous cases.
What would settle it
On a new video dataset containing previously unseen anomaly types, measure whether QVAD accuracy falls below that of static-prompt baselines or larger-model competitors.
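One way such a comparison could be scored is sketched below, assuming frame-level ROC AUC as the measure (the referee report mentions AUC and mAP as the expected metrics). The helper `roc_auc` is a hypothetical name, and the labels and scores are invented toy values that only exercise the code; they are not results from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random anomalous frame
    outscores a random normal frame (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both normal and anomalous frames are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = anomalous frame, 0 = normal frame.
labels = [0, 0, 1, 1, 0, 1]
dynamic_scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]  # hypothetical refined-query run
static_scores = [0.1, 0.6, 0.8, 0.4, 0.3, 0.7]   # hypothetical static-prompt run

print("dynamic:", roc_auc(labels, dynamic_scores))
print("static: ", roc_auc(labels, static_scores))
```

The pairwise loop is quadratic in the number of frames, which is acceptable for a sketch; a rank-based computation would be preferable on full-length videos.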
Original abstract
Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This "prompt-updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes QVAD, a question-centric agentic framework for training-free video anomaly detection. It posits that static prompts are the primary bottleneck rather than model capacity, and introduces an LLM agent that iteratively refines queries based on visual context to guide smaller VLMs toward high-fidelity captions and precise semantic reasoning without any parameter updates or fine-tuning. The work claims state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal while using a fraction of the parameters of competing methods, with high inference speeds, minimal memory footprints, and strong generalizability on the single-scene ComplexVAD dataset.
Significance. If the performance and efficiency claims hold after proper validation, the result would be significant for practical VAD deployment: it offers a path to advanced anomaly detection on edge devices by leveraging dynamic agentic interaction to unlock lightweight VLMs rather than scaling model size or requiring training. The approach also highlights a general strategy for prompt-updating in open-set vision tasks.
major comments (2)
- [Abstract] Abstract: The manuscript asserts state-of-the-art results on UCF-Crime, XD-Violence, and UBNormal 'using a fraction of the parameters required by competing methods' together with high inference speeds and minimal memory, yet supplies no quantitative metrics (AUC, mAP), baseline tables, ablation studies, or error analysis, making the central performance claim impossible to evaluate.
- [Framework description] Framework description (implicit in abstract and method overview): The load-bearing claim is that iterative LLM-driven query refinement unlocks latent capabilities in smaller VLMs beyond what static prompts achieve. No controlled ablation isolating the iterative agentic loop versus a single well-engineered static prompt (or fixed few-shot prompt) on the identical VLM is described, leaving open the possibility that gains derive from VLM choice and prompt quality alone rather than the proposed dynamic dialogue.
minor comments (1)
- [Abstract] Abstract: The phrase 'prompt-updating' mechanism is placed in quotes but is not formally defined or contrasted with existing iterative prompting or chain-of-thought techniques in the VLM literature.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. These points highlight opportunities to improve the clarity of our performance claims and to more rigorously isolate the contribution of the iterative agentic mechanism. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Abstract] Abstract: The manuscript asserts state-of-the-art results on UCF-Crime, XD-Violence, and UBNormal 'using a fraction of the parameters required by competing methods' together with high inference speeds and minimal memory, yet supplies no quantitative metrics (AUC, mAP), baseline tables, ablation studies, or error analysis, making the central performance claim impossible to evaluate.
  Authors: We agree that the abstract, due to length constraints, does not embed specific numerical values. The full manuscript contains detailed experimental sections with AUC and mAP results on UCF-Crime, XD-Violence, and UBNormal, along with direct comparisons of parameter counts, inference latency, and memory usage against competing methods. In the revised version we will expand the abstract to include the key quantitative metrics (e.g., AUC scores and relative parameter reduction) while preserving conciseness, and we will ensure the baseline tables and efficiency results are explicitly cross-referenced.
  revision: yes
- Referee: [Framework description] Framework description (implicit in abstract and method overview): The load-bearing claim is that iterative LLM-driven query refinement unlocks latent capabilities in smaller VLMs beyond what static prompts achieve. No controlled ablation isolating the iterative agentic loop versus a single well-engineered static prompt (or fixed few-shot prompt) on the identical VLM is described, leaving open the possibility that gains derive from VLM choice and prompt quality alone rather than the proposed dynamic dialogue.
  Authors: This observation is fair and directly targets the core hypothesis. While the manuscript already shows that QVAD outperforms prior training-free VAD approaches (which rely on static prompts and larger models), it does not contain a controlled head-to-head ablation of iterative refinement versus a single, carefully engineered static prompt on the exact same lightweight VLM backbone. We will add this ablation study to the revised manuscript, reporting performance differences on the same VLM to isolate the benefit of the dynamic, context-aware query loop.
  revision: yes
Circularity Check
No circularity; procedural framework with no equations or self-referential derivations
The paper describes a new agentic workflow for video anomaly detection that combines existing VLMs and LLMs through iterative query refinement. No mathematical derivations, equations, fitted parameters, or uniqueness theorems appear in the provided text. The central claims rest on empirical benchmark results rather than any step that reduces by construction to its own inputs. Self-citations, if present, are not load-bearing for the method itself, which is presented as a training-free procedural composition of off-the-shelf models. This is the normal case of a non-circular systems paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Smaller VLMs possess latent capabilities for high-fidelity captioning and semantic reasoning that can be unlocked via iterative LLM-guided prompt refinement without training.
invented entities (1)
- QVAD agentic framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: This “prompt-updating” mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
  MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...