QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection

Hamza Karim; Lokman Bekit; Nghia T Nguyen; Yasin Yilmaz

arxiv: 2604.03040 · v1 · submitted 2026-04-03 · 💻 cs.CV

Pith reviewed 2026-05-13 20:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords video anomaly detection · vision-language models · agentic framework · training-free · query refinement · UCF-Crime · XD-Violence

The pith

An agentic framework lets smaller vision-language models reach state-of-the-art video anomaly detection by iteratively refining queries through dynamic dialogue with a language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the main obstacle in training-free video anomaly detection is not model size but the fixed nature of prompts. It introduces a question-centric setup in which a language-model agent refines queries step by step using the visual content returned so far, directing a compact vision-language model to produce detailed captions and accurate semantic decisions. This process yields top results on UCF-Crime, XD-Violence, and UBNormal while using far fewer parameters, running at high speed, and fitting in limited memory. The same approach also transfers to the single-scene ComplexVAD dataset without any retraining.

Core claim

QVAD treats VLM-LLM interaction as an ongoing dialogue in which the LLM agent updates prompts on the basis of accumulating visual context. The resulting prompt-updating mechanism elicits high-fidelity captions and precise anomaly reasoning from lightweight VLMs without parameter changes, delivering state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal while using only a fraction of the parameters required by prior methods and maintaining high inference speed with small memory footprints.

What carries the argument

The LLM agent's iterative query-refinement loop, which updates prompts dynamically from visual feedback to steer the VLM.
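The refinement loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `VLM` and `Agent` callables, the `refine` function, the `max_turns` default, and the toy stand-ins are all hypothetical names introduced here. The only parts taken from the source are the overall shape (a broad initial caption at turn 0, then agent-generated follow-up questions answered by the VLM) and the fact that no model parameters change.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; the paper's real model wrappers are not shown on this page.
VLM = Callable[[str, str], str]  # (clip_id, prompt) -> caption or answer
Agent = Callable[[List[str]], Tuple[float, Optional[str]]]  # context -> (score, next question or None)


@dataclass
class DialogueResult:
    score: float
    context: List[str] = field(default_factory=list)


def refine(clip_id: str, vlm: VLM, agent: Agent,
           init_prompt: str, max_turns: int = 2) -> DialogueResult:
    """Run the caption -> question -> answer loop for one clip."""
    # Turn 0: broad caption from a generic prompt (static prompting stops here).
    context = [vlm(clip_id, init_prompt)]
    score, question = agent(context)
    for _ in range(max_turns):
        if question is None:  # the agent has nothing further to ask
            break
        answer = vlm(clip_id, question)  # targeted query, same frozen VLM
        context.append(f"Q: {question} A: {answer}")
        score, question = agent(context)  # re-score with the accumulated context
    return DialogueResult(score=score, context=context)


# Toy stand-ins so the sketch runs without any model.
def toy_vlm(clip_id: str, prompt: str) -> str:
    if "holding" in prompt:
        return "The person is holding a crowbar near a car door."
    return "A person stands next to a parked car."


def toy_agent(context: List[str]) -> Tuple[float, Optional[str]]:
    if "crowbar" in " ".join(context):
        return 1.0, None
    if len(context) == 1:
        return 0.0, "What is the person holding?"
    return 0.0, None


result = refine("clip-0", toy_vlm, toy_agent, "Describe the scene.")
print(result.score, len(result.context))
```

In the toy run, the turn-0 caption alone yields a score of 0.0, and the single follow-up question surfaces the detail that flips the decision. This mirrors the static-versus-dynamic contrast the paper draws, with made-up strings in place of real model output.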

If this is right

State-of-the-art accuracy on UCF-Crime, XD-Violence, and UBNormal with far fewer parameters than competing methods.
High inference speed and minimal memory use that supports deployment on edge devices.
Strong transfer to the single-scene ComplexVAD dataset without retraining.
No need for parameter updates or domain-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dynamic-refinement pattern could raise accuracy in other open-set vision tasks that currently rely on static prompts.
Prompt interaction may reduce reliance on ever-larger foundation models for narrow, high-stakes applications.
Real-time surveillance systems could become practical on modest hardware once query refinement replaces model scaling.

Load-bearing premise

An LLM agent can consistently produce useful query refinements from visual context without adding errors or failing on ambiguous cases.

What would settle it

On a new video dataset containing previously unseen anomaly types, measure whether QVAD accuracy falls below that of static-prompt baselines or larger-model competitors.

Figures

Figures reproduced from arXiv: 2604.03040 by Hamza Karim, Lokman Bekit, Nghia T Nguyen, Yasin Yilmaz.

**Figure 1.** Overview of the proposed QVAD framework. The VLM and LLM agents engage in an iterative dialogue, where the LLM generates a query q_t conditioned on the captions C_t produced by the VLM and directs it back to the VLM. This feedback loop progressively refines their shared understanding of the scene. Similarly, “Follow the Rules” (Yang et al., 2024) explores agentic reasoning by translating normality definitions…

**Figure 2.** Detailed architecture of the proposed QVAD framework. Section 3.2, Iterative Agentic Reasoning: The core of our framework is an iterative dialogue loop where an LLM Agent (A) refines its understanding of the visual input X_t by querying a VLM Perception module (P). Initialization: at turn k = 0, the perception module generates an initial comprehensive caption C_t based on a broad system prompt P_init (e.g., “Describe…

**Figure 3.** Qualitative comparison of static vs. dynamic prompting. In Turn 0 (standard VLM prompting), the model captures general scene dynamics but misses fine-grained semantic cues, leading to false negatives. In Turn 1 or Turn 2, the QVAD agent hypothesizes a potential anomaly and generates a targeted query, correcting the prediction without parameter updates.

**Figure 4.** Examples of anomaly scores and reasoning for videos from the UCF-Crime and XD-Violence test sets.

**Figure 5.** Exact anomaly scoring criteria for ComplexVAD.

**Figure 6.** Exact anomaly scoring criteria for XD-Violence. UBNormal System Prompt Definitions: STRICT RULES FOR ANOMALY SCORING: 1. SCOPE: Abuse, Arrest, Arson, Assault, RoadAccidents, Shooting, Shoplifting, Stealing, Vandalism, Fighting, Explosion. 2. FACT-BASED: Analyze ONLY explicit visuals. Do NOT assume intent. Consider duration/obstruction/risk cues. 3. ANOMALY = 1 (Immediate or Emerging Public Safety Risk) • …

**Figure 7.** Exact anomaly scoring criteria for UBNormal.

Original abstract

Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This “prompt-updating” mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QVAD's agentic loop for refining VLM prompts in video anomaly detection is a reasonable practical tweak, but the paper still needs direct ablations to show that the iteration step actually drives the gains over static prompts.

QVAD's main move is treating VAD as an ongoing dialogue where an LLM agent updates queries for a smaller VLM based on what it sees, instead of firing one static prompt. That framing is the clearest new element compared to earlier training-free VAD work that just scales up the vision model. The paper does a clean job spelling out why static prompts often fall short on open-set anomalies and how the back-and-forth can pull better captions and scores from lightweight models without any fine-tuning.

The efficiency angle lands well too: they emphasize low memory and high speed, which matters if the goal is real edge deployment for surveillance. The reported results on UCF-Crime, XD-Violence, and UBNormal plus the single-scene test on ComplexVAD give a sense of where it might apply.

The soft spot is the missing control the stress-test note flags. Without a head-to-head run of the same small VLM using a carefully written static prompt or fixed few-shot version, it is hard to know how much the iterative refinement adds versus just better initial prompting or model choice. If those baselines already get close on AUC or mAP, the agentic claim shrinks. The abstract and description do not lay out those exact comparisons or error breakdowns, so the SOTA numbers stay hard to weigh. Citations look standard and the framework stays procedural rather than introducing new equations that could hide circularity.

This is aimed at people working on efficient, training-free CV methods who care about running on constrained hardware. A reader already experimenting with VLMs for detection tasks would get concrete ideas from the dialogue setup. It deserves peer review because the efficiency focus is relevant and the core mechanism is distinct enough that referees can usefully check the ablations and numbers.

Referee Report

2 major / 1 minor

Summary. The paper proposes QVAD, a question-centric agentic framework for training-free video anomaly detection. It posits that static prompts are the primary bottleneck rather than model capacity, and introduces an LLM agent that iteratively refines queries based on visual context to guide smaller VLMs toward high-fidelity captions and precise semantic reasoning without any parameter updates or fine-tuning. The work claims state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal while using a fraction of the parameters of competing methods, with high inference speeds, minimal memory footprints, and strong generalizability on the single-scene ComplexVAD dataset.

Significance. If the performance and efficiency claims hold after proper validation, the result would be significant for practical VAD deployment: it offers a path to advanced anomaly detection on edge devices by leveraging dynamic agentic interaction to unlock lightweight VLMs rather than scaling model size or requiring training. The approach also highlights a general strategy for prompt-updating in open-set vision tasks.

major comments (2)

[Abstract] Abstract: The manuscript asserts state-of-the-art results on UCF-Crime, XD-Violence, and UBNormal 'using a fraction of the parameters required by competing methods' together with high inference speeds and minimal memory, yet supplies no quantitative metrics (AUC, mAP), baseline tables, ablation studies, or error analysis, making the central performance claim impossible to evaluate.
[Framework description] Framework description (implicit in abstract and method overview): The load-bearing claim is that iterative LLM-driven query refinement unlocks latent capabilities in smaller VLMs beyond what static prompts achieve. No controlled ablation isolating the iterative agentic loop versus a single well-engineered static prompt (or fixed few-shot prompt) on the identical VLM is described, leaving open the possibility that gains derive from VLM choice and prompt quality alone rather than the proposed dynamic dialogue.
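The controlled comparison requested here amounts to scoring the same clips twice with the same VLM, once with a single static prompt and once with the iterative loop, and comparing a ranking metric such as AUC. A minimal sketch of that bookkeeping follows. It assumes frame- or clip-level binary labels and real-valued anomaly scores; `roc_auc`, `ablation_gap`, and the toy numbers are illustrative names and values introduced here, not the paper's evaluation code or results.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ablation_gap(labels: Sequence[int],
                 static_scores: Sequence[float],
                 iterative_scores: Sequence[float]) -> float:
    """AUC(iterative) - AUC(static) on identical clips and an identical VLM."""
    return roc_auc(labels, iterative_scores) - roc_auc(labels, static_scores)


# Made-up toy scores for four clips; these are NOT results from the paper.
labels = [0, 0, 1, 1]
static_scores = [0.1, 0.6, 0.5, 0.9]
iterative_scores = [0.1, 0.3, 0.7, 0.9]
print(roc_auc(labels, static_scores), roc_auc(labels, iterative_scores))
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch. A real evaluation over full benchmark test sets would more likely use a sort-based implementation from an established library.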

minor comments (1)

[Abstract] Abstract: The phrase 'prompt-updating' mechanism is placed in quotes but is not formally defined or contrasted with existing iterative prompting or chain-of-thought techniques in the VLM literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. These points highlight opportunities to improve the clarity of our performance claims and to more rigorously isolate the contribution of the iterative agentic mechanism. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: The manuscript asserts state-of-the-art results on UCF-Crime, XD-Violence, and UBNormal 'using a fraction of the parameters required by competing methods' together with high inference speeds and minimal memory, yet supplies no quantitative metrics (AUC, mAP), baseline tables, ablation studies, or error analysis, making the central performance claim impossible to evaluate.

Authors: We agree that the abstract, due to length constraints, does not embed specific numerical values. The full manuscript contains detailed experimental sections with AUC and mAP results on UCF-Crime, XD-Violence, and UBNormal, along with direct comparisons of parameter counts, inference latency, and memory usage against competing methods. In the revised version we will expand the abstract to include the key quantitative metrics (e.g., AUC scores and relative parameter reduction) while preserving conciseness, and we will ensure the baseline tables and efficiency results are explicitly cross-referenced.
revision: yes

Referee: [Framework description] Framework description (implicit in abstract and method overview): The load-bearing claim is that iterative LLM-driven query refinement unlocks latent capabilities in smaller VLMs beyond what static prompts achieve. No controlled ablation isolating the iterative agentic loop versus a single well-engineered static prompt (or fixed few-shot prompt) on the identical VLM is described, leaving open the possibility that gains derive from VLM choice and prompt quality alone rather than the proposed dynamic dialogue.

Authors: This observation is fair and directly targets the core hypothesis. While the manuscript already shows that QVAD outperforms prior training-free VAD approaches (which rely on static prompts and larger models), it does not contain a controlled head-to-head ablation of iterative refinement versus a single, carefully engineered static prompt on the exact same lightweight VLM backbone. We will add this ablation study to the revised manuscript, reporting performance differences on the same VLM to isolate the benefit of the dynamic, context-aware query loop.
revision: yes

Circularity Check

0 steps flagged

No circularity; procedural framework with no equations or self-referential derivations

The paper describes a new agentic workflow for video anomaly detection that combines existing VLMs and LLMs through iterative query refinement. No mathematical derivations, equations, fitted parameters, or uniqueness theorems appear in the provided text. The central claims rest on empirical benchmark results rather than any step that reduces by construction to its own inputs. Self-citations, if present, are not load-bearing for the method itself, which is presented as a training-free procedural composition of off-the-shelf models. This is the normal case of a non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that dynamic prompting can surface latent capabilities in off-the-shelf VLMs; no free parameters or new entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Smaller VLMs possess latent capabilities for high-fidelity captioning and semantic reasoning that can be unlocked via iterative LLM-guided prompt refinement without training.
This premise underpins the claim that prompt-updating replaces the need for massive models or parameter updates.

invented entities (1)

QVAD agentic framework (no independent evidence)
purpose: To enable training-free, efficient video anomaly detection through dynamic VLM-LLM dialogue.
New procedural construct introduced by the paper; no independent evidence outside the abstract is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: This “prompt-updating” mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
cs.CV · 2026-05 · unverdicted · novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · cited by 1 Pith paper

[1] Answer directly in 1 single sentence if possible (max 2)

[2] Do not describe the whole scene, only answer the specific question. B.5. LLM Prompting Strategy: The LLM is employed in two complementary roles: (i) anomaly scoring and (ii) clarifying question generation. While the prompt used for question generation remains fixed across datasets, th...

[3] Target specific ambiguous details (objects, actions, interactions)

[4] Help distinguish between normal and criminal behavior

[5] Be answerable by observing the video footage

[6] Focus on concrete, observable elements

[7] Do not ask questions that force VLM to hallucinate a crime. If the question needs historical context (e.g., “Are there patterns?”), phrase it naturally – the system will automatically provide relevant context. Generate ONLY the question text, no additional commentary. B.5.3. ANOMALY DEFINITION ON OTHER DATASETS: To accommodate varying definitions of “anomaly,”...

[8] ANOMALY = 1 (Immediate or Emerging Public Safety Risk) • A) HUMAN CRIMINALITY: Assault/Fighting (punches, kicks); Shooting (weapons, panic); Stealing/Burglary (concealment, force); Vandalism. • B) HAZARDOUS SITUATIONS: RoadAccidents: Collision; Vehicle on sidewalk/grass; Blocking active lanes; Stopped in traffic; Visible damage. Explosion/Arson: Flash/blast; T...

[9] NO ANOMALY = 0: Walking/standing briefly; Jogging; Cars at red lights/congestion; Parked legally; Light fog/rain. If persistence/danger not explicit, classify as 0. Figure 7. Exact anomaly scoring criteria for UBNormal. B.6. Hyperparameters and Configuration: Table 12 summarizes the key hyperparameters used in our experiments. B.7. Parallel Processing Archit...

[10] Vector Memory Isolation: Each video receives its own VectorMemory instance, preventing cross-video context contamination

[11] Shared Encoder: The sentence embedding model is loaded once and shared across threads (read-only during inference), reducing memory overhead. 3. Model Locks: Thread-safe locks (vlm_lock, llm_lock) serialize access to the VLM and LLM during generation. This architecture enables efficient parallel processing of multiple videos while guaranteeing zero cross-vide...
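The isolation and locking scheme described in entries [10] and [11] can be sketched with the Python standard library. This is a rough illustration under stated assumptions rather than the authors' code. `VectorMemory` here is a bare stand-in for the class named in the source, `fake_vlm` and `fake_llm` are hypothetical placeholders for model calls, and the lock names follow the source's apparent identifiers (the extracted text shows them with the underscore missing). The shared read-only sentence encoder is omitted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class VectorMemory:
    """Bare stand-in for the per-video memory named in the source."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def add(self, text: str) -> None:
        self.entries.append(text)


# One lock per shared model, so only one thread generates at a time.
vlm_lock = threading.Lock()
llm_lock = threading.Lock()


def fake_vlm(video_id: str, turn: int) -> str:
    # Hypothetical placeholder for a VLM captioning call.
    return f"{video_id}: caption {turn}"


def fake_llm(memory: VectorMemory) -> int:
    # Hypothetical placeholder for an LLM scoring call.
    return len(memory.entries)


def process_video(video_id: str, turns: int = 3) -> VectorMemory:
    memory = VectorMemory()  # fresh instance per video: no shared context
    for turn in range(turns):
        with vlm_lock:  # serialize access to the shared VLM
            caption = fake_vlm(video_id, turn)
        memory.add(caption)
        with llm_lock:  # serialize access to the shared LLM
            fake_llm(memory)
    return memory


def run(video_ids: List[str], workers: int = 4) -> Dict[str, VectorMemory]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        memories = list(pool.map(process_video, video_ids))
    return dict(zip(video_ids, memories))


memories = run(["vid-a", "vid-b", "vid-c"])
print({vid: len(mem.entries) for vid, mem in memories.items()})
```

Because each worker builds its own `VectorMemory`, no locking is needed around the memory itself. Only the shared models are guarded, which matches the division of responsibilities the excerpt describes.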
