
arxiv: 2604.03040 · v1 · submitted 2026-04-03 · 💻 cs.CV

QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection

Pith reviewed 2026-05-13 20:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords video anomaly detection · vision-language models · agentic framework · training-free · query refinement · UCF-Crime · XD-Violence

The pith

An agentic framework lets smaller vision-language models reach state-of-the-art video anomaly detection by iteratively refining queries through dynamic dialogue with a language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the main obstacle in training-free video anomaly detection is not model size but the fixed nature of prompts. It introduces a question-centric setup in which a language-model agent refines queries step by step using the visual content returned so far, directing a compact vision-language model to produce detailed captions and accurate semantic decisions. This process yields top results on UCF-Crime, XD-Violence, and UBNormal while using far fewer parameters, running at high speed, and fitting in limited memory. The same approach also transfers to the single-scene ComplexVAD dataset without any retraining.

Core claim

QVAD treats VLM-LLM interaction as an ongoing dialogue in which the LLM agent updates prompts on the basis of accumulating visual context. The resulting prompt-updating mechanism elicits high-fidelity captions and precise anomaly reasoning from lightweight VLMs without parameter changes, delivering state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal while using only a fraction of the parameters required by prior methods and maintaining high inference speed with small memory footprints.

What carries the argument

The LLM agent's iterative query-refinement loop, which updates prompts dynamically from visual feedback to steer the VLM.
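
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callable signatures `vlm` and `agent`, the `run_dialogue` function, the `max_turns` and `margin` parameters, and the stopping rule (stop once the score is decisively near 0 or 1) are all assumptions introduced here, and the toy models stand in for a real VLM and LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   vlm(clip_id, prompt) -> caption
#   agent(captions so far) -> (anomaly score in [0, 1], next question)
VLM = Callable[[str, str], str]
Agent = Callable[[List[str]], Tuple[float, str]]


@dataclass
class DialogueResult:
    score: float
    captions: List[str] = field(default_factory=list)
    queries: List[str] = field(default_factory=list)


def run_dialogue(clip_id: str, vlm: VLM, agent: Agent,
                 init_prompt: str = "Describe the scene.",
                 max_turns: int = 3, margin: float = 0.2) -> DialogueResult:
    """Refine queries until the score is decisive or the turn budget runs out."""
    queries = [init_prompt]
    captions = [vlm(clip_id, init_prompt)]      # turn 0: broad caption
    score, question = agent(captions)
    turn = 1
    # Assumed stopping rule: keep asking while the score is still ambiguous.
    while turn < max_turns and margin < score < 1 - margin:
        queries.append(question)
        captions.append(vlm(clip_id, question))  # targeted follow-up caption
        score, question = agent(captions)
        turn += 1
    return DialogueResult(score, captions, queries)


# Toy stand-ins so the sketch runs without any model weights.
def toy_vlm(clip_id: str, prompt: str) -> str:
    if "hand" in prompt:
        return "A person slips an item into a bag."
    return "A person stands near a shelf."


def toy_agent(captions: List[str]) -> Tuple[float, str]:
    if any("slips an item" in c for c in captions):
        return 0.9, ""
    return 0.5, "What is the person doing with their hands?"


result = run_dialogue("clip", toy_vlm, toy_agent)
print(result.score, result.queries)
```

In this toy run the broad turn-0 caption leaves the score ambiguous, so the agent asks one targeted question and the second caption resolves it. No model parameters change at any point; only the prompt does.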

If this is right

  • State-of-the-art accuracy on UCF-Crime, XD-Violence, and UBNormal with far fewer parameters than competing methods.
  • High inference speed and minimal memory use that supports deployment on edge devices.
  • Strong transfer to the single-scene ComplexVAD dataset without retraining.
  • No need for parameter updates or domain-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dynamic-refinement pattern could raise accuracy in other open-set vision tasks that currently rely on static prompts.
  • Prompt interaction may reduce reliance on ever-larger foundation models for narrow, high-stakes applications.
  • Real-time surveillance systems could become practical on modest hardware once query refinement replaces model scaling.

Load-bearing premise

An LLM agent can consistently produce useful query refinements from visual context without adding errors or failing on ambiguous cases.

What would settle it

On a new video dataset containing previously unseen anomaly types, measure whether QVAD accuracy falls below that of static-prompt baselines or larger-model competitors.
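
One way such a comparison could be scored is with frame-level ROC AUC, the metric family the referee report mentions. The sketch below is only an illustration under stated assumptions: the `roc_auc` helper is written here from the pairwise-ranking definition of AUC, and the labels and scores are made-up toy values, not results from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties between an anomalous and a normal score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only (NOT measurements from the paper).
labels = [0, 0, 1, 1]                      # 1 = anomalous segment
static_scores = [0.2, 0.6, 0.5, 0.7]       # hypothetical static-prompt baseline
dynamic_scores = [0.1, 0.3, 0.8, 0.9]      # hypothetical refined-query scores

auc_static = roc_auc(labels, static_scores)
auc_dynamic = roc_auc(labels, dynamic_scores)
print(auc_static, auc_dynamic)
# The premise would be under pressure if, on unseen anomaly types,
# the refined-query AUC fell below the static-prompt AUC.
print(auc_dynamic < auc_static)
```

The pairwise form is quadratic in the number of samples, which is fine for a small sanity check; a rank-based implementation would be preferable for full benchmark runs.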

Figures

Figures reproduced from arXiv: 2604.03040 by Hamza Karim, Lokman Bekit, Nghia T Nguyen, Yasin Yilmaz.

Figure 1: Overview of the proposed QVAD framework. The VLM and LLM agents engage in an iterative dialogue, where the LLM generates a query q_t conditioned on the captions C_t produced by the VLM and directs it back to the VLM. This feedback loop progressively refines their shared understanding of the scene.
    Similarly, “Follow the Rules” (Yang et al., 2024) explores agentic reasoning by translating normality definitions…
Figure 2: Detailed architecture of the proposed QVAD framework.
    3.2. Iterative Agentic Reasoning. The core of our framework is an iterative dialogue loop where an LLM Agent (A) refines its understanding of the visual input X_t by querying a VLM Perception module (P). Initialization. At turn k = 0, the perception module generates an initial comprehensive caption C_t based on a broad system prompt P_init (e.g., “Describe…
Figure 3: Qualitative comparison of Static vs. Dynamic Prompting. In Turn 0 (standard VLM prompting), the model captures general scene dynamics but misses fine-grained semantic cues, leading to False Negatives. In Turn 1 or on Turn 2, the QVAD Agent hypothesizes a potential anomaly and generates a targeted query, correcting the prediction without parameter updates.
Figure 4: Examples of anomaly scores and reasoning for videos from UCF-Crime and XD-Violence test set.
Figure 5: Exact anomaly scoring criteria for ComplexVAD.
Figure 6: Exact anomaly scoring criteria for XD-Violence.
    UBNormal System Prompt Definitions: STRICT RULES FOR ANOMALY SCORING: 1. SCOPE: Abuse, Arrest, Arson, Assault, RoadAccidents, Shooting, Shoplifting, Stealing, Vandalism, Fighting, Explosion. 2. FACT-BASED: Analyze ONLY explicit visuals. Do NOT assume intent. Consider duration/obstruction/risk cues. 3. ANOMALY = 1 (Immediate or Emerging Public Safety Risk) • …
Figure 7: Exact anomaly scoring criteria for UBNormal.
Original abstract

Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This “prompt-updating” mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes QVAD, a question-centric agentic framework for training-free video anomaly detection. It posits that static prompts are the primary bottleneck rather than model capacity, and introduces an LLM agent that iteratively refines queries based on visual context to guide smaller VLMs toward high-fidelity captions and precise semantic reasoning without any parameter updates or fine-tuning. The work claims state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal while using a fraction of the parameters of competing methods, with high inference speeds, minimal memory footprints, and strong generalizability on the single-scene ComplexVAD dataset.

Significance. If the performance and efficiency claims hold after proper validation, the result would be significant for practical VAD deployment: it offers a path to advanced anomaly detection on edge devices by leveraging dynamic agentic interaction to unlock lightweight VLMs rather than scaling model size or requiring training. The approach also highlights a general strategy for prompt-updating in open-set vision tasks.

major comments (2)
  1. [Abstract] Abstract: The manuscript asserts state-of-the-art results on UCF-Crime, XD-Violence, and UBNormal 'using a fraction of the parameters required by competing methods' together with high inference speeds and minimal memory, yet supplies no quantitative metrics (AUC, mAP), baseline tables, ablation studies, or error analysis, making the central performance claim impossible to evaluate.
  2. [Framework description] Framework description (implicit in abstract and method overview): The load-bearing claim is that iterative LLM-driven query refinement unlocks latent capabilities in smaller VLMs beyond what static prompts achieve. No controlled ablation isolating the iterative agentic loop versus a single well-engineered static prompt (or fixed few-shot prompt) on the identical VLM is described, leaving open the possibility that gains derive from VLM choice and prompt quality alone rather than the proposed dynamic dialogue.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'prompt-updating' mechanism is placed in quotes but is not formally defined or contrasted with existing iterative prompting or chain-of-thought techniques in the VLM literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. These points highlight opportunities to improve the clarity of our performance claims and to more rigorously isolate the contribution of the iterative agentic mechanism. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The manuscript asserts state-of-the-art results on UCF-Crime, XD-Violence, and UBNormal 'using a fraction of the parameters required by competing methods' together with high inference speeds and minimal memory, yet supplies no quantitative metrics (AUC, mAP), baseline tables, ablation studies, or error analysis, making the central performance claim impossible to evaluate.

    Authors: We agree that the abstract, due to length constraints, does not embed specific numerical values. The full manuscript contains detailed experimental sections with AUC and mAP results on UCF-Crime, XD-Violence, and UBNormal, along with direct comparisons of parameter counts, inference latency, and memory usage against competing methods. In the revised version we will expand the abstract to include the key quantitative metrics (e.g., AUC scores and relative parameter reduction) while preserving conciseness, and we will ensure the baseline tables and efficiency results are explicitly cross-referenced. revision: yes

  2. Referee: [Framework description] Framework description (implicit in abstract and method overview): The load-bearing claim is that iterative LLM-driven query refinement unlocks latent capabilities in smaller VLMs beyond what static prompts achieve. No controlled ablation isolating the iterative agentic loop versus a single well-engineered static prompt (or fixed few-shot prompt) on the identical VLM is described, leaving open the possibility that gains derive from VLM choice and prompt quality alone rather than the proposed dynamic dialogue.

    Authors: This observation is fair and directly targets the core hypothesis. While the manuscript already shows that QVAD outperforms prior training-free VAD approaches (which rely on static prompts and larger models), it does not contain a controlled head-to-head ablation of iterative refinement versus a single, carefully engineered static prompt on the exact same lightweight VLM backbone. We will add this ablation study to the revised manuscript, reporting performance differences on the same VLM to isolate the benefit of the dynamic, context-aware query loop. revision: yes

Circularity Check

0 steps flagged

No circularity; procedural framework with no equations or self-referential derivations

Full rationale

The paper describes a new agentic workflow for video anomaly detection that combines existing VLMs and LLMs through iterative query refinement. No mathematical derivations, equations, fitted parameters, or uniqueness theorems appear in the provided text. The central claims rest on empirical benchmark results rather than any step that reduces by construction to its own inputs. Self-citations, if present, are not load-bearing for the method itself, which is presented as a training-free procedural composition of off-the-shelf models. This is the normal case of a non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that dynamic prompting can surface latent capabilities in off-the-shelf VLMs; no free parameters or new entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: Smaller VLMs possess latent capabilities for high-fidelity captioning and semantic reasoning that can be unlocked via iterative LLM-guided prompt refinement without training.
    This premise underpins the claim that prompt-updating replaces the need for massive models or parameter updates.
invented entities (1)
  • QVAD agentic framework (no independent evidence)
    purpose: To enable training-free, efficient video anomaly detection through dynamic VLM-LLM dialogue.
    New procedural construct introduced by the paper; no independent evidence outside the abstract is provided.

pith-pipeline@v0.9.0 · 5519 in / 1203 out tokens · 55004 ms · 2026-05-13T20:17:10.140191+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    This “prompt-updating” mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...
