pith. machine review for the scientific record.

arxiv: 2512.13671 · v2 · submitted 2025-12-15 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords industrial anomaly detection · agentic vision-language model · adaptive memory · Perceptive Zoomer · reinforcement learning · MMAD benchmark · multi-round reasoning · anomaly analysis

The pith

An agentic vision-language framework improves industrial anomaly detection by letting the model iteratively zoom in on defects and retrieve comparisons or external knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Industrial anomaly detection struggles with subtle, localized defects that single-pass vision-language models often overlook, and existing methods lack ways to actively gather more evidence during inspection. AgentIAD addresses this by giving the model a unified set of actions to access visual memory for fine-grained zooming and retrieved memory for cross-instance checks or external facts, allowing multi-round reasoning. The system is trained in two stages: first supervised fine-tuning to teach tool use, then reinforcement learning to improve long-horizon decision making under sparse rewards. Experiments on the MMAD benchmark show a 5.92 percent accuracy gain over prior state-of-the-art methods using the same backbone, along with more interpretable outputs.

Core claim

AgentIAD is an agentic vision-language model that progressively inspects industrial images through a unified action space, dynamically calling the Perceptive Zoomer to examine local regions, the Web Searcher for external knowledge, and the Comparative Retriever for cross-instance verification. These behaviors are learned via tool-aware supervised fine-tuning followed by agentic reinforcement learning to cope with sparse supervision.

What carries the argument

The unified action space that lets the agent switch between visual memory via the Perceptive Zoomer and retrieved memory via the Web Searcher and Comparative Retriever, enabling multi-round perception-action reasoning.
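
To make the mechanism concrete, here is a minimal sketch of what such a perception-action loop could look like, assuming a VLM that emits either a tool call or a final answer each round. All class and method names (Evidence, AgentState, decide, force_answer) are hypothetical stand-ins inferred from the paper's description, not the authors' implementation.

```python
# Minimal sketch of a multi-round perception-action loop; all names and
# signatures are hypothetical, inferred from the paper's description.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    kind: str        # "zoom", "search", or "retrieve"
    content: object  # cropped ROI, retrieved snippet, or reference exemplar

@dataclass
class AgentState:
    image: object
    evidence: list = field(default_factory=list)

def inspect(image, vlm, tools, max_rounds=5):
    """Gather evidence round by round until the VLM commits to a verdict."""
    state = AgentState(image=image)
    for _ in range(max_rounds):
        # The VLM reads the image plus the evidence gathered so far and
        # either answers or requests another tool call.
        action = vlm.decide(state)
        if action.name == "answer":
            return action.payload
        tool = tools[action.name]  # "zoom" -> PZ, "search" -> WS, "retrieve" -> CR
        state.evidence.append(Evidence(kind=action.name,
                                       content=tool(image, **action.args)))
    # Round budget exhausted: force a final classification from the evidence.
    return vlm.force_answer(state)
```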

If this is right

  • Classification accuracy rises by 5.92% over prior methods on the MMAD benchmark under identical backbone conditions.
  • Anomaly analysis becomes more reliable and interpretable through explicit evidence-gathering steps.
  • The model can handle subtle defects by collecting complementary visual and external evidence across multiple rounds rather than in one pass.
  • Long-horizon decision policies can be learned effectively even when supervision is sparse by separating tool familiarization from policy refinement.
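
As a sketch of the last point, the two stages could be staged roughly as below: imitation on expert trajectories first, then policy refinement where the only reward is terminal correctness. The function names and the binary reward are assumptions for exposition; the paper's supplementary material describes the RL stage as GRPO-based, which this simplified policy-gradient loop does not reproduce.

```python
# Illustrative two-stage schedule: tool-aware SFT, then agentic RL with a
# sparse terminal reward. Names and the reward shape are assumptions.

def stage1_sft(model, demonstrations, epochs=3):
    """Tool familiarization: imitate expert trajectories so the model
    learns tool syntax and when each tool is typically invoked."""
    for _ in range(epochs):
        for trajectory in demonstrations:      # sequence of (state, action)
            for state, action in trajectory:
                model.step_supervised(state, action)
    return model

def stage2_rl(model, env, episodes=10_000):
    """Policy refinement: the only reward is terminal correctness, so
    credit must flow back through every intermediate tool call."""
    for _ in range(episodes):
        states, actions = [], []
        state, done = env.reset(), False
        while not done:
            action = model.sample_action(state)
            states.append(state)
            actions.append(action)
            state, done = env.step(action)
        reward = 1.0 if env.final_answer_correct() else 0.0  # sparse signal
        model.step_policy_gradient(states, actions, reward)
    return model
```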

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The memory-augmented agent approach could extend to other inspection domains such as medical imaging where fine details and cross-case comparisons matter.
  • If the action space generalizes, similar agentic setups might reduce the need for ever-larger single-pass models in quality-control pipelines.
  • The two-stage training pattern offers a practical template for teaching agents to use external tools when direct rewards are infrequent.

Load-bearing premise

The two-stage training successfully learns effective long-horizon policies under sparse supervision for the unified action space.

What would settle it

If a matched comparison on the MMAD benchmark, using the same backbone, showed no accuracy improvement over the previous state of the art, the performance claim would not hold.

Figures

Figures reproduced from arXiv: 2512.13671 by Junwen Miao, Lida Huang, Penghui Du, Runze He, Yan Wang, Yi Liu, Yingying Fan, Yu Wang.

Figure 1
Figure 1: Motivation. Non-tool MLLMs rely on a single global pass and frequently misclassify subtle defects (left). AgentIAD corrects these failures through tool-driven reasoning (right): the Perceptive Zoomer exposes fine-grained abnormal cues, and the Comparative Retriever verifies them against a normal reference. view at source ↗
Figure 2
Figure 2: Overview of AgentIAD. The agent performs multi-round reasoning through a tool-augmented Chain-of-Thought (CoT). At each step, it may invoke the Perceptive Zoomer (PZ) to inspect local regions or the Comparative Retriever (CR) to query normal exemplars. Training consists of two stages: (a) Perceptive Supervised Fine-Tuning for grounding reasoning with visual actions, and (b) Agentic Reinforcement Learning f… view at source ↗
Figure 3
Figure 3: Inference cases of AgentIAD. Left: a perceptive-trajectory case where PZ zooms into the defect region, enabling correct classification. Right: a comparative-trajectory case where PZ alone is insufficient; the agent invokes CR to compare against a normal exemplar and corrects its decision. view at source ↗
Figure 4
Figure 4: Trajectory construction pipeline. The process includes: (1) data preparation from MMAD, (2) GPT-4o multi-step reasoning (CoT-1/2/3), and (3) trajectory assembly with structured tool calls. Both the perceptive (PZ-only) and comparative (PZ+CR) trajectories are derived from this unified pipeline. Output format: <think> Explain your visual reasoning, considering both the original image and the ROI information… view at source ↗
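
Figure 4's caption gives just enough of the output format to sketch how a single SFT trajectory might be assembled. This is a guess at the schema: only the <think> tag appears in the caption; the <tool_call> and <answer> tags and all field names are invented for illustration.

```python
# Guess at the trajectory-assembly step in Figure 4. The <think> tag appears
# in the caption; <tool_call>, <answer>, and the field names are invented.

def assemble_trajectory(sample, cot_steps, tool_calls):
    """Interleave GPT-4o reasoning steps (CoT-1/2/3) with structured tool
    calls to form one SFT target string."""
    turns = []
    for cot, call in zip(cot_steps, tool_calls + [None]):
        turns.append(f"<think>{cot}</think>")
        if call is not None:  # the final reasoning step ends in an answer
            turns.append(f"<tool_call>{call}</tool_call>")
    turns.append(f"<answer>{sample['label']}</answer>")
    return {"image": sample["image_path"], "target": "\n".join(turns)}

# A perceptive (PZ-only) trajectory carries a single zoom call; a comparative
# (PZ+CR) trajectory adds a retrieval call before the final answer.
```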
read the original abstract

Industrial anomaly detection (IAD) is challenging due to the subtle and highly localized nature of many defects, which single-pass vision–language models (VLMs) often fail to capture. Moreover, existing approaches lack mechanisms to actively acquire complementary evidence during inference. We propose AgentIAD, an agentic vision–language framework that enables iterative industrial inspection through a unified action space. The agent dynamically accesses two forms of memory during inspection: visual memory via the Perceptive Zoomer (PZ) for fine-grained local analysis, and retrieved memory via the Web Searcher (WS) and Comparative Retriever (CR) for external knowledge acquisition and cross-instance verification. This design allows the model to progressively gather evidence through multi-round perception–action reasoning. To effectively learn such policies under sparse supervision, AgentIAD adopts a two-stage training strategy: tool-aware supervised fine-tuning first initializes structured reasoning and memory-access behaviors, followed by agentic reinforcement learning to refine long-horizon decision policies. Extensive experiments show that, under the same backbone, AgentIAD improves classification accuracy by 5.92% over the previous state-of-the-art method on the MMAD benchmark while providing more reliable and interpretable anomaly analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AgentIAD, an agentic vision-language framework for industrial anomaly detection that performs iterative inspection via a unified action space. It augments the model with visual memory through the Perceptive Zoomer and retrieved memory through the Web Searcher and Comparative Retriever, enabling multi-round perception-action reasoning. Training proceeds in two stages: tool-aware supervised fine-tuning to initialize structured behaviors, followed by agentic reinforcement learning to optimize long-horizon policies under sparse rewards. The central empirical claim is a 5.92% classification accuracy improvement over prior state-of-the-art on the MMAD benchmark when using the same backbone, accompanied by assertions of improved reliability and interpretability.

Significance. If the reported gains are shown to arise from learned iterative reasoning rather than the base tools or SFT initialization alone, the work would meaningfully extend anomaly detection beyond single-pass VLMs by demonstrating active evidence gathering for subtle, localized defects. The two-stage training recipe and memory-augmented action space could serve as a template for other sparse-reward vision tasks, provided the RL component demonstrably alters policy behavior.

major comments (2)
  1. [Abstract / Experiments] The 5.92% MMAD accuracy gain is reported without baselines, the number of runs, statistical tests, an ablation isolating the RL stage, or a comparison of multi-round trajectories before versus after the agentic RL phase, preventing evaluation of whether the claimed improvement stems from learned long-horizon policies.
  2. [Training Strategy] The claim that agentic RL learns effective long-horizon policies under sparse supervision is load-bearing for attributing gains to the agentic framework, yet no verification is supplied that the RL stage yields measurably different action sequences, successful credit assignment, or policy divergence from the tool-aware SFT checkpoint.
minor comments (2)
  1. [Method] The unified action space would benefit from an explicit enumeration or pseudocode listing of available actions and their arguments to clarify how perception, zoom, search, and retrieval are interleaved (a hedged sketch follows this list).
  2. [Method] Notation for the two memory modules (PZ, WS, CR) is introduced without a consolidated table of their inputs, outputs, and integration points into the agent loop.
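
For concreteness, the enumeration asked for in the first minor comment could look like the sketch below. Only the three tool names (PZ, WS, CR) and the existence of a terminal answer come from the paper; every argument schema is an illustrative guess.

```python
# Hypothetical enumeration of the unified action space. The three tool names
# come from the paper; the argument schemas below are illustrative guesses.
from dataclasses import dataclass
from enum import Enum

class ActionType(Enum):
    ZOOM = "perceptive_zoomer"          # PZ: crop and magnify a local region
    SEARCH = "web_searcher"             # WS: fetch external domain knowledge
    RETRIEVE = "comparative_retriever"  # CR: fetch a normal reference exemplar
    ANSWER = "final_answer"             # terminate with a classification

@dataclass
class Action:
    type: ActionType
    args: dict
    # Guessed argument shapes:
    #   ZOOM:     {"bbox": (x0, y0, x1, y1)}
    #   SEARCH:   {"query": "typical solder-joint defect appearance"}
    #   RETRIEVE: {"category": "screw", "k": 1}
    #   ANSWER:   {"anomaly_present": True, "rationale": "..."}
```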

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger empirical isolation of the agentic RL component. We address each point below and commit to targeted revisions that will clarify the source of the reported gains without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The 5.92% MMAD accuracy gain is reported without baselines, the number of runs, statistical tests, an ablation isolating the RL stage, or a comparison of multi-round trajectories before versus after the agentic RL phase, preventing evaluation of whether the claimed improvement stems from learned long-horizon policies.

    Authors: We thank the referee for this observation. The 5.92% figure represents the improvement of the full AgentIAD model over the prior SOTA using the identical backbone, as shown in the main results table of the Experiments section. To directly address the request for isolation of the RL stage, we will add a dedicated ablation table in the revised manuscript comparing the tool-aware SFT checkpoint against the full SFT+RL model. This table will report mean accuracy and standard deviation across five independent runs, together with paired t-test p-values for statistical significance. We will also include side-by-side qualitative examples of multi-round action trajectories from the SFT-only and RL-trained models to illustrate policy evolution, such as increased invocation of the Perceptive Zoomer on subtle defects. revision: yes

  2. Referee: [Training Strategy] The claim that agentic RL learns effective long-horizon policies under sparse supervision is load-bearing for attributing gains to the agentic framework, yet no verification is supplied that the RL stage yields measurably different action sequences, successful credit assignment, or policy divergence from the tool-aware SFT checkpoint.

    Authors: We agree that explicit verification of behavioral changes induced by the RL stage is necessary to substantiate the attribution of gains to long-horizon policy learning. The current manuscript provides only qualitative reasoning-chain examples in Section 4.4. In the revision we will add quantitative metrics in a new subsection of Training Strategy: (i) shifts in average episode length and action-type distributions before versus after RL, (ii) policy divergence measured by average KL divergence between the SFT and RL policies evaluated on held-out states, and (iii) credit-assignment diagnostics by tracing reward propagation along sampled trajectories. These additions will allow readers to assess whether the RL stage produces measurably distinct long-horizon behavior. revision: yes
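
The diagnostics promised in both responses above are standard to compute. As an illustrative sketch on placeholder data, the five-run comparison with a paired t-test and the average per-state KL divergence between the SFT and RL action distributions might look like this:

```python
# Sketch of the promised diagnostics; numpy/scipy calls are real, but the
# accuracy values and policy arrays are placeholders, not reported results.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# (i) Accuracy of five seed-matched runs; real values would come from the
# promised ablation of SFT-only versus SFT+RL.
acc_sft = rng.normal(0.80, 0.01, size=5)  # placeholder
acc_rl = rng.normal(0.85, 0.01, size=5)   # placeholder
print(f"SFT only: {acc_sft.mean():.3f} ± {acc_sft.std(ddof=1):.3f}")
print(f"SFT + RL: {acc_rl.mean():.3f} ± {acc_rl.std(ddof=1):.3f}")
t_stat, p_val = ttest_rel(acc_rl, acc_sft)  # paired across shared seeds
print(f"paired t-test: t = {t_stat:.2f}, p = {p_val:.4f}")

# (ii) Policy divergence on held-out states: KL(pi_RL || pi_SFT) averaged
# over states, where each row is a distribution over action types.
def mean_kl(p_rl: np.ndarray, p_sft: np.ndarray, eps: float = 1e-12) -> float:
    """p_rl, p_sft: (n_states, n_actions) arrays with rows summing to 1."""
    kl_per_state = np.sum(p_rl * (np.log(p_rl + eps) - np.log(p_sft + eps)), axis=1)
    return float(kl_per_state.mean())
```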

Circularity Check

0 steps flagged

No derivation chain present; empirical claim only

full rationale

The paper describes an agentic framework and two-stage training procedure but contains no equations, first-principles derivations, or predictions that reduce to fitted inputs or self-citations by construction. The reported 5.92% accuracy gain is framed strictly as an empirical comparison against external prior methods on the MMAD benchmark, with no load-bearing mathematical step that collapses to the paper's own definitions or prior self-citations. This is the normal case for an applied systems paper whose central claim rests on experimental results rather than analytic reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5532 in / 977 out tokens · 32227 ms · 2026-05-16T21:56:44.240098+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 10 internal anchors

  1. [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  2. [2] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
  3. [3] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization. International Journal of Computer Vision, 130(4):947–969, 2022.
  4. [4] Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi. AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection. In European Conference on Computer Vision, pages 55–72. Springer, 2024.
  5. [5] Yuhao Chao, Jie Liu, Jie Tang, and Gangshan Wu. AnomalyR1: A GRPO-based end-to-end MLLM for industrial anomaly detection. arXiv preprint arXiv:2504.11914, 2025.
  6. [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  7. [7] Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie. Offset: Segmentation-based focus shift revision for composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 6113–6122, 2025.
  8. [8] Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, and Weili Guan. HUD: Hierarchical uncertainty-aware disambiguation network for composed video retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 6143–6152, 2025.
  9. [9] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. PaDiM: A patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
  10. [10] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019.
  11. [11] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. AnomalyGPT: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1932–1940, 2024.
  12. [12] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. UniVAD: A training-free unified model for few-shot visual anomaly detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15194–15203, 2025.
  13. [13] Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, and Weiqiang Wang. EMIT: Enhancing MLLMs for industrial anomaly detection via difficulty-aware GRPO. arXiv preprint arXiv:2507.21619, 2025.
  14. [14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  15. [15] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025.
  16. [16] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
  17. [17] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  18. [18] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
  19. [19] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.
  20. [20] Dongwei Ji, Bingzhang Hu, and Yi Zhou. AutoIAD: Manager-driven multi-agent collaboration for automated industrial anomaly detection. arXiv preprint arXiv:2508.05503, 2025.
  21. [21] Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. arXiv preprint arXiv:2410.09453, 2024.
  22. [22] Er Jin, Qihui Feng, Yongli Mou, Gerhard Lakemeyer, Stefan Decker, Oliver Simons, and Johannes Stegmaier. LogicAD: Explainable anomaly detection via VLM-based text feature extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4129–4137, 2025.
  23. [23] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. CutPaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.
  24. [24] Weijia Li, Guanglei Chu, Jiong Chen, Guo-Sen Xie, Caifeng Shan, and Fang Zhao. LAD-Reasoner: Tiny multimodal models are good reasoners for logical anomaly detection. arXiv preprint arXiv:2504.12749, 2025.
  25. [25] Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, and Chao Huang. IAD-R1: Reinforcing consistent reasoning in industrial anomaly detection. arXiv preprint arXiv:2508.09178, 2025.
  26. [26] Zixu Li, Zhiwei Chen, Haokun Wen, Zhiheng Fu, Yupeng Hu, and Weili Guan. Encoder: Entity mining and modification relation binding for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5101–5109, 2025.
  27. [27] Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, and Liqiang Nie. FineCIR: Explicit parsing of fine-grained modification semantics for composed image retrieval. https://arxiv.org/abs/2503.21309, 2025.
  28. [28] Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun, Yiting Li, Dacheng Tao, Xun Xu, and Xulei Yang. AD-FM: Multimodal LLMs for anomaly detection via multi-stage reasoning and fine-grained reward optimization. arXiv preprint arXiv:2508.04175, 2025.
  29. [29] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
  30. [30] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
  31. [31] Philipp Liznerski, Lukas Ruff, Robert A Vandermeulen, Billy Joe Franks, Marius Kloft, and Klaus-Robert Müller. Explainable deep one-class classification. arXiv preprint arXiv:2007.01760, 2020.
  32. [32] Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S Kevin Zhou. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4744–4754, 2025.
  33. [33] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14318–14328, 2022.
  34. [34] Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
  35. [35] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. DRAEM: A discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2021.
  36. [36] Peijian Zeng, Feiyan Pang, Zhanbo Wang, and Aimin Yang. LR-IAD: Mask-free industrial anomaly detection with logical reasoning. arXiv preprint arXiv:2504.19524, 2025.
  37. [37] Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, et al. AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework. arXiv preprint arXiv:2510.04206, 2025.
  38. [38] Jian Zhang, Runwei Ding, Miaoju Ban, and Linhui Dai. PKU-GoodsAD: A supermarket goods dataset for unsupervised anomaly detection and segmentation. IEEE Robotics and Automation Letters, 9(3):2008–2015, 2024.
  39. [39] Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, and Yunchao Wei. OmniAD: Detect and understand industrial anomaly via multimodal reasoning. arXiv preprint arXiv:2505.22039, 2025.
  40. [40] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
  41. [41] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961, 2023.
  42. [42] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
  43. [43] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.
