pith. machine review for the scientific record.

arxiv: 2512.13671 · v2 · submitted 2025-12-15 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords industrial anomaly detection · agentic vision-language model · adaptive memory · Perceptive Zoomer · reinforcement learning · MMAD benchmark · multi-round reasoning · anomaly analysis

The pith

An agentic vision-language framework improves industrial anomaly detection by letting the model iteratively zoom in on defects and retrieve comparisons or external knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Industrial anomaly detection struggles with subtle, localized defects that single-pass vision-language models often overlook, and existing methods lack ways to actively gather more evidence during inspection. AgentIAD addresses this by giving the model a unified set of actions to access visual memory for fine-grained zooming and retrieved memory for cross-instance checks or external facts, allowing multi-round reasoning. The system is trained in two stages: first supervised fine-tuning to teach tool use, then reinforcement learning to improve long-horizon decision making under sparse rewards. Experiments on the MMAD benchmark show a 5.92 percent accuracy gain over prior state-of-the-art methods using the same backbone, along with more interpretable outputs.

Core claim

AgentIAD is an agentic vision-language model that progressively inspects industrial images through a unified action space, dynamically calling the Perceptive Zoomer to examine local regions, the Web Searcher for external knowledge, and the Comparative Retriever for cross-instance verification. These behaviors are learned via tool-aware supervised fine-tuning followed by agentic reinforcement learning to cope with sparse supervision.

What carries the argument

The unified action space that lets the agent switch between visual memory via the Perceptive Zoomer and retrieved memory via the Web Searcher and Comparative Retriever, enabling multi-round perception-action reasoning.
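
To make the mechanism concrete, here is a minimal sketch of what such a perception-action loop could look like, assuming a VLM that emits either a tool call or a final answer each round. All class and method names (Evidence, AgentState, decide, force_answer) are hypothetical stand-ins inferred from the paper's description, not the authors' implementation.

```python
# Minimal sketch of a multi-round perception-action loop; all names and
# signatures are hypothetical, inferred from the paper's description.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    kind: str        # "zoom", "search", or "retrieve"
    content: object  # cropped ROI, retrieved snippet, or reference exemplar

@dataclass
class AgentState:
    image: object
    evidence: list = field(default_factory=list)

def inspect(image, vlm, tools, max_rounds=5):
    """Gather evidence round by round until the VLM commits to a verdict."""
    state = AgentState(image=image)
    for _ in range(max_rounds):
        # The VLM reads the image plus the evidence gathered so far and
        # either answers or requests another tool call.
        action = vlm.decide(state)
        if action.name == "answer":
            return action.payload
        tool = tools[action.name]  # "zoom" -> PZ, "search" -> WS, "retrieve" -> CR
        state.evidence.append(Evidence(kind=action.name,
                                       content=tool(image, **action.args)))
    # Round budget exhausted: force a final classification from the evidence.
    return vlm.force_answer(state)
```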

If this is right

  • Classification accuracy rises by 5.92% over prior methods on the MMAD benchmark under identical backbone conditions.
  • Anomaly analysis becomes more reliable and interpretable through explicit evidence-gathering steps.
  • The model can handle subtle defects by collecting complementary visual and external evidence across multiple rounds rather than in one pass.
  • Long-horizon decision policies can be learned effectively even when supervision is sparse by separating tool familiarization from policy refinement.
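
As a sketch of the last point, the two stages could be staged roughly as below: imitation on expert trajectories first, then policy refinement where the only reward is terminal correctness. The function names and the binary reward are assumptions for exposition; the paper's supplementary material describes the RL stage as GRPO-based, which this simplified policy-gradient loop does not reproduce.

```python
# Illustrative two-stage schedule: tool-aware SFT, then agentic RL with a
# sparse terminal reward. Names and the reward shape are assumptions.

def stage1_sft(model, demonstrations, epochs=3):
    """Tool familiarization: imitate expert trajectories so the model
    learns tool syntax and when each tool is typically invoked."""
    for _ in range(epochs):
        for trajectory in demonstrations:      # sequence of (state, action)
            for state, action in trajectory:
                model.step_supervised(state, action)
    return model

def stage2_rl(model, env, episodes=10_000):
    """Policy refinement: the only reward is terminal correctness, so
    credit must flow back through every intermediate tool call."""
    for _ in range(episodes):
        states, actions = [], []
        state, done = env.reset(), False
        while not done:
            action = model.sample_action(state)
            states.append(state)
            actions.append(action)
            state, done = env.step(action)
        reward = 1.0 if env.final_answer_correct() else 0.0  # sparse signal
        model.step_policy_gradient(states, actions, reward)
    return model
```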

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The memory-augmented agent approach could extend to other inspection domains such as medical imaging where fine details and cross-case comparisons matter.
  • If the action space generalizes, similar agentic setups might reduce the need for ever-larger single-pass models in quality-control pipelines.
  • The two-stage training pattern offers a practical template for teaching agents to use external tools when direct rewards are infrequent.

Load-bearing premise

The two-stage training successfully learns effective long-horizon policies under sparse supervision for the unified action space.

What would settle it

If a matched comparison on the MMAD benchmark, using the same backbone, showed no accuracy improvement over the previous state of the art, the performance claim would not hold.

Figures

Figures reproduced from arXiv: 2512.13671 by Junwen Miao, Lida Huang, Penghui Du, Runze He, Yan Wang, Yi Liu, Yingying Fan, Yu Wang.

Figure 1
Figure 1: Motivation. Non-tool MLLMs rely on a single global pass and frequently misclassify subtle defects (left). AgentIAD corrects these failures through tool-driven reasoning (right): the Perceptive Zoomer exposes fine-grained abnormal cues, and the Comparative Retriever verifies them against a normal reference. view at source ↗
Figure 2
Figure 2: Overview of AgentIAD. The agent performs multi-round reasoning through a tool-augmented Chain-of-Thought (CoT). At each step, it may invoke the Perceptive Zoomer (PZ) to inspect local regions or the Comparative Retriever (CR) to query normal exemplars. Training consists of two stages: (a) Perceptive Supervised Fine-Tuning for grounding reasoning with visual actions, and (b) Agentic Reinforcement Learning f… view at source ↗
Figure 3
Figure 3: Inference cases of AgentIAD. Left: a perceptive-trajectory case where PZ zooms into the defect region, enabling correct classification. Right: a comparative-trajectory case where PZ alone is insufficient; the agent invokes CR to compare against a normal exemplar and corrects its decision. view at source ↗
Figure 4
Figure 4: Trajectory construction pipeline. The process includes: (1) data preparation from MMAD, (2) GPT-4o multi-step reasoning (CoT-1/2/3), and (3) trajectory assembly with structured tool calls. Both the perceptive (PZ-only) and comparative (PZ+CR) trajectories are derived from this unified pipeline. Output format: <think> Explain your visual reasoning, considering both the original image and the ROI information… view at source ↗
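
Figure 4's caption gives just enough of the output format to sketch how a single SFT trajectory might be assembled. This is a guess at the schema: only the <think> tag appears in the caption; the <tool_call> and <answer> tags and all field names are invented for illustration.

```python
# Guess at the trajectory-assembly step in Figure 4. The <think> tag appears
# in the caption; <tool_call>, <answer>, and the field names are invented.

def assemble_trajectory(sample, cot_steps, tool_calls):
    """Interleave GPT-4o reasoning steps (CoT-1/2/3) with structured tool
    calls to form one SFT target string."""
    turns = []
    for cot, call in zip(cot_steps, tool_calls + [None]):
        turns.append(f"<think>{cot}</think>")
        if call is not None:  # the final reasoning step ends in an answer
            turns.append(f"<tool_call>{call}</tool_call>")
    turns.append(f"<answer>{sample['label']}</answer>")
    return {"image": sample["image_path"], "target": "\n".join(turns)}

# A perceptive (PZ-only) trajectory carries a single zoom call; a comparative
# (PZ+CR) trajectory adds a retrieval call before the final answer.
```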
read the original abstract

Industrial anomaly detection (IAD) is challenging due to the subtle and highly localized nature of many defects, which single-pass vision–language models (VLMs) often fail to capture. Moreover, existing approaches lack mechanisms to actively acquire complementary evidence during inference. We propose AgentIAD, an agentic vision–language framework that enables iterative industrial inspection through a unified action space. The agent dynamically accesses two forms of memory during inspection: visual memory via the Perceptive Zoomer (PZ) for fine-grained local analysis, and retrieved memory via the Web Searcher (WS) and Comparative Retriever (CR) for external knowledge acquisition and cross-instance verification. This design allows the model to progressively gather evidence through multi-round perception–action reasoning. To effectively learn such policies under sparse supervision, AgentIAD adopts a two-stage training strategy: tool-aware supervised fine-tuning first initializes structured reasoning and memory-access behaviors, followed by agentic reinforcement learning to refine long-horizon decision policies. Extensive experiments show that, under the same backbone, AgentIAD improves classification accuracy by 5.92% over the previous state-of-the-art method on the MMAD benchmark while providing more reliable and interpretable anomaly analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AgentIAD, an agentic vision-language framework for industrial anomaly detection that performs iterative inspection via a unified action space. It augments the model with visual memory through the Perceptive Zoomer and retrieved memory through the Web Searcher and Comparative Retriever, enabling multi-round perception-action reasoning. Training proceeds in two stages: tool-aware supervised fine-tuning to initialize structured behaviors, followed by agentic reinforcement learning to optimize long-horizon policies under sparse rewards. The central empirical claim is a 5.92% classification accuracy improvement over prior state-of-the-art on the MMAD benchmark when using the same backbone, accompanied by assertions of improved reliability and interpretability.

Significance. If the reported gains are shown to arise from learned iterative reasoning rather than the base tools or SFT initialization alone, the work would meaningfully extend anomaly detection beyond single-pass VLMs by demonstrating active evidence gathering for subtle, localized defects. The two-stage training recipe and memory-augmented action space could serve as a template for other sparse-reward vision tasks, provided the RL component demonstrably alters policy behavior.

major comments (2)
  1. [Abstract / Experiments] The 5.92% MMAD accuracy gain is reported without baselines, the number of runs, statistical tests, an ablation isolating the RL stage, or a comparison of multi-round trajectories before versus after the agentic RL phase, preventing evaluation of whether the claimed improvement stems from learned long-horizon policies.
  2. [Training Strategy] The claim that agentic RL learns effective long-horizon policies under sparse supervision is load-bearing for attributing gains to the agentic framework, yet no verification is supplied that the RL stage yields measurably different action sequences, successful credit assignment, or policy divergence from the tool-aware SFT checkpoint.
minor comments (2)
  1. [Method] The unified action space would benefit from an explicit enumeration or pseudocode listing of available actions and their arguments to clarify how perception, zoom, search, and retrieval are interleaved (a hedged sketch follows this list).
  2. [Method] Notation for the two memory modules (PZ, WS, CR) is introduced without a consolidated table of their inputs, outputs, and integration points into the agent loop.
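
For concreteness, the enumeration asked for in the first minor comment could look like the sketch below. Only the three tool names (PZ, WS, CR) and the existence of a terminal answer come from the paper; every argument schema is an illustrative guess.

```python
# Hypothetical enumeration of the unified action space. The three tool names
# come from the paper; the argument schemas below are illustrative guesses.
from dataclasses import dataclass
from enum import Enum

class ActionType(Enum):
    ZOOM = "perceptive_zoomer"          # PZ: crop and magnify a local region
    SEARCH = "web_searcher"             # WS: fetch external domain knowledge
    RETRIEVE = "comparative_retriever"  # CR: fetch a normal reference exemplar
    ANSWER = "final_answer"             # terminate with a classification

@dataclass
class Action:
    type: ActionType
    args: dict
    # Guessed argument shapes:
    #   ZOOM:     {"bbox": (x0, y0, x1, y1)}
    #   SEARCH:   {"query": "typical solder-joint defect appearance"}
    #   RETRIEVE: {"category": "screw", "k": 1}
    #   ANSWER:   {"anomaly_present": True, "rationale": "..."}
```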

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger empirical isolation of the agentic RL component. We address each point below and commit to targeted revisions that will clarify the source of the reported gains without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The 5.92% MMAD accuracy gain is reported without baselines, the number of runs, statistical tests, an ablation isolating the RL stage, or a comparison of multi-round trajectories before versus after the agentic RL phase, preventing evaluation of whether the claimed improvement stems from learned long-horizon policies.

    Authors: We thank the referee for this observation. The 5.92% figure represents the improvement of the full AgentIAD model over the prior SOTA using the identical backbone, as shown in the main results table of the Experiments section. To directly address the request for isolation of the RL stage, we will add a dedicated ablation table in the revised manuscript comparing the tool-aware SFT checkpoint against the full SFT+RL model. This table will report mean accuracy and standard deviation across five independent runs, together with paired t-test p-values for statistical significance. We will also include side-by-side qualitative examples of multi-round action trajectories from the SFT-only and RL-trained models to illustrate policy evolution, such as increased invocation of the Perceptive Zoomer on subtle defects. revision: yes

  2. Referee: [Training Strategy] The claim that agentic RL learns effective long-horizon policies under sparse supervision is load-bearing for attributing gains to the agentic framework, yet no verification is supplied that the RL stage yields measurably different action sequences, successful credit assignment, or policy divergence from the tool-aware SFT checkpoint.

    Authors: We agree that explicit verification of behavioral changes induced by the RL stage is necessary to substantiate the attribution of gains to long-horizon policy learning. The current manuscript provides only qualitative reasoning-chain examples in Section 4.4. In the revision we will add quantitative metrics in a new subsection of Training Strategy: (i) shifts in average episode length and action-type distributions before versus after RL, (ii) policy divergence measured by average KL divergence between the SFT and RL policies evaluated on held-out states, and (iii) credit-assignment diagnostics by tracing reward propagation along sampled trajectories. These additions will allow readers to assess whether the RL stage produces measurably distinct long-horizon behavior. revision: yes
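
The diagnostics promised in both responses above are standard to compute. As an illustrative sketch on placeholder data, the five-run comparison with a paired t-test and the average per-state KL divergence between the SFT and RL action distributions might look like this:

```python
# Sketch of the promised diagnostics; numpy/scipy calls are real, but the
# accuracy values and policy arrays are placeholders, not reported results.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# (i) Accuracy of five seed-matched runs; real values would come from the
# promised ablation of SFT-only versus SFT+RL.
acc_sft = rng.normal(0.80, 0.01, size=5)  # placeholder
acc_rl = rng.normal(0.85, 0.01, size=5)   # placeholder
print(f"SFT only: {acc_sft.mean():.3f} ± {acc_sft.std(ddof=1):.3f}")
print(f"SFT + RL: {acc_rl.mean():.3f} ± {acc_rl.std(ddof=1):.3f}")
t_stat, p_val = ttest_rel(acc_rl, acc_sft)  # paired across shared seeds
print(f"paired t-test: t = {t_stat:.2f}, p = {p_val:.4f}")

# (ii) Policy divergence on held-out states: KL(pi_RL || pi_SFT) averaged
# over states, where each row is a distribution over action types.
def mean_kl(p_rl: np.ndarray, p_sft: np.ndarray, eps: float = 1e-12) -> float:
    """p_rl, p_sft: (n_states, n_actions) arrays with rows summing to 1."""
    kl_per_state = np.sum(p_rl * (np.log(p_rl + eps) - np.log(p_sft + eps)), axis=1)
    return float(kl_per_state.mean())
```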

Circularity Check

0 steps flagged

No derivation chain present; empirical claim only

full rationale

The paper describes an agentic framework and two-stage training procedure but contains no equations, first-principles derivations, or predictions that reduce to fitted inputs or self-citations by construction. The reported 5.92% accuracy gain is framed strictly as an empirical comparison against external prior methods on the MMAD benchmark, with no load-bearing mathematical step that collapses to the paper's own definitions or prior self-citations. This is the normal case for an applied systems paper whose central claim rests on experimental results rather than analytic reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5532 in / 977 out tokens · 32227 ms · 2026-05-16T21:56:44.240098+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 10 internal anchors

  1. [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  2. [2] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
  3. [3] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization. International Journal of Computer Vision, 130(4):947–969, 2022.
  4. [4] Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi. AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection. In European Conference on Computer Vision, pages 55–72. Springer, 2024.
  5. [5] Yuhao Chao, Jie Liu, Jie Tang, and Gangshan Wu. AnomalyR1: A GRPO-based end-to-end MLLM for industrial anomaly detection. arXiv preprint arXiv:2504.11914, 2025.
  6. [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  7. [7] Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie. Offset: Segmentation-based focus shift revision for composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 6113–6122, 2025.
  8. [8] Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, and Weili Guan. HUD: Hierarchical uncertainty-aware disambiguation network for composed video retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 6143–6152, 2025.
  9. [9] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. PaDiM: A patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
  10. [10] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019.
  11. [11] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. AnomalyGPT: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1932–1940, 2024.
  12. [12] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. UniVAD: A training-free unified model for few-shot visual anomaly detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15194–15203, 2025.
  13. [13] Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, and Weiqiang Wang. EMIT: Enhancing MLLMs for industrial anomaly detection via difficulty-aware GRPO. arXiv preprint arXiv:2507.21619, 2025.
  14. [14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  15. [15] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025.
  16. [16] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
  17. [17] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  18. [18] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
  19. [19] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.
  20. [20] Dongwei Ji, Bingzhang Hu, and Yi Zhou. AutoIAD: Manager-driven multi-agent collaboration for automated industrial anomaly detection. arXiv preprint arXiv:2508.05503, 2025.
  21. [21] Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. arXiv preprint arXiv:2410.09453, 2024.
  22. [22] Er Jin, Qihui Feng, Yongli Mou, Gerhard Lakemeyer, Stefan Decker, Oliver Simons, and Johannes Stegmaier. LogicAD: Explainable anomaly detection via VLM-based text feature extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4129–4137, 2025.
  23. [23] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. CutPaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.
  24. [24] Weijia Li, Guanglei Chu, Jiong Chen, Guo-Sen Xie, Caifeng Shan, and Fang Zhao. LAD-Reasoner: Tiny multimodal models are good reasoners for logical anomaly detection. arXiv preprint arXiv:2504.12749, 2025.
  25. [25] Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, and Chao Huang. IAD-R1: Reinforcing consistent reasoning in industrial anomaly detection. arXiv preprint arXiv:2508.09178, 2025.
  26. [26] Zixu Li, Zhiwei Chen, Haokun Wen, Zhiheng Fu, Yupeng Hu, and Weili Guan. Encoder: Entity mining and modification relation binding for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5101–5109, 2025.
  27. [27] Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, and Liqiang Nie. FineCIR: Explicit parsing of fine-grained modification semantics for composed image retrieval. https://arxiv.org/abs/2503.21309, 2025.
  28. [28] Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun, Yiting Li, Dacheng Tao, Xun Xu, and Xulei Yang. AD-FM: Multimodal LLMs for anomaly detection via multi-stage reasoning and fine-grained reward optimization. arXiv preprint arXiv:2508.04175, 2025.
  29. [29] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
  30. [30] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
  31. [31] Philipp Liznerski, Lukas Ruff, Robert A Vandermeulen, Billy Joe Franks, Marius Kloft, and Klaus-Robert Müller. Explainable deep one-class classification. arXiv preprint arXiv:2007.01760, 2020.
  32. [32] Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S Kevin Zhou. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4744–4754, 2025.
  33. [33] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14318–14328, 2022.
  34. [34] Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
  35. [35] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. DRAEM: A discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2021.
  36. [36] Peijian Zeng, Feiyan Pang, Zhanbo Wang, and Aimin Yang. LR-IAD: Mask-free industrial anomaly detection with logical reasoning. arXiv preprint arXiv:2504.19524, 2025.
  37. [37] Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, et al. AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework. arXiv preprint arXiv:2510.04206, 2025.
  38. [38] Jian Zhang, Runwei Ding, Miaoju Ban, and Linhui Dai. PKU-GoodsAD: A supermarket goods dataset for unsupervised anomaly detection and segmentation. IEEE Robotics and Automation Letters, 9(3):2008–2015, 2024.
  39. [39] Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, and Yunchao Wei. OmniAD: Detect and understand industrial anomaly via multimodal reasoning. arXiv preprint arXiv:2505.22039, 2025.
  40. [40] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
  41. [41] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961, 2023.
  42. [42] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
  43. [43] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.
