AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation
Recognition: 1 theorem link
Pith reviewed 2026-05-16 21:56 UTC · model grok-4.3
The pith
An agentic vision-language framework improves industrial anomaly detection by letting the model iteratively zoom in on defects and retrieve comparisons or external knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentIAD is an agentic vision-language model that progressively inspects industrial images through a unified action space, dynamically calling the Perceptive Zoomer to examine local regions, the Web Searcher for external knowledge, and the Comparative Retriever for cross-instance verification. The policy is learned via tool-aware supervised fine-tuning followed by agentic reinforcement learning to cope with sparse supervision.
What carries the argument
The unified action space lets the agent switch between visual memory (via the Perceptive Zoomer) and retrieved memory (via the Web Searcher and Comparative Retriever), enabling multi-round perception-action reasoning.
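A unified action space of this kind can be sketched as a small perception-action loop. Everything below (the action names, the `Step` record, the stub `execute`) is an invented illustration of the idea, not AgentIAD's actual interface:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    ZOOM = auto()      # Perceptive Zoomer: inspect a local region
    SEARCH = auto()    # Web Searcher: fetch external knowledge
    RETRIEVE = auto()  # Comparative Retriever: fetch reference instances
    ANSWER = auto()    # terminate with a verdict

@dataclass
class Step:
    action: Action
    argument: str      # e.g. a bounding box, a query, or the final label

def execute(step, image):
    """Stand-in tool executor; a real system would call PZ/WS/CR here."""
    return f"{step.action.name.lower()} evidence for {step.argument}"

def inspect(image, policy, max_rounds=6):
    """Multi-round perception-action loop over a growing evidence memory."""
    memory = [("image", image)]
    for _ in range(max_rounds):
        step = policy(memory)            # model proposes the next action
        if step.action is Action.ANSWER:
            return step.argument, memory
        memory.append((step.action.name, execute(step, image)))
    return "normal", memory              # fallback if no verdict is emitted
```

The point of the sketch is that evidence accumulates across rounds, so each action is chosen in the context of everything gathered so far.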
If this is right
- Classification accuracy rises by 5.92% over prior methods on the MMAD benchmark under identical backbone conditions.
- Anomaly analysis becomes more reliable and interpretable through explicit evidence-gathering steps.
- The model can handle subtle defects by collecting complementary visual and external evidence across multiple rounds rather than in one pass.
- Long-horizon decision policies can be learned effectively even when supervision is sparse by separating tool familiarization from policy refinement.
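The separation of tool familiarization from policy refinement in the last point can be made concrete with a toy two-stage trainer: imitation first, then REINFORCE under a sparse terminal reward. The 3-action softmax policy and the reward that fires only on one action are invented stand-ins; AgentIAD itself trains a VLM with agentic RL (GRPO), not this tabular toy:

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = 3  # toy stand-ins for zoom / search / answer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sft(demo_actions, lr=0.5, epochs=200):
    """Stage 1 stand-in: imitate the demos' action frequencies."""
    theta = np.zeros(ACTIONS)
    target = np.bincount(demo_actions, minlength=ACTIONS) / len(demo_actions)
    for _ in range(epochs):
        theta += lr * (target - softmax(theta))  # cross-entropy gradient
    return theta

def rl(theta, lr=0.1, steps=2000):
    """Stage 2 stand-in: REINFORCE under a sparse reward that fires
    only when action 2 (the 'correct verdict') is chosen."""
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choice(ACTIONS, p=p)
        reward = 1.0 if a == 2 else 0.0
        grad = -p
        grad[a] += 1.0                           # d log pi(a) / d theta
        theta = theta + lr * reward * grad
    return theta

demos = np.array([0, 1, 2, 2, 1, 0, 2])  # demos exercise every action
theta = rl(sft(demos))
```

Stage 1 puts probability mass on every tool so that the sparse Stage 2 reward signal has useful behavior to reinforce; starting RL from a uniform or degenerate policy would make the reward far harder to find.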
Where Pith is reading between the lines
- The memory-augmented agent approach could extend to other inspection domains such as medical imaging where fine details and cross-case comparisons matter.
- If the action space generalizes, similar agentic setups might reduce the need for ever-larger single-pass models in quality-control pipelines.
- The two-stage training pattern offers a practical template for teaching agents to use external tools when direct rewards are infrequent.
Load-bearing premise
The two-stage training successfully learns effective long-horizon policies under sparse supervision for the unified action space.
What would settle it
If AgentIAD shows no accuracy improvement over the previous state-of-the-art on the MMAD benchmark when using the same backbone, the performance claim would not hold.
Original abstract
Industrial anomaly detection (IAD) is challenging due to the subtle and highly localized nature of many defects, which single-pass vision-language models (VLMs) often fail to capture. Moreover, existing approaches lack mechanisms to actively acquire complementary evidence during inference. We propose AgentIAD, an agentic vision-language framework that enables iterative industrial inspection through a unified action space. The agent dynamically accesses two forms of memory during inspection: visual memory via the Perceptive Zoomer (PZ) for fine-grained local analysis, and retrieved memory via the Web Searcher (WS) and Comparative Retriever (CR) for external knowledge acquisition and cross-instance verification. This design allows the model to progressively gather evidence through multi-round perception-action reasoning. To effectively learn such policies under sparse supervision, AgentIAD adopts a two-stage training strategy: tool-aware supervised fine-tuning first initializes structured reasoning and memory-access behaviors, followed by agentic reinforcement learning to refine long-horizon decision policies. Extensive experiments show that, under the same backbone, AgentIAD improves classification accuracy by 5.92% over the previous state-of-the-art method on the MMAD benchmark while providing more reliable and interpretable anomaly analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentIAD, an agentic vision-language framework for industrial anomaly detection that performs iterative inspection via a unified action space. It augments the model with visual memory through the Perceptive Zoomer and retrieved memory through the Web Searcher and Comparative Retriever, enabling multi-round perception-action reasoning. Training proceeds in two stages: tool-aware supervised fine-tuning to initialize structured behaviors, followed by agentic reinforcement learning to optimize long-horizon policies under sparse rewards. The central empirical claim is a 5.92% classification accuracy improvement over prior state-of-the-art on the MMAD benchmark when using the same backbone, accompanied by assertions of improved reliability and interpretability.
Significance. If the reported gains are shown to arise from learned iterative reasoning rather than the base tools or SFT initialization alone, the work would meaningfully extend anomaly detection beyond single-pass VLMs by demonstrating active evidence gathering for subtle, localized defects. The two-stage training recipe and memory-augmented action space could serve as a template for other sparse-reward vision tasks, provided the RL component demonstrably alters policy behavior.
Major comments (2)
- [Abstract / Experiments] The 5.92% MMAD accuracy gain is reported without baselines, number of runs, statistical tests, an ablation table isolating the RL stage, or a comparison of multi-round trajectories before versus after the agentic RL phase, which prevents evaluating whether the claimed improvement stems from learned long-horizon policies.
- [Training Strategy] The claim that agentic RL learns effective long-horizon policies under sparse supervision is load-bearing for attributing gains to the agentic framework, yet no verification is supplied that the RL stage produces measurably different action sequences, successful credit assignment, or policy divergence from the tool-aware SFT checkpoint.
Minor comments (2)
- [Method] The unified action space would benefit from an explicit enumeration or pseudocode listing of available actions and their arguments to clarify how perception, zoom, search, and retrieval are interleaved.
- [Method] Notation for the memory modules (PZ, WS, CR) is introduced without a consolidated table of their inputs, outputs, and integration points into the agent loop.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for stronger empirical isolation of the agentic RL component. We address each point below and commit to targeted revisions that will clarify the source of the reported gains without altering the core claims.
Point-by-point responses
- Referee: [Abstract / Experiments] The 5.92% MMAD accuracy gain is reported without baselines, number of runs, statistical tests, an ablation table isolating the RL stage, or a comparison of multi-round trajectories before versus after the agentic RL phase, which prevents evaluating whether the claimed improvement stems from learned long-horizon policies.
Authors: We thank the referee for this observation. The 5.92% figure is the improvement of the full AgentIAD model over the prior state of the art with the same backbone, as shown in the main results table of the Experiments section. To isolate the RL stage, we will add a dedicated ablation table in the revised manuscript comparing the tool-aware SFT checkpoint against the full SFT+RL model, reporting mean accuracy and standard deviation across five independent runs together with paired t-test p-values for statistical significance. We will also include side-by-side qualitative examples of multi-round action trajectories from the SFT-only and RL-trained models to illustrate policy evolution, such as increased invocation of the Perceptive Zoomer on subtle defects. revision: yes
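The promised significance check (mean accuracy over five runs plus a paired t-test) can be sketched directly. The per-run accuracies below are hypothetical placeholders, not results from the paper:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """t statistic and degrees of freedom for a paired t-test."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical per-run MMAD accuracies; the real numbers would come
# from the five independent runs the authors promise.
sft_only = [71.2, 70.8, 71.5, 70.9, 71.1]
sft_rl = [76.9, 77.4, 76.5, 77.1, 77.0]
t_stat, dof = paired_t(sft_rl, sft_only)
```

Pairing by run (same seed, same data split for both models) is what makes five runs enough to detect a gap of this size: the per-pair differences have much lower variance than the raw accuracies.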
- Referee: [Training Strategy] The claim that agentic RL learns effective long-horizon policies under sparse supervision is load-bearing for attributing gains to the agentic framework, yet no verification is supplied that the RL stage produces measurably different action sequences, successful credit assignment, or policy divergence from the tool-aware SFT checkpoint.
Authors: We agree that explicit verification of the behavioral changes induced by the RL stage is needed to attribute gains to long-horizon policy learning. The current manuscript provides only qualitative reasoning-chain examples in Section 4.4. In the revision we will add quantitative metrics in a new subsection of Training Strategy: (i) shifts in average episode length and action-type distributions before versus after RL, (ii) policy divergence measured by the average KL divergence between the SFT and RL policies on held-out states, and (iii) credit-assignment diagnostics obtained by tracing reward propagation along sampled trajectories. These additions will let readers assess whether the RL stage produces measurably distinct long-horizon behavior. revision: yes
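The policy-divergence metric in (ii) amounts to an average per-state KL between the two policies' action distributions. The sketch below uses small per-state action tables as a toy simplification; the actual policies are token-level VLM distributions:

```python
import numpy as np

def mean_policy_kl(p_sft, p_rl, eps=1e-12):
    """Average KL(pi_SFT || pi_RL) over held-out states.

    p_sft, p_rl: arrays of shape (n_states, n_actions) holding each
    policy's action distribution at every held-out state.
    """
    p = np.clip(np.asarray(p_sft, dtype=float), eps, None)
    q = np.clip(np.asarray(p_rl, dtype=float), eps, None)
    p = p / p.sum(axis=1, keepdims=True)   # renormalize after clipping
    q = q / q.sum(axis=1, keepdims=True)
    kl_per_state = (p * np.log(p / q)).sum(axis=1)
    return float(kl_per_state.mean())
```

A value near zero would indicate the RL stage barely moved the policy (gains would then be hard to attribute to long-horizon learning); a clearly positive value, together with the action-distribution shifts in (i), would support the attribution.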
Circularity Check
No derivation chain present; empirical claim only
Rationale
The paper describes an agentic framework and two-stage training procedure but contains no equations, first-principles derivations, or predictions that reduce to fitted inputs or self-citations by construction. The reported 5.92% accuracy gain is framed strictly as an empirical comparison against external prior methods on the MMAD benchmark, with no load-bearing mathematical step that collapses to the paper's own definitions or prior self-citations. This is the normal case for an applied systems paper whose central claim rests on experimental results rather than analytic reduction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Matched passage: "two-stage training strategy: tool-aware supervised fine-tuning first initializes structured reasoning and memory-access behaviors, followed by agentic reinforcement learning to refine long-horizon decision policies"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.