pith. machine review for the scientific record.

arxiv: 2604.10971 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI

Recognition: unknown

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords general anomaly detection · multimodal large language models · MMR-AD dataset · Anomaly-R1 · industrial anomaly detection · chain-of-thought reasoning · reinforcement learning · multimodal benchmark

The pith

A new multimodal dataset shows that current generalist MLLMs fall short of industrial standards for anomaly detection, while a reasoning-based model trained on it improves both detection and localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMR-AD, a large-scale multimodal dataset of image-text pairs with chain-of-thought annotations built specifically for general anomaly detection. General anomaly detection seeks models that can identify defects in entirely new classes without any retraining or fine-tuning on target data. Existing multimodal large language models, pretrained on web-scale data, lack the right examples and reasoning patterns for industrial scenarios, leading to poor results on the benchmark. The authors also present Anomaly-R1, which learns from the dataset's reasoning traces and applies reinforcement learning to raise performance on anomaly detection and localization tasks. This work matters because it tests whether flexible language-vision models can eventually replace narrow, class-specific systems in real factory inspection pipelines.
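
To make the setup concrete: in this zero-shot regime an MLLM is typically shown a normal reference image alongside the test image and asked to reason before answering, as the paper's qualitative figures illustrate. A minimal sketch of such a query, assuming a generic chat-style multimodal interface; the message schema and the `mllm_generate` stand-in are illustrative, not the paper's actual API:

```python
# Hedged sketch of a zero-shot anomaly query against a chat-style MLLM.
# The message schema and `mllm_generate` are illustrative stand-ins.

def build_anomaly_query(reference_path: str, test_path: str) -> list[dict]:
    """Pair a normal reference image with a test image and ask for a verdict."""
    instruction = (
        "The first image shows a normal sample; the second is the test image. "
        "Reason step by step inside <think>...</think>, then answer inside "
        "<answer>...</answer> with Yes/No and, if anomalous, a bbox_2d."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "path": reference_path},
            {"type": "image", "path": test_path},
            {"type": "text", "text": instruction},
        ],
    }]

# reply = mllm_generate(build_anomaly_query("normal.png", "test.png"))
```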

Core claim

MMR-AD supplies a comprehensive training and evaluation benchmark of multimodal data tailored to anomaly detection, including chain-of-thought reasoning examples that address gaps between web pretraining and industrial needs. Tests on MMR-AD show that current state-of-the-art generalist MLLMs still perform far below industrial requirements for general anomaly detection. Anomaly-R1, trained on the dataset's CoT data and further improved via reinforcement learning, delivers substantial gains in both anomaly detection and localization over those generalist baselines.
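
The abstract names CoT supervision and reinforcement learning but not the reward. One plausible verifiable reward for the RL stage, sketched below under the assumption of a GRPO-style setup, combines verdict correctness with localization IoU; the weights and the parsed-answer schema are invented for illustration:

```python
# Assumed reward shape: 0 for a wrong verdict, 1 for a correct "normal" call,
# and 0.5 + 0.5 * IoU when an anomaly is both detected and localized.

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def localization_reward(pred_anom, pred_box, gt_anom, gt_box):
    if pred_anom != gt_anom:
        return 0.0                         # wrong verdict: no reward
    if not gt_anom:
        return 1.0                         # correct "normal" verdict
    if pred_box is None:
        return 0.5                         # detected but not localized
    return 0.5 + 0.5 * box_iou(pred_box, gt_box)
```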

What carries the argument

The MMR-AD dataset, which provides multimodal image-text pairs with anomaly-specific chain-of-thought annotations to enable post-training and benchmarking of MLLMs for general anomaly detection.

If this is right

  • Generalist MLLMs can reach usable levels of general anomaly detection once supplied with targeted multimodal reasoning data.
  • Chain-of-thought supervision plus reinforcement learning improves both detection accuracy and localization precision on novel classes.
  • MMR-AD provides a reusable benchmark that allows direct comparison of future MLLM approaches to general anomaly detection.
  • Industrial inspection pipelines could shift toward models that handle new product types without per-class retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dataset construction methods used here could be adapted to create similar benchmarks for other specialized visual reasoning problems such as medical anomaly detection.
  • The performance gap likely arises from mismatches in low-level visual features or reasoning style that could be targeted during earlier pretraining stages.
  • Real-world deployment would still need separate checks on live production lines to confirm the improvements survive lighting changes, camera angles, and sensor noise.
  • Combining the reasoning approach with existing single-class anomaly detectors might create practical hybrid systems for factories with mixed needs.

Load-bearing premise

The scenarios, classes, and annotations inside MMR-AD capture enough of the real diversity and difficulty of industrial anomaly detection that measured gaps and improvements will hold outside the specific test splits.

What would settle it

Evaluating both a generalist MLLM and Anomaly-R1 on a new collection of real factory images from product categories absent from MMR-AD and finding no meaningful difference in detection or localization metrics would falsify the reported improvements.
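
In practice this test reduces to computing the same two metrics on the held-out classes. A minimal sketch, assuming per-image anomaly scores and predicted boxes are available; scikit-learn's `roc_auc_score` is the only external dependency, and `box_iou` refers to the helper in the reward sketch above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
# box_iou: the (x1, y1, x2, y2) IoU helper from the reward sketch above.

def evaluate_heldout(scores, labels, pred_boxes, gt_boxes):
    """labels: 1 = anomalous; scores: anomaly confidence per image."""
    auroc = roc_auc_score(labels, scores)
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)
            if p is not None and g is not None]
    return {"auroc": float(auroc), "mean_iou": float(np.mean(ious)) if ious else 0.0}
```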

Figures

Figures reproduced from arXiv: 2604.10971 by Chao Shi, Chongyang Zhang, Jiayang Song, Xincheng Yao, Zefeng Qian.

Figure 1. (a) Overview of our MMR-AD dataset. (b) Visualiza…
Figure 2. The illustration of the text generation pipeline. We…
Figure 3. Data examples from our MMR-AD dataset and comparison with MMAD…
Figure 5. Green marks correct reasoning and correct bbox coordinates, red marks wrong reasoning and imprecise bbox coordinates. Both GPT-4o and Qwen2.5-VL-72B show hallucination, thinking that there is the copper protrusion defect without observing the severe cut defect. Panels: Normal Reference Image · Input Image · GPT-4o · Qwen2.5-VL-72B · Ours.
Figure 6. Green marks correct reasoning and correct bbox coordinates, red marks wrong reasoning and imprecise bbox coordinates. Although both GPT-4o and Qwen2.5-VL-72B generate correct reasoning, the anomaly localization results are still not precise enough. Panels: Normal Reference Image · Input Image · Ours Predicted Bbox.
Figure 7. Failure case. Red marks wrong reasoning. Panels: Normal Reference Image · Input Image · Ours Predicted Bbox.
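
The qualitative figures render each model reply as a reasoning trace followed by a structured verdict, in the form <think>…</think><answer>Yes.{'bbox_2d': [x1, y1, x2, y2], 'label': '…'}</answer>. A minimal sketch of a parser for that grammar, assuming the excerpted format is exact; the names and regex are illustrative, not from the paper's code:

```python
import ast
import re

# Matches the reply grammar excerpted from the paper's qualitative figures:
# <answer>Yes.{'bbox_2d': [x1, y1, x2, y2], 'label': '...'}</answer>
ANSWER_RE = re.compile(r"<answer>\s*(Yes|No)\.?\s*(\{.*?\})?\s*</answer>", re.DOTALL)

def parse_answer(reply: str):
    """Return (is_anomalous, bbox, label); None if the reply is malformed."""
    m = ANSWER_RE.search(reply)
    if m is None:
        return None
    is_anomalous = m.group(1) == "Yes"
    bbox, label = None, None
    if m.group(2):
        payload = ast.literal_eval(m.group(2))  # the dict uses single quotes
        bbox, label = payload.get("bbox_2d"), payload.get("label")
    return is_anomalous, bbox, label
```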
original abstract

In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM's general AD ability remains underexplored due to: (1) MLLMs are pretrained on amounts of data sourced from the Web, these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MMR-AD, a large-scale multimodal dataset for training and benchmarking MLLMs on general anomaly detection (GAD) tasks in industrial settings. It argues that web-pretrained SOTA generalist MLLMs underperform due to domain gaps and lack of AD-specific pretraining data, and proposes Anomaly-R1, a CoT- and RL-enhanced reasoning model that achieves notable gains in detection and localization over baselines.

Significance. If the dataset construction and reported gains hold under scrutiny, the work could meaningfully advance MLLM-based GAD research by supplying a dedicated multimodal benchmark with CoT annotations that targets the web-to-industrial domain shift. The emphasis on generalist models without per-class retraining and the RL-enhanced baseline represent constructive steps toward more capable anomaly reasoning systems.

major comments (2)
  1. [Abstract] The central claim that current SOTA generalist MLLMs 'still falls far behind the industrial requirements' is load-bearing for the paper's motivation, yet the abstract supplies no quantitative metrics (e.g., AUROC, localization accuracy), no table references, and no explicit comparison to industrial thresholds; the experiments section must furnish these numbers plus evidence that MMR-AD's imaging conditions and anomaly types lie outside web pretraining distributions.
  2. [Abstract] The assertion of 'remarkable improvements' for Anomaly-R1 via CoT data and reinforcement learning requires demonstration that gains arise from improved general reasoning rather than distribution-specific adaptation; ablations isolating the CoT and RL components, plus evaluation on external industrial AD datasets beyond MMR-AD test splits, are needed to substantiate generalization.
minor comments (2)
  1. [Abstract] The phrase 'amounts of data' in the abstract should read 'large amounts of data' for grammatical precision.
  2. [Abstract] The abstract would benefit from a concise statement of MMR-AD scale (image count, class count, anomaly categories) to better support the 'comprehensive benchmark' description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the motivation and evidence in the abstract. We address each major comment below and will incorporate revisions to improve clarity and substantiation of our claims.

point-by-point responses
  1. Referee: [Abstract] The central claim that current SOTA generalist MLLMs 'still falls far behind the industrial requirements' is load-bearing for the paper's motivation, yet the abstract supplies no quantitative metrics (e.g., AUROC, localization accuracy), no table references, and no explicit comparison to industrial thresholds; the experiments section must furnish these numbers plus evidence that MMR-AD's imaging conditions and anomaly types lie outside web pretraining distributions.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version, we will include key metrics (e.g., AUROC and localization accuracy for SOTA generalist MLLMs on MMR-AD) with direct references to the experiments tables. We will also add a concise comparison to typical industrial thresholds (such as AUROC > 0.95 for practical deployment). For the domain gap, we will expand the introduction and dataset sections with concrete details on MMR-AD's controlled industrial imaging conditions, lighting setups, and fine-grained anomaly types that are underrepresented in web-scale pretraining corpora. revision: yes

  2. Referee: [Abstract] The assertion of 'remarkable improvements' for Anomaly-R1 via CoT data and reinforcement learning requires demonstration that gains arise from improved general reasoning rather than distribution-specific adaptation; ablations isolating the CoT and RL components, plus evaluation on external industrial AD datasets beyond MMR-AD test splits, are needed to substantiate generalization.

    Authors: We recognize the need to isolate the sources of improvement. We will add ablation experiments that separately remove or vary the CoT annotations and the RL stage to quantify their individual contributions to reasoning quality. To address generalization, we will report Anomaly-R1 results on at least one external industrial benchmark (e.g., MVTec AD) in addition to MMR-AD, allowing direct comparison of performance outside the training distribution. revision: yes
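
A sketch of the 2×2 ablation grid this promises, with CoT supervision and the RL stage toggled independently and both internal and external test sets scored; `train`, `score`, and the dataset handles are placeholders, not the authors' code:

```python
# Placeholder ablation harness for the rebuttal's promised experiments.
# `train` and `score` stand in for the authors' training and evaluation code.

from itertools import product

def run_ablations(train, score):
    results = {}
    for use_cot, use_rl in product([False, True], repeat=2):
        model = train(cot_supervision=use_cot, rl_stage=use_rl)
        results[(use_cot, use_rl)] = {
            "MMR-AD test": score(model, "mmr_ad_test"),
            "MVTec AD": score(model, "mvtec_ad"),  # external benchmark per rebuttal
        }
    return results
```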

Circularity Check

0 steps flagged

No circularity: empirical dataset introduction with standard baseline evaluation

full rationale

The paper presents MMR-AD as a new multimodal dataset for general anomaly detection benchmarking and introduces Anomaly-R1 as a baseline trained via CoT and RL on that dataset. No equations, parameter fits, or derivations are described; performance claims consist of direct empirical comparisons between zero-shot generalist MLLMs and the fine-tuned baseline on the authors' own splits. This structure is self-contained and does not invoke self-citations, uniqueness theorems, or ansatzes that reduce the central results to their own inputs by construction. The reported gaps and improvements are benchmark-specific measurements rather than tautological predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on the assumption that a new multimodal dataset can close the domain gap for MLLMs in anomaly detection and that CoT-plus-RL training yields generalizable improvements; no free parameters are introduced, and the only new entities are the dataset and baseline model themselves.

axioms (2)
  • domain assumption: MLLMs pretrained on web data have significant gaps with industrial AD scenarios.
    Stated in abstract as motivation for the dataset.
  • domain assumption: Current mainstream AD datasets are unsuitable for post-training MLLMs.
    Stated directly in abstract.
invented entities (2)
  • MMR-AD dataset (no independent evidence)
    purpose: Benchmark for training and evaluating MLLM-based general anomaly detection
    Newly introduced collection of multimodal data for AD tasks.
  • Anomaly-R1 model (no independent evidence)
    purpose: Reasoning-based AD model trained on CoT data with reinforcement learning
    Proposed baseline that learns from the new dataset.

pith-pipeline@v0.9.0 · 5601 in / 1430 out tokens · 52310 ms · 2026-05-10T15:48:02.755428+00:00 · methodology

