JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

Hogun Park; Hyunju Kang; Jaewon Kim; Woohyun Lee

arXiv: 2605.20284 · v1 · pith:N723EBIWnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.LG

Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park

Pith reviewed 2026-05-21 07:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: industrial anomaly detection, multimodal reasoning, domain knowledge, defect segmentation, supervised fine-tuning, reinforcement learning, visual comparison, anomaly QA

The pith

JUDO improves anomaly question answering by juxtaposing defect images with normal references and training models with domain-specific fine-tuning plus reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents JUDO as a multimodal framework that adds domain knowledge to large vision-language models for industrial defect understanding. It segments anomalies by placing query images next to normal ones for direct visual comparison and then applies supervised fine-tuning followed by reinforcement learning with custom rewards to steer the model toward domain-appropriate reasoning. A sympathetic reader cares because general multimodal models frequently fail at precise, context-aware answers in specialized manufacturing settings where small visual differences matter. If the approach holds, it shows a practical way to adapt existing models without starting from scratch, leading to more reliable automated inspection that mimics how human experts compare samples.

Core claim

JUDO is a Juxtaposed Domain-Oriented Multimodal Reasoner that segments defect regions through visual comparison of query images against normal images as domain context and injects domain knowledge via supervised fine-tuning before guiding reasoning with reinforcement learning using tailored rewards, yielding higher performance on the MMAD benchmark than models such as Qwen2.5-VL-7B and GPT-4o.
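
As an illustration of the juxtaposition idea in the core claim, the sketch below places a normal reference beside a query image so the two can be inspected together. This is only one possible reading: the paper may pass the two images to the model as separate inputs rather than as one concatenated canvas. The names `juxtapose` and `differing_pixels`, the gap width, and the threshold are hypothetical, and the pixel difference is a toy stand-in for the comparison the model is trained to make, not the paper's segmentation method.

```python
from typing import List, Tuple

# A grayscale image as rows of pixel intensities (assumption for illustration).
Image = List[List[int]]


def juxtapose(normal: Image, query: Image, gap: int = 1, fill: int = 0) -> Image:
    """Place the normal reference to the left of the query, separated by a gap."""
    if len(normal) != len(query):
        raise ValueError("images must have the same height")
    spacer = [fill] * gap
    return [n_row + spacer + q_row for n_row, q_row in zip(normal, query)]


def differing_pixels(normal: Image, query: Image, threshold: int = 10) -> List[Tuple[int, int]]:
    """Return (row, col) positions where the query deviates from the reference."""
    return [
        (r, c)
        for r, (n_row, q_row) in enumerate(zip(normal, query))
        for c, (n, q) in enumerate(zip(n_row, q_row))
        if abs(n - q) > threshold
    ]


normal_img = [[100, 100], [100, 100]]
query_img = [[100, 100], [100, 30]]  # one darker pixel plays the role of a defect
side_by_side = juxtapose(normal_img, query_img)
candidates = differing_pixels(normal_img, query_img)
```

With a reference available, the defect shows up as a deviation from a known-good sample instead of having to be recognized in isolation, which is the intuition behind supplying normal images as visual context.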

What carries the argument

Juxtaposition of query images with normal images for fine-grained visual comparative inspection, paired with supervised fine-tuning followed by group relative policy optimization using domain-oriented reward signals.
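
For readers unfamiliar with group relative policy optimization, the sketch below shows the group-relative advantage step that gives the method its name: several responses are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration (implementations differ on these details), and the reward values are made up rather than taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one question, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# A group in which every response scores the same carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Because advantages are relative within a group, the choice of reward function determines which responses are pushed up, which is why the reward design matters for the paper's claim.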

If this is right

Defect segmentation becomes more precise because the model can perform direct side-by-side visual comparison using normal images.
Responses in industrial anomaly QA incorporate more accurate domain context after the supervised fine-tuning stage.
Reinforcement learning with tailored rewards steers the model toward domain-oriented reasoning paths rather than generic ones.
Overall benchmark scores on MMAD rise above those of general-purpose multimodal models such as GPT-4o.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The juxtaposition technique could transfer to other fine-grained visual comparison tasks such as medical image review where normal references are available.
Custom reward signals in the reinforcement stage might reduce the chance of the model inventing nonexistent defects.
Similar domain-injection steps could be tested on smaller models to see whether they close the gap to larger general models without extra scale.

Load-bearing premise

That placing normal reference images next to queries gives enough visual context for correct defect segmentation and that supervised fine-tuning plus reward-guided reinforcement learning will produce effective domain reasoning without adding biases or errors.

What would settle it

Evaluating JUDO on the MMAD benchmark after removing the normal reference images and checking whether accuracy falls below that of Qwen2.5-VL-7B or GPT-4o.

Figures

Figures reproduced from arXiv: 2605.20284 by Hogun Park, Hyunju Kang, Jaewon Kim, Woohyun Lee.

**Figure 2.** Performance across different stages. The effectiveness of JUDO’s learning-based approach becomes evident in the subsequent steps, as shown in the result. The most significant performance leap comes from Stage 2’s Domain Injection (+ GRPO + DomInj), which internalizes domain knowledge through supervised fine-tuning and increases accuracy to 79.82%. This result is substantially higher than the RAG-based met…

**Figure 3.** Response comparison between Base GRPO, Base GRPO + RAG and JUDO. The anoma…

**Figure 4.** Examples of JUDO’s output on the MMAD dataset. The anomalous region in the query…

**Figure 5.** Stage 2 domain Q&A construction pipeline. Illustration of generating domain-specific…

**Figure 6.** Stage 1 training dataset examples.

**Figure 7.** Stage 2 training dataset examples.

**Figure 8.** Examples of JUDO’s output on the MMAD dataset. The anomalous region in the query…

Original abstract

Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JUDO pairs query images with normal references for visual comparison then layers SFT and GRPO to add domain knowledge for industrial anomaly QA, with the main open question being whether the custom rewards actually drive better reasoning.

The main thing to know is that JUDO juxtaposes the query image with a normal reference to let the model do direct visual comparison for defect segmentation, then runs supervised fine-tuning followed by GRPO using tailored rewards to steer outputs toward industrial domain knowledge. The paper reports stronger results on the MMAD benchmark than Qwen2.5-VL-7B and GPT-4o. That combination of visual context plus the two-stage training is the concrete new piece. It takes standard LMM adaptation practices and applies them specifically to the gap where generic models lack manufacturing context, which is a practical target. The work is clear about the problem it is solving and offers a straightforward pipeline that could be tried in similar applied settings.

The soft spot sits in the GRPO stage. The tailored rewards are presented as the mechanism that enforces domain-oriented reasoning, yet the description leaves the exact formulation and any ablations on bias or proxy signals thin. Without those details it is difficult to separate genuine capability gains from benchmark-specific fitting. The experimental claims would also benefit from more visible controls on data handling and variance, even if the full paper expands on the abstract.

This is for researchers working on multimodal models for real industrial tasks such as anomaly detection and visual QA in manufacturing. Someone already experimenting with domain adaptation or RL fine-tuning for vision-language models would pick up usable ideas here.

I would send it for peer review. The idea is focused and the application area is relevant, so it is worth a referee's time even if the reward design and result robustness need tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces JUDO, a Juxtaposed Domain-Oriented Multimodal Reasoner for industrial anomaly QA. It proposes a framework that performs visual reasoning by juxtaposing query images with normal images to enable fine-grained defect segmentation, injects domain knowledge via supervised fine-tuning (SFT), and guides domain-oriented reasoning through group relative policy optimization (GRPO) with tailored rewards. The central claim is that this yields superior performance on the MMAD benchmark relative to baselines including Qwen2.5-VL-7B and GPT-4o.

Significance. If the performance gains are shown to be robust and generalizable, the work would usefully demonstrate how explicit visual domain context and reinforcement learning with domain-specific rewards can mitigate the lack of industrial knowledge in current LMMs. The juxtaposition mechanism and staged SFT-then-GRPO pipeline represent a concrete methodological contribution worth further exploration in anomaly understanding tasks.

major comments (2)

[Abstract] Abstract: the claim that JUDO achieves superior performance on MMAD (surpassing Qwen2.5-VL-7B and GPT-4o) is presented without any description of the experimental protocol, evaluation metrics, baseline implementations, statistical significance, or data splits. This directly undermines verification of the central empirical claim.
[Abstract] Training description: the GRPO stage relies on 'tailored rewards' to enforce domain-oriented reasoning and accurate defect segmentation, yet no explicit reward formulation, weighting, or proxy signals are supplied. Because the rewards are listed among the free parameters and are load-bearing for the claim that RL adds genuine capability rather than benchmark artifacts, their absence prevents assessment of whether the method avoids bias or incorrect knowledge injection.

minor comments (1)

[Abstract] The abstract would be strengthened by a one-sentence characterization of the MMAD benchmark (e.g., number of images, anomaly types, or question formats) to contextualize the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide detailed responses to the major comments below and outline the revisions we intend to implement.

Referee: [Abstract] Abstract: the claim that JUDO achieves superior performance on MMAD (surpassing Qwen2.5-VL-7B and GPT-4o) is presented without any description of the experimental protocol, evaluation metrics, baseline implementations, statistical significance, or data splits. This directly undermines verification of the central empirical claim.

Authors: We acknowledge that the abstract does not elaborate on the experimental details due to space constraints. However, the manuscript's Section 4 fully specifies the experimental protocol, including the use of the MMAD benchmark, evaluation metrics (e.g., accuracy, precision, recall), baseline model implementations, data splits, and statistical significance testing through repeated experiments. To enhance the abstract's informativeness, we will incorporate a brief mention of the evaluation setup and direct readers to the detailed experimental section. revision: yes

Referee: [Abstract] Training description: the GRPO stage relies on 'tailored rewards' to enforce domain-oriented reasoning and accurate defect segmentation, yet no explicit reward formulation, weighting, or proxy signals are supplied. Because the rewards are listed among the free parameters and are load-bearing for the claim that RL adds genuine capability rather than benchmark artifacts, their absence prevents assessment of whether the method avoids bias or incorrect knowledge injection.

Authors: The tailored rewards for the GRPO stage are explicitly formulated in Section 3.2 of the paper, detailing the reward components for domain-oriented reasoning, accurate defect segmentation, and their respective weightings and proxy signals. This design aims to guide the model towards genuine capability enhancement rather than artifacts. We will revise the abstract to include a short reference to the reward structure and its role in the training pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmark comparisons, not self-referential definitions or fitted inputs.

The paper presents JUDO as a framework that juxtaposes query and normal images for visual defect segmentation, applies SFT for domain knowledge, then uses GRPO with tailored rewards for reasoning. It reports superior MMAD results versus Qwen2.5-VL-7B and GPT-4o. No equations, derivations, or self-citations are shown that reduce the central claims to inputs by construction. The chain is self-contained against external model comparisons and benchmarks, with no evidence of self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from multimodal learning and reinforcement learning for domain adaptation. No new physical entities are postulated. The approach depends on the effectiveness of the proposed training process and visual comparison technique.

free parameters (1)

tailored rewards for GRPO
Rewards are described as tailored for domain-oriented reasoning but their specific design and any fitting process are not detailed in the abstract.

axioms (2)

domain assumption: Juxtaposing query images with normal images enables fine-grained visual comparative inspection for defect segmentation
This is invoked as the core mechanism for visual reasoning in the abstract.
domain assumption: Supervised fine-tuning followed by GRPO with tailored rewards enhances context understanding and guides domain-oriented reasoning
This is the stated process for injecting and applying domain knowledge.

pith-pipeline@v0.9.0 · 5738 in / 1623 out tokens · 54631 ms · 2026-05-21T07:46:43.081791+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Stage 3: Domain-oriented GRPO with Rdomain = λ · ϕ(Egen)·ϕ(Epdomain) / norms, Rseg F1 on 16×16 patches, choice/format rewards

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: JUDO achieves 81.20% on MMAD via SFT+GRPO on MVTec/VisA/GoodsAD
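
The Stage 3 passage quoted above names two reward terms, and the sketch below gives one plausible reading of each. It assumes that Rdomain is a cosine similarity between an embedding of the generated text and an embedding of reference domain text, scaled by λ, and that Rseg is an F1 score over binary patch labels. The function names, the flat-list patch representation, and the handling of the no-positive case are assumptions for illustration; the paper's exact formulation, weighting, and embedding model ϕ are not given in this page.

```python
import math
from typing import Sequence


def domain_reward(gen_emb: Sequence[float], ref_emb: Sequence[float], lam: float = 1.0) -> float:
    """Cosine similarity between two embedding vectors, scaled by lam (assumed form)."""
    dot = sum(a * b for a, b in zip(gen_emb, ref_emb))
    norms = math.sqrt(sum(a * a for a in gen_emb)) * math.sqrt(sum(b * b for b in ref_emb))
    if norms == 0.0:
        return 0.0  # undefined for a zero vector; returning 0 is a choice made here
    return lam * dot / norms


def patch_f1(pred: Sequence[int], truth: Sequence[int]) -> float:
    """F1 over binary patch labels (1 = anomalous patch), given as flat sequences."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp == 0:
        return 0.0  # also covers the case with no positives at all (assumed handling)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


aligned = domain_reward([1.0, 0.0], [1.0, 0.0])
orthogonal = domain_reward([1.0, 0.0], [0.0, 1.0])
half_overlap = patch_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

How these terms are combined with the choice and format rewards is exactly the detail the referee report asks the authors to make explicit, so no combination rule is sketched here.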

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

