JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA
Pith reviewed 2026-05-21 07:46 UTC · model grok-4.3
The pith
JUDO improves anomaly question answering by juxtaposing defect images with normal references and training models with domain-specific fine-tuning plus reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JUDO is a Juxtaposed Domain-Oriented Multimodal Reasoner. It segments defect regions by visually comparing query images against normal images that serve as domain context, injects domain knowledge via supervised fine-tuning, and then guides reasoning with reinforcement learning using tailored rewards. The paper reports higher performance on the MMAD benchmark than models such as Qwen2.5-VL-7B and GPT-4o.
What carries the argument
Juxtaposition of query images with normal images for fine-grained visual comparative inspection, paired with supervised fine-tuning followed by group relative policy optimization using domain-oriented reward signals.
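To make the role of the reward signal concrete, here is a minimal sketch of the group-relative advantage step that gives group relative policy optimization its name: rewards for several responses sampled for the same prompt are standardized against the group's own mean and standard deviation. The function name, the `eps` term, and the toy rewards are illustrative assumptions; nothing here is taken from JUDO's training code.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    # Responses above the group mean get a positive advantage, those below a negative one.
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical rewards for four responses sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the group mean, no separate value model is needed; a group whose responses all earn the same reward yields zero advantage for every response.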
If this is right
- Defect segmentation becomes more precise because the model can perform direct side-by-side visual comparison using normal images.
- Responses in industrial anomaly QA incorporate more accurate domain context after the supervised fine-tuning stage.
- Reinforcement learning with tailored rewards steers the model toward domain-oriented reasoning paths rather than generic ones.
- Overall benchmark scores on MMAD rise above those of general-purpose multimodal models such as GPT-4o.
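The side-by-side comparison in the first point can be pictured with a small sketch that places a normal reference image next to the query image in a single panel. This is only one plausible reading of "juxtaposition": the summary does not say how JUDO actually presents the two images to the model, and the function name, gap width, and synthetic images below are assumptions for illustration.

```python
import numpy as np


def juxtapose(reference, query, gap=8, fill=255):
    """Return one panel: reference on the left, query on the right, spacer between."""
    if reference.shape != query.shape:
        raise ValueError("reference and query must share the same shape")
    height, _, channels = query.shape
    spacer = np.full((height, gap, channels), fill, dtype=query.dtype)
    return np.concatenate([reference, spacer, query], axis=1)


# Synthetic stand-ins: a blank "normal" image and a copy with a bright square as the defect.
reference = np.zeros((64, 64, 3), dtype=np.uint8)
query = reference.copy()
query[20:30, 20:30] = 200
panel = juxtapose(reference, query)
```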
Where Pith is reading between the lines
- The juxtaposition technique could transfer to other fine-grained visual comparison tasks such as medical image review where normal references are available.
- Custom reward signals in the reinforcement stage might reduce the chance of the model inventing nonexistent defects.
- Similar domain-injection steps could be tested on smaller models to see whether they close the gap to larger general models without extra scale.
Load-bearing premise
That placing normal reference images next to queries gives enough visual context for correct defect segmentation and that supervised fine-tuning plus reward-guided reinforcement learning will produce effective domain reasoning without adding biases or errors.
What would settle it
Evaluating JUDO on the MMAD benchmark after removing the normal reference images and checking whether accuracy falls below that of Qwen2.5-VL-7B or GPT-4o.
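The check described above amounts to a three-way accuracy comparison. A minimal sketch, assuming multiple-choice answers scored by exact match; the function names and the toy predictions are hypothetical and are not MMAD results:

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def reference_ablation(with_refs, without_refs, baseline, answers):
    """Compare a model with and without normal references against a baseline."""
    acc_with = accuracy(with_refs, answers)
    acc_without = accuracy(without_refs, answers)
    acc_base = accuracy(baseline, answers)
    return {
        "with_refs": acc_with,
        "without_refs": acc_without,
        "baseline": acc_base,
        "drop": acc_with - acc_without,
        "falls_below_baseline": acc_without < acc_base,
    }


# Toy data only; each list holds one predicted option letter per question.
answers = ["A", "B", "C", "D"]
report = reference_ablation(
    with_refs=["A", "B", "C", "D"],
    without_refs=["A", "C", "C", "A"],
    baseline=["A", "B", "D", "D"],
    answers=answers,
)
```

A large `drop` together with `falls_below_baseline` being true would suggest the normal references carry most of the gain.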
Original abstract
Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces JUDO, a Juxtaposed Domain-Oriented Multimodal Reasoner for industrial anomaly QA. It proposes a framework that performs visual reasoning by juxtaposing query images with normal images to enable fine-grained defect segmentation, injects domain knowledge via supervised fine-tuning (SFT), and guides domain-oriented reasoning through group relative policy optimization (GRPO) with tailored rewards. The central claim is that this yields superior performance on the MMAD benchmark relative to baselines including Qwen2.5-VL-7B and GPT-4o.
Significance. If the performance gains are shown to be robust and generalizable, the work would usefully demonstrate how explicit visual domain context and reinforcement learning with domain-specific rewards can mitigate the lack of industrial knowledge in current LMMs. The juxtaposition mechanism and staged SFT-then-GRPO pipeline represent a concrete methodological contribution worth further exploration in anomaly understanding tasks.
major comments (2)
- [Abstract] Abstract: the claim that JUDO achieves superior performance on MMAD (surpassing Qwen2.5-VL-7B and GPT-4o) is presented without any description of the experimental protocol, evaluation metrics, baseline implementations, statistical significance, or data splits. This directly undermines verification of the central empirical claim.
- [Abstract] Training description: the GRPO stage relies on 'tailored rewards' to enforce domain-oriented reasoning and accurate defect segmentation, yet no explicit reward formulation, weighting, or proxy signals are supplied. Because the rewards are listed among the free parameters and are load-bearing for the claim that RL adds genuine capability rather than benchmark artifacts, their absence prevents assessment of whether the method avoids bias or incorrect knowledge injection.
minor comments (1)
- [Abstract] The abstract would be strengthened by a one-sentence characterization of the MMAD benchmark (e.g., number of images, anomaly types, or question formats) to contextualize the reported gains.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We provide detailed responses to the major comments below and outline the revisions we intend to implement.
- Referee: [Abstract] Abstract: the claim that JUDO achieves superior performance on MMAD (surpassing Qwen2.5-VL-7B and GPT-4o) is presented without any description of the experimental protocol, evaluation metrics, baseline implementations, statistical significance, or data splits. This directly undermines verification of the central empirical claim.
  Authors: We acknowledge that the abstract does not elaborate on the experimental details due to space constraints. However, the manuscript's Section 4 fully specifies the experimental protocol, including the use of the MMAD benchmark, evaluation metrics (e.g., accuracy, precision, recall), baseline model implementations, data splits, and statistical significance testing through repeated experiments. To enhance the abstract's informativeness, we will incorporate a brief mention of the evaluation setup and direct readers to the detailed experimental section.
  Revision: yes
- Referee: [Abstract] Training description: the GRPO stage relies on 'tailored rewards' to enforce domain-oriented reasoning and accurate defect segmentation, yet no explicit reward formulation, weighting, or proxy signals are supplied. Because the rewards are listed among the free parameters and are load-bearing for the claim that RL adds genuine capability rather than benchmark artifacts, their absence prevents assessment of whether the method avoids bias or incorrect knowledge injection.
  Authors: The tailored rewards for the GRPO stage are explicitly formulated in Section 3.2 of the paper, detailing the reward components for domain-oriented reasoning, accurate defect segmentation, and their respective weightings and proxy signals. This design aims to guide the model towards genuine capability enhancement rather than artifacts. We will revise the abstract to include a short reference to the reward structure and its role in the training pipeline.
  Revision: yes
Circularity Check
No circularity: performance claims rest on external benchmark comparisons, not self-referential definitions or fitted inputs.
The paper presents JUDO as a framework that juxtaposes query and normal images for visual defect segmentation, applies SFT for domain knowledge, then uses GRPO with tailored rewards for reasoning. It reports superior MMAD results versus Qwen2.5-VL-7B and GPT-4o. No equations, derivations, or self-citations are shown that reduce the central claims to inputs by construction. The chain is self-contained against external model comparisons and benchmarks, with no evidence of self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations.
Axiom & Free-Parameter Ledger
free parameters (1)
- tailored rewards for GRPO
axioms (2)
- domain assumption: Juxtaposing query images with normal images enables fine-grained visual comparative inspection for defect segmentation
- domain assumption: Supervised fine-tuning followed by GRPO with tailored rewards enhances context understanding and guides domain-oriented reasoning
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Stage 3: Domain-oriented GRPO with R_domain = λ · ϕ(E_gen) · ϕ(E_pdomain) / norms, R_seg as F1 on 16×16 patches, and choice/format rewards
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: JUDO achieves 81.20% on MMAD via SFT+GRPO on MVTec/VisA/GoodsAD
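The Stage 3 passage quoted above names two reward components tersely, so a hedged sketch may help in reading it. It assumes that "/ norms" denotes division by the product of the two embedding norms (a cosine similarity scaled by λ) and that "16×16 patches" means non-overlapping 16×16-pixel cells. Both readings, along with every function name and the choice to score two empty masks as 1.0, are assumptions rather than the paper's stated definitions.

```python
import numpy as np


def patch_labels(mask, patch=16):
    """Mark a patch as defective if any pixel inside it is set."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    if h % patch or w % patch:
        raise ValueError("mask height and width must be multiples of the patch size")
    cells = mask.reshape(h // patch, patch, w // patch, patch)
    return cells.any(axis=(1, 3))


def segmentation_f1(pred_mask, true_mask, patch=16):
    """Patch-level F1 between predicted and ground-truth defect masks."""
    pred = patch_labels(pred_mask, patch)
    true = patch_labels(true_mask, patch)
    tp = np.logical_and(pred, true).sum()
    fp = np.logical_and(pred, ~true).sum()
    fn = np.logical_and(~pred, true).sum()
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if denom == 0 else float(2 * tp / denom)


def cosine_reward(gen_embedding, domain_embedding, lam=1.0, eps=1e-8):
    """lam times the cosine similarity of two already-embedded texts."""
    g = np.asarray(gen_embedding, dtype=np.float64)
    d = np.asarray(domain_embedding, dtype=np.float64)
    return float(lam * g.dot(d) / (np.linalg.norm(g) * np.linalg.norm(d) + eps))


# Toy masks: the true defect covers two patches, the prediction finds one of them.
true_mask = np.zeros((64, 64), dtype=bool)
true_mask[0:16, 0:32] = True
pred_mask = np.zeros((64, 64), dtype=bool)
pred_mask[0:16, 0:16] = True
f1 = segmentation_f1(pred_mask, true_mask)
```

Scoring at patch level rather than pixel level makes the reward tolerant of small boundary errors, which may be the motivation for the coarse grid.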
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.