IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Fangfang Lin; Huawei Cao; Jiyuan Wang; Kejin Cui; Mengmeng Wang; Min Qiu; Rongbin Tan; Shuhan Song{\S}; Yi Wang; Yue Wang

arxiv: 2605.20682 · v1 · pith:5Z7SR4HXnew · submitted 2026-05-20 · 💻 cs.CV

Rongbin Tan, Fangfang Lin, Zhenlong Yuan, Min Qiu, Kejin Cui, Mengmeng Wang, Yi Wang, Zijian Song, Zhiyuan Wang, Jiyuan Wang, Yue Wang, Shuhan Song{\S}, Huawei Cao

Pith reviewed 2026-05-21 05:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords industrial anomaly detection · open-vocabulary detection · multimodal large language models · agentic frameworks · reinforcement learning · zero-shot learning · tool augmentation · anomaly localization

The pith

IndusAgent uses a tool-augmented agent and gated reinforcement learning to achieve state-of-the-art zero-shot performance in open-vocabulary industrial anomaly detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IndusAgent to overcome limitations in multimodal large language models when detecting anomalies in industrial images without specific training examples. It builds a dataset called Indus-CoT that combines visual observations with expert knowledge on normal parts, then trains the model to call external tools like region cropping and feature enhancement only when they help. A gated reinforcement learning method optimizes for accurate classification, localization, and reasoning while keeping tool use efficient. If successful, this would allow AI systems to inspect a wider range of factory products and defects more reliably than current methods.
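As a rough illustration of what one Indus-CoT training record might hold, the sketch below groups the three ingredients named above (a global observation, local patches, and an expert normalcy prior) with a reasoning trajectory and a verdict. Every name here (`IndusCoTRecord`, `LocalPatch`, `to_prompt`, and all field names) is hypothetical; this page does not give the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema: all names are illustrative, not taken from the paper.
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


@dataclass
class LocalPatch:
    box: Box            # region of the full image the patch was cropped from
    description: str    # what the patch shows at high resolution


@dataclass
class IndusCoTRecord:
    image_id: str
    global_observation: str          # whole-image description
    normalcy_prior: str              # expert note on what a normal part looks like
    local_patches: List[LocalPatch] = field(default_factory=list)
    reasoning_steps: List[str] = field(default_factory=list)
    is_anomalous: bool = False
    anomaly_type: str = ""           # empty when the part is normal

    def to_prompt(self) -> str:
        """Flatten the record into a single supervision string."""
        lines = [
            f"Observation: {self.global_observation}",
            f"Normalcy prior: {self.normalcy_prior}",
        ]
        for patch in self.local_patches:
            lines.append(f"Patch {patch.box}: {patch.description}")
        for i, step in enumerate(self.reasoning_steps, 1):
            lines.append(f"Step {i}: {step}")
        verdict = f"anomalous ({self.anomaly_type})" if self.is_anomalous else "normal"
        lines.append(f"Verdict: {verdict}")
        return "\n".join(lines)
```

A record built this way keeps the prior and the local evidence next to the verdict, which is one plausible way to supervise inspection-style reasoning.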

Core claim

The central discovery is that orchestrating tools such as dynamic region cropping, high-frequency feature enhancement, and prior retrieval through a gated reinforcement learning objective on the Indus-CoT dataset enables multimodal models to resolve visual ambiguities and achieve superior zero-shot anomaly detection across industrial benchmarks.

What carries the argument

IndusAgent, a tool-augmented agentic framework that dynamically orchestrates external tools including region cropping and feature enhancement under a gated reinforcement learning objective to optimize anomaly classification and localization.
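To make the idea of tool orchestration concrete, here is a minimal sketch of a tool registry with simple stand-ins for the three tools named in the paper. The implementations are assumptions chosen for brevity: cropping is plain slicing, the high-frequency enhancement is an unsharp-mask style filter, and prior retrieval is a dictionary lookup. The names `TOOLS` and `call_tool` are hypothetical, and the paper's own tools are likely more elaborate.

```python
from typing import Callable, Dict, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def crop_region(image: Image, x0: int, y0: int, x1: int, y1: int) -> Image:
    """Return the sub-image covering rows y0..y1-1 and columns x0..x1-1."""
    return [row[x0:x1] for row in image[y0:y1]]


def enhance_high_frequency(image: Image, amount: float = 1.0) -> Image:
    """Unsharp-mask style boost: pixel + amount * (pixel - 3x3 local mean)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Average over the 3x3 neighbourhood, clipped at the borders.
            nbrs = [image[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            mean = sum(nbrs) / len(nbrs)
            row.append(image[y][x] + amount * (image[y][x] - mean))
        out.append(row)
    return out


def retrieve_prior(category: str, priors: Dict[str, str]) -> str:
    """Look up a normalcy description for a product category."""
    return priors.get(category, "no prior available")


# Registry mapping tool names to callables (names are illustrative).
TOOLS: Dict[str, Callable] = {
    "crop": crop_region,
    "enhance": enhance_high_frequency,
    "prior": retrieve_prior,
}


def call_tool(name: str, *args, **kwargs):
    """Dispatch one tool call requested by the agent."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*args, **kwargs)
```

In an agent loop, the model would emit a tool name and arguments, `call_tool` would run it, and the result would be appended to the context before the next reasoning step.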

If this is right

The framework allows active resolution of visual ambiguities in anomaly detection tasks.
It achieves state-of-the-art zero-shot performance on benchmarks including MVTec-AD, VisA, MPDD, DTD, and SDD.
Tool invocation happens only when beneficial due to the gating mechanism.
Joint optimization of classification, localization, reasoning, and efficient tool usage improves overall robustness.
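The exact form of the gated objective is not given on this page, so the following is only one plausible reading, sketched under stated assumptions: the gate is taken to be classification correctness, localization and anomaly-type rewards count only when the gate is open, and each tool call carries a small fixed cost. The function name `gated_reward`, the weights `w_loc` and `w_type`, and `tool_cost` are all hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def gated_reward(
    pred_anomalous: bool,
    true_anomalous: bool,
    pred_box: Optional[Box],
    true_box: Optional[Box],
    pred_type: str,
    true_type: str,
    n_tool_calls: int,
    w_loc: float = 1.0,      # assumed weight, not from the paper
    w_type: float = 0.5,     # assumed weight, not from the paper
    tool_cost: float = 0.1,  # assumed per-call penalty, not from the paper
) -> float:
    """Assumed form: r_cls + gate * (w_loc * r_loc + w_type * r_type) - cost."""
    r_cls = 1.0 if pred_anomalous == true_anomalous else 0.0
    gate = r_cls  # auxiliary terms only count when classification is right
    has_boxes = pred_box is not None and true_box is not None
    r_loc = iou(pred_box, true_box) if has_boxes else 0.0
    r_type = 1.0 if true_anomalous and pred_type == true_type else 0.0
    return r_cls + gate * (w_loc * r_loc + w_type * r_type) - tool_cost * n_tool_calls
```

Under this reading, a tool call pays off only if it raises the gated terms by more than `tool_cost`, which is one way to discourage unnecessary invocations.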

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could reduce reliance on extensive labeled data for training anomaly detectors in new industrial domains.
Similar agentic tool orchestration might apply to other vision tasks prone to hallucinations in multimodal models.
Testing on real-time factory lines could reveal whether the computational overhead of tool calls remains practical.
Extending the prior retrieval tool with more diverse expert knowledge might further improve generalization to novel defect types.

Load-bearing premise

The combination of the Indus-CoT dataset, tool orchestration, and gated reinforcement learning produces genuine generalization across industrial scenarios rather than improvements specific to the tested benchmarks.

What would settle it

Evaluation on a new set of industrial images featuring previously unseen product categories and anomaly types where IndusAgent does not outperform standard multimodal models without the agent tools.

Figures

Figures reproduced from arXiv: 2605.20682 by Fangfang Lin, Huawei Cao, Jiyuan Wang, Kejin Cui, Mengmeng Wang, Min Qiu, Rongbin Tan, Shuhan Song{\S}, Yi Wang, Yue Wang, Zhenlong Yuan, Zhiyuan Wang, Zijian Song.

**Figure 1.** Comparison of anomaly detection paradigms using MLLMs. (a) Standard MLLMs suffer from unaligned reasoning and structural hallucinations, often misinterpreting legitimate variations. (b) Ordinary Chain-of-Thought (CoT) reasoning is insufficient; without domain knowledge and localized perception, the model misjudges subtle defects as normal reflections due to perceptual dilution. (c) Our proposed framework…

**Figure 2.** The overall architecture of IndusAgent. Our training pipeline consists of three sequential stages: (1) Indus-CoT Construction, where a frontier model (Qwen3-VL-Max) synthesizes structured reasoning trajectories to form high-quality positive and negative examples; (2) Agentic Fine-Tuning, which aligns a lightweight base model (Qwen3-VL-8B) with domain-specific diagnostic protocols; and (3) Tool-Augmented …

**Figure 3.** Zero-shot Comparison. Finding 1: Domain-specific alignment is critical. MLLM reasoning alone remains unreliable for complex industrial samples. For example, Qwen3-VL-Instruct performs poorly on VisA, while Agentic SFT and RL substantially improve performance, indicating that robust IAD requires task-specific diagnostic alignment rather than open-ended reasoning alone. Finding 2: Active tooling complements…

**Figure 4.** Case study comparing Qwen3-VL-8B and our method.

**Figure 5.** Case study comparing Qwen3-VL-8B and our method.

**Figure 6.** Case study comparing Qwen3-VL-8B and our method.

Original abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IndusAgent adds agentic tools and gated RL to MLLMs for open-vocabulary industrial anomaly detection with claimed SOTA zero-shot results, but the zero-shot framing rests on an unverified separation in Indus-CoT.

The main thing to take from this paper is that IndusAgent uses an MLLM agent with external tools and a gated RL objective to improve open-vocabulary anomaly detection for industrial settings, and it reports leading zero-shot numbers on five benchmarks. The work introduces the Indus-CoT dataset, which combines global views, local patches, and expert normalcy priors to train the model on inspection-style reasoning. From there, the agent can call tools for dynamic cropping, feature enhancement, and prior lookup. The gated RL part is meant to optimize several goals at once while only triggering tools when they add value.

This approach addresses some of the weaknesses of direct MLLM use, such as misaligned reasoning on subtle defects. The multi-part RL objective gives a way to balance accuracy with efficiency, which is useful in practice.

The soft spot is the zero-shot evaluation. Possible overlap between Indus-CoT and test sets such as MVTec-AD or VisA is worth checking closely. The paper describes building the dataset from observations and priors but does not appear to include a clear statement ruling out shared images or annotations with the evaluation benchmarks. If there is any leakage, the generalization argument weakens considerably. Assuming the experiments include proper ablations and the data is clean, the method offers a workable path for handling new defect categories in manufacturing.

This paper targets readers working on applied anomaly detection or multimodal agents in specialized domains. Someone looking for ideas on tool-augmented systems for vision tasks would find the orchestration details relevant. It deserves peer review: the core framework is well-motivated and the results are presented across multiple datasets, so referees can assess the claims once the data sourcing is clarified.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes IndusAgent, a tool-augmented agentic framework for open-vocabulary industrial anomaly detection. It first builds the Indus-CoT dataset by integrating global visual observations, high-resolution local patches, and expert normalcy priors to supervise fine-tuning on industrial inspection trajectories. The framework then dynamically orchestrates external tools (dynamic region cropping, high-frequency feature enhancement, prior retrieval) and introduces a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage. Extensive evaluations on five benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) are reported to establish state-of-the-art zero-shot performance.

Significance. If the zero-shot claims hold without data leakage and with proper controls, the work could meaningfully advance open-vocabulary IAD by showing how agentic tool use and gated RL can mitigate domain misalignment and hallucinations in MLLMs. The structured Indus-CoT construction and explicit optimization of tool invocation are concrete contributions that, if reproducible, would provide a useful template for other perception-reasoning tasks.

major comments (2)

§3 (Indus-CoT construction): The description provides no explicit statement on the provenance of images, masks, or annotations used to build Indus-CoT, nor any confirmation that none of the five evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed data. This separation is load-bearing for the zero-shot SOTA claim; overlap would collapse the generalization argument.
§5 (Experiments): The abstract and result summary assert SOTA zero-shot performance yet supply no quantitative metrics (e.g., AUROC, PRO, pixel-level scores), ablation tables, or baseline details. Without these, the central empirical claim cannot be verified or compared to prior methods.
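One inexpensive way to back a no-overlap statement is an exact-duplicate scan between the training images and each benchmark. The sketch below is an assumption about how such a check could be run, not something the paper describes: it hashes file bytes with SHA-256 and reports matching pairs. The function names and directory layout are hypothetical, and a byte-level hash only catches identical files; resized or re-encoded copies would need a perceptual hash or feature-level comparison.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".bmp")


def file_digest(path: Path) -> str:
    """SHA-256 of the raw file bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def index_images(root: Path) -> Dict[str, List[Path]]:
    """Map each content digest to the image files under root that share it."""
    index: Dict[str, List[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            index.setdefault(file_digest(path), []).append(path)
    return index


def find_exact_overlap(train_root: Path, benchmark_root: Path) -> List[Tuple[Path, Path]]:
    """Return (train_file, benchmark_file) pairs with byte-identical content."""
    train = index_images(train_root)
    bench = index_images(benchmark_root)
    pairs: List[Tuple[Path, Path]] = []
    for digest in sorted(train.keys() & bench.keys()):
        for t in train[digest]:
            for b in bench[digest]:
                pairs.append((t, b))
    return pairs
```

An empty result for every benchmark is a necessary but not sufficient condition for clean separation, since annotations or near-duplicate images could still be shared.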

minor comments (2)

[Abstract] The abstract is long; tightening it while adding key quantitative highlights and a one-sentence statement of the no-overlap guarantee would improve readability.
[§4] Notation for the gated RL objective (reward components, gating mechanism) should be formalized with equations rather than prose to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. We believe these clarifications and planned revisions will strengthen the paper.

Referee: §3 (Indus-CoT construction): The description provides no explicit statement on the provenance of images, masks, or annotations used to build Indus-CoT, nor any confirmation that none of the five evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed data. This separation is load-bearing for the zero-shot SOTA claim; overlap would collapse the generalization argument.

Authors: We thank the referee for highlighting the importance of explicitly documenting data provenance to support the zero-shot claims. We agree that this information should be stated clearly. In the revised manuscript, we will include an explicit statement in §3 detailing the sources of images, masks, and annotations for Indus-CoT and confirming that none of the evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed any data to its construction. This addition will strengthen the generalization argument.

Revision: yes

Referee: §5 (Experiments): The abstract and result summary assert SOTA zero-shot performance yet supply no quantitative metrics (e.g., AUROC, PRO, pixel-level scores), ablation tables, or baseline details. Without these, the central empirical claim cannot be verified or compared to prior methods.

Authors: We acknowledge the referee's point that the abstract and result summary would benefit from including specific quantitative metrics to allow immediate verification of the SOTA claims. Although the detailed results, including AUROC, PRO, pixel-level scores, ablation tables, and baseline comparisons, are provided in §5, we will revise the abstract and result summary to incorporate key numerical results and references to the relevant tables and figures.

Revision: yes
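For readers who want to verify image-level numbers once they are reported, AUROC can be computed from anomaly scores without any library. The sketch below uses the rank-sum (Mann-Whitney U) formulation with average ranks for ties. It is a generic implementation of the metric, not the paper's evaluation code, and the function name `auroc` is chosen here for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 1 for anomalous, 0 for normal. scores: higher means more anomalous.
    Tied scores receive average ranks, so a tie counts as half a correct ordering.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative sample")

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of samples tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

For example, `auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` returns 0.75: three of the four anomalous/normal pairs are ordered correctly.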

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

The paper describes construction of the Indus-CoT dataset from global observations, local patches and expert priors, followed by tool orchestration and a gated reinforcement learning objective, then reports empirical zero-shot performance on the external benchmarks MVTec-AD, VisA, MPDD, DTD and SDD. No equations, derivations or first-principles results are presented that reduce by construction to fitted inputs, self-definitions or self-citation chains. The central claims rest on measured performance against independent test sets rather than any renaming, prediction-from-fit or load-bearing self-reference, rendering the reported method self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive identification; the approach implicitly relies on assumptions about tool effectiveness and RL balancing that are not quantified here.

axioms (1)

Domain assumption: External tools for region cropping, feature enhancement, and prior retrieval resolve visual ambiguities in industrial images.
Invoked when describing how the agent disentangles subtle anomalies.

pith-pipeline@v0.9.0 · 5815 in / 1149 out tokens · 39678 ms · 2026-05-21T05:30:03.731675+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear
Paper passage: "We propose IndusAgent, a tool-augmented agentic framework... gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat embedding and J-cost orbit · relation to the paper passage: unclear
Paper passage: "Indus-CoT... global visual observations, high-resolution local patches, and expert normalcy priors"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Learning to reason with large language models

OpenAI. Learning to reason with large language models. OpenAI Blog, 2024. URL https://openai. com/index/learning-to-reason-with-llms/

work page 2024
[62]

Hello gpt-4o.OpenAI Blog, 2024

OpenAI. Hello gpt-4o.OpenAI Blog, 2024. URLhttps://openai.com/index/hello-gpt-4o/

work page 2024
[63]

Gpt-4.1 technical report.arXiv preprint, 2025

OpenAI. Gpt-4.1 technical report.arXiv preprint, 2025

work page 2025
[64]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, pages 27730–27744, 2022

work page 2022
[65]

Y . Peng, G. Zhang, M. Zhang, Z. You, J. Liu, Q. Zhu, K. Yang, X. Xu, X. Geng, and X. Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Adaptive label correction for robust medical image segmentation with noisy labels.arXiv preprint arXiv:2503.12218, 2025

Chengxuan Qian, Kai Han, Jianxia Ding, Chongwen Lyu, Zhenlong Yuan, Jun Chen, and Zhe Liu. Adaptive label correction for robust medical image segmentation with noisy labels.arXiv preprint arXiv:2503.12218, 2025

work page arXiv 2025
[67]

DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, and Zhengzhong Tu. Decalign: Hierarchical cross- modal alignment for decoupled multimodal representation learning.arXiv preprint arXiv:2503.11892, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[68]

Do large language models perform latent multi-hop reasoning without exploiting shortcuts? abs/2411.16679, 2024

Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, and Yujun Cai. From scale to speed: Adaptive test-time scaling for image editing.CoRR, abs/2603.00141, 2026. doi: 10.48550/ARXIV .2603.00141. URL https://doi.org/10.48550/arXiv.2603.00141

work page internal anchor Pith review doi:10.48550/arxiv 2026
[69]

K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler. Towards total recall in industrial anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14318–14328, 2022

work page 2022
[70]

Rudolph, T

M. Rudolph, T. Wehrbein, B. Rosenhahn, and B. Wandt. Fully convolutional cross-scale-flows for image-based defect detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1088–1097, 2022

work page 2022
[71]

Schick, J

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, et al. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[72]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[73]

Z. Shao, P. Wang, Q. Zhu, R. Zheng, Y . Zhao, M. Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[74]

Gpromptshield: Elevating resilience in graph prompt tuning against adversarial attacks

Shuhan Song, Ping Li, Ming Dun, Maolei Huang, Huawei Cao, and Xiaochun Ye. Gpromptshield: Elevating resilience in graph prompt tuning against adversarial attacks. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[75]

Equipping graph autoencoders: Revisiting masking strategies from a robustness perspective

Shuhan Song, Ping Li, Ming Dun, Yuan Zhang, Huawei Cao, and Xiaochun Ye. Equipping graph autoencoders: Revisiting masking strategies from a robustness perspective. InProceedings of the 2025 SIAM International Conference on Data Mining (SDM), pages 366–375. SIAM, 2025

work page 2025
[76]

Spmgae: Self-purified masked graph autoencoders release robust expression power.Neurocomputing, 611:128631, 2025

Shuhan Song, Ping Li, Ming Dun, Yuan Zhang, Huawei Cao, and Xiaochun Ye. Spmgae: Self-purified masked graph autoencoders release robust expression power.Neurocomputing, 611:128631, 2025

work page 2025
[77]

Physical autoregressive model for robotic manipulation without action pretraining.arXiv preprint arXiv:2508.09822, 2025

Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. Physical autoregressive model for robotic manipulation without action pretraining.arXiv preprint arXiv:2508.09822, 2025. 13

work page arXiv 2025
[78]

Human-centric open-future task discovery: Formulation, benchmark, and scalable tree-based search

Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, and Liang Lin. Human-centric open-future task discovery: Formulation, benchmark, and scalable tree-based search. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17724–17732, 2026

work page 2026
[79]

Video-coe: Reinforcing video event prediction via chain of events.arXiv preprint arXiv:2603.14935, 2026

Qile Su, Jing Tang, Rui Chen, Lei Sun, and Xiangxiang Chu. Video-coe: Reinforcing video event prediction via chain of events.arXiv preprint arXiv:2603.14935, 2026

work page arXiv 2026
[80]

Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y . Shen, et al. Aligning large multimodal models with factually augmented rlhf. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[81]

Tabernik, S

D. Tabernik, S. Šela, J. Skvarˇc, and D. Skoˇcaj. Spatially-adaptive filter units for deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9388–9396, 2020

work page 2020
