IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
Pith reviewed 2026-05-21 05:30 UTC · model grok-4.3
The pith
IndusAgent uses a tool-augmented agent and gated reinforcement learning to achieve state-of-the-art zero-shot performance in open-vocabulary industrial anomaly detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a multimodal model fine-tuned on the Indus-CoT dataset, and trained under a gated reinforcement learning objective to orchestrate tools such as dynamic region cropping, high-frequency feature enhancement, and prior retrieval, can resolve visual ambiguities and achieve superior zero-shot anomaly detection across industrial benchmarks.
What carries the argument
IndusAgent, a tool-augmented agentic framework that dynamically orchestrates external tools including region cropping and feature enhancement under a gated reinforcement learning objective to optimize anomaly classification and localization.
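To make the two image tools named above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names `crop_region`, `box_blur`, and `enhance_high_frequency` are hypothetical, images are assumed to be grayscale lists of rows, and unsharp masking with a 3x3 mean filter is swapped in as one common way to emphasize high-frequency detail.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities (assumption)


def crop_region(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image inside box = (top, left, bottom, right), clamped to bounds."""
    height, width = len(image), len(image[0])
    top, left, bottom, right = box
    top, left = max(0, top), max(0, left)
    bottom, right = min(height, bottom), min(width, right)
    return [row[left:right] for row in image[top:bottom]]


def box_blur(image: Image) -> Image:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        blurred.append(row)
    return blurred


def enhance_high_frequency(image: Image, amount: float = 1.0) -> Image:
    """Unsharp masking: add back the detail (image minus blur), scaled by amount."""
    blurred = box_blur(image)
    return [
        [pixel + amount * (pixel - low) for pixel, low in zip(row, blur_row)]
        for row, blur_row in zip(image, blurred)
    ]
```

A flat region passes through `enhance_high_frequency` unchanged, while an isolated bright pixel is amplified relative to its surroundings, which is the behavior one would want when hunting for small scratches or pits.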
If this is right
- The framework allows active resolution of visual ambiguities in anomaly detection tasks.
- It achieves state-of-the-art zero-shot performance on benchmarks including MVTec-AD, VisA, MPDD, DTD, and SDD.
- The gating mechanism restricts tool invocation to cases where it is beneficial.
- Joint optimization of classification, localization, reasoning, and efficient tool usage improves overall robustness.
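The abstract does not give the form of the gated objective, so the following is only a hypothetical sketch of how a reward might combine the four terms listed above. The `Rollout` fields, the weights, and the choice to gate the localization, type, and tool terms on a correct classification are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One inspection trajectory; every field name here is hypothetical."""
    classification_correct: bool  # normal vs. anomalous decision matches the label
    localization_iou: float       # overlap of predicted and true region, in [0, 1]
    type_correct: bool            # predicted anomaly type matches the label
    tool_calls: int               # number of external tool invocations


def gated_reward(
    rollout: Rollout,
    w_loc: float = 1.0,
    w_type: float = 0.5,
    tool_bonus: float = 0.2,
    call_cost: float = 0.05,
) -> float:
    """Assumed reward: secondary terms count only when the decision is correct."""
    accuracy = 1.0 if rollout.classification_correct else 0.0
    gate = accuracy  # assumption: a wrong decision zeroes every downstream term
    type_term = w_type if rollout.type_correct else 0.0
    tool_term = tool_bonus if rollout.tool_calls > 0 else 0.0
    gated = gate * (w_loc * rollout.localization_iou + type_term + tool_term)
    # Every call is charged, so tool use on a wrong answer is strictly penalized.
    return accuracy + gated - call_cost * rollout.tool_calls
```

Under these illustrative weights, one tool call on a correct answer nets +0.15, the bonus is fully spent after four calls, and any call on an incorrect answer lowers the reward. That is one simple way to encode "invoke tools only when beneficial"; the paper's actual gating may differ.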
Where Pith is reading between the lines
- This method could reduce reliance on extensive labeled data for training anomaly detectors in new industrial domains.
- Similar agentic tool orchestration might apply to other vision tasks prone to hallucinations in multimodal models.
- Testing on real-time factory lines could reveal whether the computational overhead of tool calls remains practical.
- Extending the prior retrieval tool with more diverse expert knowledge might further improve generalization to novel defect types.
Load-bearing premise
The combination of the Indus-CoT dataset, tool orchestration, and gated reinforcement learning produces genuine generalization across industrial scenarios rather than improvements specific to the tested benchmarks.
What would settle it
An evaluation on a new set of industrial images featuring previously unseen product categories and anomaly types, in which IndusAgent fails to outperform standard multimodal models that lack the agent tools.
Original abstract
Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes IndusAgent, a tool-augmented agentic framework for open-vocabulary industrial anomaly detection. It first builds the Indus-CoT dataset by integrating global visual observations, high-resolution local patches, and expert normalcy priors to supervise fine-tuning on industrial inspection trajectories. The framework then dynamically orchestrates external tools (dynamic region cropping, high-frequency feature enhancement, prior retrieval) and introduces a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage. Extensive evaluations on five benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) are reported to establish state-of-the-art zero-shot performance.
Significance. If the zero-shot claims hold without data leakage and with proper controls, the work could meaningfully advance open-vocabulary IAD by showing how agentic tool use and gated RL can mitigate domain misalignment and hallucinations in MLLMs. The structured Indus-CoT construction and explicit optimization of tool invocation are concrete contributions that, if reproducible, would provide a useful template for other perception-reasoning tasks.
major comments (2)
- [§3] §3 (Indus-CoT construction): The description provides no explicit statement on the provenance of images, masks, or annotations used to build Indus-CoT, nor any confirmation that none of the five evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed data. This separation is load-bearing for the zero-shot SOTA claim; overlap would collapse the generalization argument.
- [§5] §5 (Experiments): The abstract and result summary assert SOTA zero-shot performance yet supply no quantitative metrics (e.g., AUROC, PRO, pixel-level scores), ablation tables, or baseline details. Without these, the central empirical claim cannot be verified or compared to prior methods.
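The separation demanded in the first major comment can be partly checked mechanically. The sketch below is an illustration only: `content_digest` and `find_exact_overlap` are hypothetical helper names, and the inputs are assumed to be mappings from file name to raw file bytes. It flags training files whose bytes exactly match a benchmark file.

```python
import hashlib
from typing import Dict, List, Tuple


def content_digest(data: bytes) -> str:
    """SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_exact_overlap(
    train_files: Dict[str, bytes],
    benchmark_files: Dict[str, bytes],
) -> List[Tuple[str, str]]:
    """Return sorted (train_name, benchmark_name) pairs with identical content."""
    # Index benchmark files by digest so each training file is a single lookup.
    by_digest: Dict[str, List[str]] = {}
    for name, data in benchmark_files.items():
        by_digest.setdefault(content_digest(data), []).append(name)

    overlaps = []
    for name, data in train_files.items():
        for bench_name in by_digest.get(content_digest(data), []):
            overlaps.append((name, bench_name))
    return sorted(overlaps)
```

An empty result rules out only byte-identical copies. Resized, re-encoded, or cropped versions of benchmark images would pass this check, so a provenance statement from the authors is still needed.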
minor comments (2)
- [Abstract] The abstract is overly long; condensing it while moving key quantitative highlights and a one-sentence statement of the no-overlap guarantee into it would improve readability.
- [§4] Notation for the gated RL objective (reward components, gating mechanism) should be formalized with equations rather than prose to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. We believe these clarifications and planned revisions will strengthen the paper.
- Referee: [§3] §3 (Indus-CoT construction): The description provides no explicit statement on the provenance of images, masks, or annotations used to build Indus-CoT, nor any confirmation that none of the five evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed data. This separation is load-bearing for the zero-shot SOTA claim; overlap would collapse the generalization argument.
  Authors: We thank the referee for highlighting the importance of explicitly documenting data provenance to support the zero-shot claims. We agree that this information should be stated clearly. In the revised manuscript, we will include an explicit statement in §3 detailing the sources of images, masks, and annotations for Indus-CoT and confirming that none of the evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed any data to its construction. This addition will strengthen the generalization argument.
  Revision: yes
- Referee: [§5] §5 (Experiments): The abstract and result summary assert SOTA zero-shot performance yet supply no quantitative metrics (e.g., AUROC, PRO, pixel-level scores), ablation tables, or baseline details. Without these, the central empirical claim cannot be verified or compared to prior methods.
  Authors: We acknowledge the referee's point that the abstract and result summary would benefit from including specific quantitative metrics to allow immediate verification of the SOTA claims. Although the detailed results, including AUROC, PRO, pixel-level scores, ablation tables, and baseline comparisons, are provided in §5, we will revise the abstract and result summary to incorporate key numerical results and references to the relevant tables and figures.
  Revision: yes
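For readers unfamiliar with the image-level metric named in this exchange, AUROC can be computed directly from anomaly scores and binary labels as the fraction of (anomalous, normal) pairs that the scores rank correctly, with ties counted as half. The sketch below is a generic illustration of that definition, not the paper's evaluation code; the function name `auroc` and the convention that label 1 means anomalous are assumptions.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise ranking (label 1 = anomalous)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one normal and one anomalous sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # anomalous sample correctly scored higher
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a benchmark-scale evaluation would normally use a sorted-rank implementation from an established library.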
Circularity Check
No circularity: empirical framework evaluated on external benchmarks
The paper describes construction of the Indus-CoT dataset from global observations, local patches and expert priors, followed by tool orchestration and a gated reinforcement learning objective, then reports empirical zero-shot performance on the external benchmarks MVTec-AD, VisA, MPDD, DTD and SDD. No equations, derivations or first-principles results are presented that reduce by construction to fitted inputs, self-definitions or self-citation chains. The central claims rest on measured performance against independent test sets rather than any renaming, prediction-from-fit or load-bearing self-reference, rendering the reported method self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: External tools for region cropping, feature enhancement, and prior retrieval resolve visual ambiguities in industrial images.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  We propose IndusAgent, a tool-augmented agentic framework... gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat embedding and J-cost orbit
  Relation between the paper passage and the cited Recognition theorem: unclear
  Indus-CoT... global visual observations, high-resolution local patches, and expert normalcy priors
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.