Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Pith reviewed 2026-05-14 19:44 UTC · model grok-4.3
The pith
Seg-Agent lets off-the-shelf multimodal LLMs segment images from language instructions by running an iterative visual reasoning loop over marked regions on the image itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Explicit Multimodal Chain-of-Reasoning, built from generation, selection, and refinement stages that operate on Set-of-Mark visual prompts, enables an unmodified MLLM to perform accurate language-guided segmentation by reasoning directly in the visual domain rather than solely in text.
What carries the argument
An iterative visual reasoning loop that renders candidate regions as Set-of-Mark prompts so the MLLM can see and refine its spatial selections directly on the image.
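To make that loop concrete, here is a minimal sketch of how the three stages could be orchestrated around a promptable segmentation model. The `propose`, `render_marks`, `ask_mllm`, and `refine` hooks are hypothetical interfaces assumed for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    mask: object   # binary mask, e.g. an HxW numpy array
    mark_id: int   # numeric label drawn onto the image (Set-of-Mark)


def seg_agent_loop(
    image,
    instruction: str,
    propose: Callable[[object], List[Candidate]],               # generation stage
    render_marks: Callable[[object, List[Candidate]], object],  # SoM overlay
    ask_mllm: Callable[[object, str], str],                     # selection / critique
    refine: Callable[[object, Candidate, str], Candidate],      # refinement stage
    max_rounds: int = 3,
):
    """Generation -> selection -> refinement, iterated until the MLLM accepts a mask."""
    candidates = propose(image)                   # 1) generate candidate regions
    chosen = None
    for _ in range(max_rounds):
        marked = render_marks(image, candidates)  # 2) draw numbered marks onto the image
        answer = ask_mllm(
            marked,
            f"Which numbered region matches: '{instruction}'? Reply with the number, or 'none'.",
        ).strip().lower()
        if answer == "none":
            candidates = propose(image)           # nothing fits: regenerate and retry
            continue
        chosen = next((c for c in candidates if str(c.mark_id) == answer), None)
        if chosen is None:
            continue                              # unparseable answer: ask again
        verdict = ask_mllm(
            render_marks(image, [chosen]),
            "Does the marked region cover the target exactly? Reply 'yes' or describe what to fix.",
        )
        if verdict.strip().lower().startswith("yes"):
            break                                 # accepted: stop iterating
        chosen = refine(image, chosen, verdict)   # 3) refine using the MLLM's critique
        candidates = [chosen]
    return chosen
```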
If this is right
- Segmentation performance on arbitrary language instructions becomes available at inference time with no dataset collection or model updates.
- The same visual-loop mechanism can be applied to any foundational segmentation model that accepts point or box prompts.
- A single benchmark covering explicit semantic, generic-object, and reasoning-guided cases now exists to measure generalization without retraining.
Where Pith is reading between the lines
- The approach suggests that adding visual feedback loops can improve spatial grounding in other MLLM tasks such as referring expression comprehension or visual question answering.
- Because the method never updates weights, it can be combined with future stronger base MLLMs or segmentation models without additional training cost.
- The introduced benchmark offers a standardized way to compare training-free and training-based methods across diverse instruction types.
Load-bearing premise
That an off-the-shelf multimodal LLM can correctly interpret and act on visual marks overlaid on the image to select and refine regions without any fine-tuning.
What would settle it
A controlled test in which the MLLM, shown the image overlaid with Set-of-Mark prompts, repeatedly fails to select the region matching the language instruction and produces masks whose IoU remains below the supervised baseline.
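For reference, the IoU in such a test is the standard mask intersection-over-union; a minimal NumPy version, assuming binary masks of matching shape:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)
```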
Original abstract
Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.
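The Set-of-Mark prompting the abstract leans on amounts to drawing a numeric label (and usually an outline) on each candidate region before the image goes back to the MLLM. A rough OpenCV illustration follows; the contour and label-placement choices are assumptions for the sketch, not the paper's recipe.

```python
import cv2
import numpy as np

def render_set_of_mark(image: np.ndarray, masks: list) -> np.ndarray:
    """Overlay each binary mask's outline plus a numeric mark onto a copy of the image."""
    marked = image.copy()
    for i, mask in enumerate(masks, start=1):
        contours, _ = cv2.findContours(
            mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
        )
        cv2.drawContours(marked, contours, -1, color=(0, 255, 0), thickness=2)
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # skip empty candidate masks
        cx, cy = int(xs.mean()), int(ys.mean())  # place the label near the region centroid
        cv2.putText(marked, str(i), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (0, 255, 0), 2, cv2.LINE_AA)
    return marked
```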
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Seg-Agent, a training-free framework for language-guided segmentation that implements explicit multimodal chain-of-reasoning via a three-stage interactive loop (generation, selection, refinement) on off-the-shelf MLLMs. It uses Set-of-Mark (SoM) visual prompting to render candidate regions directly on the image, enabling the MLLM to perform spatial reasoning in the visual domain rather than text-only. The central claim is that this approach achieves performance comparable to state-of-the-art training-based methods without any parameter updates. The paper also presents Various-LangSeg, a new benchmark spanning explicit semantic, generic object, and reasoning-guided segmentation tasks, with experiments demonstrating effectiveness and robustness.
Significance. If the quantitative results and ablations hold, the work would be significant for demonstrating that unmodified MLLMs can achieve competitive language-guided segmentation through test-time visual feedback loops, reducing reliance on large-scale supervised training. The Various-LangSeg benchmark could become a useful standard for evaluating generalization across diverse language-guided scenarios.
major comments (2)
- [Abstract, §4 (Experiments)] The central claim of 'performance comparable to state-of-the-art training-based methods' is asserted without quantitative numbers, tables, or error analysis in the provided text; this is load-bearing because the training-free comparability rests entirely on unshown results.
- [§3 (Method)] The three-stage visual reasoning loop is presented as correcting the acknowledged limited spatial grounding of off-the-shelf MLLMs via SoM, yet no ablation isolates whether the selection and refinement steps reduce grounding errors or propagate them on reasoning-guided cases; this directly undermines the weakest assumption, that unmodified MLLMs can reliably execute iterative spatial selection without adaptation.
minor comments (2)
- [§4.1] The description of Various-LangSeg would benefit from explicit dataset statistics, construction protocol, and split details to allow reproducibility.
- [§3] Notation for the generation-selection-refinement stages could be formalized with pseudocode or equations to clarify the exact interaction with the MLLM.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We provide point-by-point responses to the major comments below.
Point-by-point responses
- Referee: [Abstract, §4 (Experiments)] The central claim of 'performance comparable to state-of-the-art training-based methods' is asserted without quantitative numbers, tables, or error analysis in the provided text; this is load-bearing because the training-free comparability rests entirely on unshown results.
Authors: The full manuscript's Section 4 includes quantitative tables, comparisons to state-of-the-art training-based methods, and error analyses that support the abstract's claim. We will revise to add explicit cross-references from the abstract and §3 to the specific tables and figures in §4. Revision: partial
- Referee: [§3 (Method)] The three-stage visual reasoning loop is presented as correcting the acknowledged limited spatial grounding of off-the-shelf MLLMs via SoM, yet no ablation isolates whether the selection and refinement steps reduce grounding errors or propagate them on reasoning-guided cases; this directly undermines the weakest assumption, that unmodified MLLMs can reliably execute iterative spatial selection without adaptation.
Authors: We agree that a targeted ablation isolating grounding-error reduction across stages, especially on reasoning-guided cases, would strengthen the paper. We will add this analysis in the revised version, measuring error rates before and after the selection and refinement stages on the relevant subset of Various-LangSeg. Revision: yes
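A sketch of how such a per-stage measurement could be scored: record mask IoU after generation, selection, and refinement for each example and compare the means, so stages that propagate rather than reduce errors become visible. The `run_stage` hook, `gt_mask` attribute, and `mask_iou` helper are assumed for illustration, not taken from the paper.

```python
from statistics import mean

def per_stage_iou(examples, run_stage, mask_iou):
    """Aggregate mask IoU after each stage of the loop.

    `run_stage(example, stage)` is a hypothetical hook returning the mask the
    pipeline holds after the named stage; `mask_iou(pred, gt)` is as above.
    """
    stages = ("generation", "selection", "refinement")
    per_stage = {s: [] for s in stages}
    for ex in examples:
        for s in stages:
            per_stage[s].append(mask_iou(run_stage(ex, s), ex.gt_mask))
    # If mean IoU drops from one stage to the next, that stage is adding errors.
    return {s: mean(scores) for s, scores in per_stage.items()}
```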
Circularity Check
No circularity: procedural framework with no derivation chain or fitted parameters
Full rationale
The paper describes Seg-Agent as a training-free, procedural framework that applies off-the-shelf MLLMs to SoM-rendered images in a three-stage generation-selection-refinement loop. No equations, parameters, or mathematical derivations are present that could reduce to self-definition, fitted inputs called predictions, or self-citation load-bearing steps. The central claim of comparable performance rests on external benchmark results (Various-LangSeg) rather than any internal reduction to the method's own inputs. This is the expected non-finding for a methods paper whose reasoning chain is entirely empirical and non-derivational.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Off-the-shelf MLLMs can perform reliable spatial selection and iterative refinement when shown Set-of-Mark visual prompts.