Recognition: no theorem link
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Pith reviewed 2026-05-16 12:27 UTC · model grok-4.3
The pith
Reinforcement learning with format and accuracy rewards enables explicit reasoning chains to guide image segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seg-Zero introduces a decoupled architecture where a reasoning model interprets user intentions, generates explicit reasoning chains, and creates positional prompts for a separate segmentation model to produce pixel-level masks. The system is trained exclusively with reinforcement learning using the GRPO algorithm and a reward mechanism that combines format rewards for proper output structure with accuracy rewards for correct segmentation results, without any explicit reasoning supervision data.
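As a concrete illustration, here is a minimal sketch of such a combined reward, assuming the reasoning model emits a <think>...</think><answer>...</answer> completion whose answer is a JSON object carrying a bounding-box positional prompt. The tag names, answer schema, and 0.5 IoU threshold are illustrative assumptions, not the paper's exact specification.

```python
import json
import re

def format_reward(output: str) -> float:
    """1.0 when the completion matches the expected
    <think>...</think><answer>...</answer> structure and the answer is
    JSON containing a bounding box, else 0.0."""
    m = re.fullmatch(r"\s*<think>.*</think>\s*<answer>(.*)</answer>\s*",
                     output, re.DOTALL)
    if m is None:
        return 0.0
    try:
        answer = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if "bbox" in answer else 0.0

def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def accuracy_reward(pred_bbox, gt_bbox, thresh=0.5) -> float:
    """Binary reward: did the positional prompt land on the right object?"""
    return 1.0 if box_iou(pred_bbox, gt_bbox) >= thresh else 0.0

def total_reward(output: str, gt_bbox) -> float:
    """Format gates accuracy: a malformed completion scores zero overall."""
    if format_reward(output) == 0.0:
        return 0.0
    pred = json.loads(re.search(r"<answer>(.*)</answer>", output,
                                re.DOTALL).group(1))
    return 1.0 + accuracy_reward(pred["bbox"], gt_bbox)
```

Because the format reward gates the accuracy reward in this sketch, malformed completions earn nothing, which pushes the policy toward structured reasoning before it can be rewarded for correctness.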
What carries the argument
The decoupled reasoning model and segmentation model pair, trained via GRPO reinforcement learning driven by format and accuracy rewards.
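GRPO itself needs no learned value network: it samples a group of completions per query and normalizes their rewards within the group. A minimal sketch of the advantage computation, assuming a group of eight completions (group size and names are illustrative):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: z-score each sampled completion's reward
    against the other completions drawn for the same query."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Eight completions for one query, scored by the format + accuracy reward.
rewards = np.array([0.0, 2.0, 1.0, 0.0, 2.0, 0.0, 1.0, 2.0])
advantages = grpo_advantages(rewards)
# Above-average completions get positive advantage and are reinforced; the
# policy update then typically applies a PPO-style clipped importance ratio
# weighted by these values, plus a KL penalty toward a reference model.
```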
If this is right
- The model produces visible chain-of-thought reasoning during inference.
- Zero-shot performance on the ReasonSeg benchmark reaches 57.5, surpassing the prior LISA-7B by 18%.
- Generalization to out-of-domain segmentation tasks improves without additional supervised training.
- Reasoning capabilities emerge at test time from the reinforcement training process.
Where Pith is reading between the lines
- Similar reward-based training could be applied to develop reasoning in other computer vision tasks such as object detection or image captioning.
- The separation of reasoning and execution modules might allow easier debugging of where errors occur in complex queries.
- Further scaling the reinforcement process could lead to more sophisticated reasoning chains for highly ambiguous segmentation requests.
Load-bearing premise
The combination of format and accuracy rewards through reinforcement learning will consistently generate useful and generalizable reasoning rather than superficial patterns tuned to the training examples.
What would settle it
Run the model on a new set of segmentation queries involving reasoning steps absent from the training distribution, and measure whether the explicit reasoning chains correlate with higher segmentation accuracy or if performance drops to baseline levels.
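One way to operationalize this test, as a sketch: evaluate on held-out queries, record per-sample mask IoU alongside whether a well-formed reasoning chain was produced, and compare the two populations. All names here are hypothetical.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean segmentation masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0

def chain_vs_accuracy(samples):
    """samples: iterable of (pred_mask, gt_mask, has_valid_chain) triples
    evaluated on queries whose reasoning steps never appear in training.
    Returns mean IoU with and without a well-formed reasoning chain; a
    large gap suggests the chains carry real signal rather than ritual."""
    scored = {True: [], False: []}
    for pred, gt, has_chain in samples:
        scored[bool(has_chain)].append(mask_iou(pred, gt))
    return np.mean(scored[True]), np.mean(scored[False])
```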
Original abstract
Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting its out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precious pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Seg-Zero, a decoupled architecture with a reasoning model that generates explicit chain-of-thought and positional prompts, followed by a segmentation model that produces pixel-level masks. The system is trained exclusively via GRPO reinforcement learning using only format and accuracy rewards, with no explicit reasoning supervision or labeled reasoning data. It claims robust zero-shot generalization and reports that Seg-Zero-7B achieves 57.5 on the ReasonSeg benchmark, an 18% improvement over prior LISA-7B.
Significance. If the central performance claim and the emergence of useful reasoning chains are substantiated, the work would demonstrate that simple reward signals in a decoupled RL setup can induce generalizable chain-of-thought for reasoning segmentation without supervised reasoning data. This would be a notable data-efficient advance for zero-shot multimodal tasks. The public code release strengthens reproducibility.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: The headline result of 57.5 zero-shot on ReasonSeg (18% above LISA-7B) is stated without any description of the training data mixture, exact baseline implementations, number of runs, or statistical significance testing. This absence prevents verification of the central empirical claim.
- [Method] Method section: The format-plus-accuracy reward is presented as sufficient to induce useful reasoning chains, yet no ablation removes or randomizes the reasoning tokens while keeping the segmentation model fixed (see the sketch after this list). Without such a control, it remains unclear whether the accuracy signal shapes logically grounded prompts or merely optimizes superficial output patterns that score well on the training distribution.
- [Experiments] Experiments section: No qualitative examples of generated reasoning chains are shown, nor is there a comparison of mask quality when the reasoning model is replaced by a non-reasoning baseline. These omissions leave the claim of emergent test-time reasoning unsupported by direct evidence.
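A control for the second major comment could look like the following sketch: overwrite the reasoning model's positional prompts with random or fixed boxes and re-run the frozen segmentation model. Function names, the image size, and the box scheme are illustrative assumptions, not the paper's protocol.

```python
import random

def ablate_positional_prompts(outputs, image_size=(640, 640), mode="random"):
    """Control condition: replace each learned positional prompt with a
    random or fixed box while the segmentation model stays frozen. If
    accuracy under the learned prompts does not beat this control, the
    reasoning chains are likely superficial."""
    w, h = image_size
    for out in outputs:
        if mode == "random":
            x1, y1 = random.uniform(0, w * 0.5), random.uniform(0, h * 0.5)
            out["bbox"] = [x1, y1,
                           x1 + random.uniform(0.1, 0.5) * w,
                           y1 + random.uniform(0.1, 0.5) * h]
        else:  # "fixed": a dataset-agnostic center box
            out["bbox"] = [0.25 * w, 0.25 * h, 0.75 * w, 0.75 * h]
    return outputs
```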
minor comments (1)
- [Abstract] Abstract: 'precious pixel-level masks' is likely a typo for 'precise pixel-level masks'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: The headline result of 57.5 zero-shot on ReasonSeg (18% above LISA-7B) is stated without any description of the training data mixture, exact baseline implementations, number of runs, or statistical significance testing. This absence prevents verification of the central empirical claim.
Authors: We thank the referee for highlighting this. To address the concern, we will expand the Experiments section to include a comprehensive description of the training data mixture, details on how baselines were implemented (including hyperparameters and code references), results from multiple runs with standard deviations, and appropriate statistical tests. This will allow for better verification of the central claims. Revision: yes.
-
Referee: [Method] Method section: The format-plus-accuracy reward is presented as sufficient to induce useful reasoning chains, yet no ablation removes or randomizes the reasoning tokens while keeping the segmentation model fixed. Without such a control, it remains unclear whether the accuracy signal shapes logically grounded prompts or merely optimizes superficial output patterns that score well on the training distribution.
Authors: We appreciate this point and acknowledge the value of such an ablation. We will include an ablation study in the revised version where we replace the reasoning model's output with randomized or fixed positional prompts (keeping the segmentation model unchanged) to demonstrate that the learned reasoning chains contribute to the performance gains beyond superficial patterns. Revision: yes.
-
Referee: [Experiments] Experiments section: No qualitative examples of generated reasoning chains are shown, nor is there a comparison of mask quality when the reasoning model is replaced by a non-reasoning baseline. These omissions leave the claim of emergent test-time reasoning unsupported by direct evidence.
Authors: We will add qualitative visualizations of the reasoning chains generated by Seg-Zero in the Experiments section to provide direct evidence of the emergent reasoning. Furthermore, we will include a comparison experiment replacing the reasoning model with a non-reasoning baseline (e.g., direct instruction without chain-of-thought) and report the resulting mask quality metrics to support the claim. Revision: yes.
Circularity Check
No circularity: empirical benchmark result independent of training inputs
Full rationale
The paper reports a measured zero-shot score of 57.5 on the external ReasonSeg benchmark after GRPO training with format+accuracy rewards. This performance number is obtained by direct evaluation on held-out data and does not reduce to any fitted parameter, self-defined quantity, or self-citation chain. No equations, uniqueness theorems, or ansatzes are presented that would make the central claim equivalent to its inputs by construction. The emergence of reasoning chains is an empirical observation rather than a deductive step that loops back on itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward weights for format and accuracy
axioms (1)
- Domain assumption: Reinforcement learning with format and accuracy rewards can produce explicit, generalizable chain-of-thought reasoning without any supervised reasoning examples.
Forward citations
Cited by 21 Pith papers
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
-
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
-
Topo-R1: Detecting Topological Anomalies via Vision-Language Models
Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.
-
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
-
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
Image Generators are Generalist Vision Learners
Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
-
Image Generators are Generalist Vision Learners
Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
-
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
-
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
-
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
-
[1]
Segnet: A deep convolutional encoder-decoder architecture for image segmentation
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
-
[2]
Qwen2.5-VL technical report
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
-
[3]
One token to seg them all: Language instructed reasoning segmentation in videos
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. Advances in Neural Information Processing Systems, 37:6833–6859, 2025.
-
[4]
Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
-
[5]
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
-
[6]
Encoder-decoder with atrous separable convolution for semantic image segmentation
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
-
[7]
Sam4mllm: Enhance multimodal large language model for referring expression segmentation
Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multimodal large language model for referring expression segmentation. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
-
[8]
Per-pixel classification is not all you need for semantic segmentation
Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34:17864–17875, 2021.
-
[9]
Masked-attention mask transformer for universal image segmentation
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
-
[10]
Alphazero-like tree-search can guide large language model decoding and training
Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
-
[11]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
-
[12]
Partimagenet: A large, high-quality dataset of parts
Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. In European Conference on Computer Vision, pages 128–145. Springer, 2022.
-
[13]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
-
[14]
Segment anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
-
[15]
Training language models to self-correct via reinforcement learning
Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024.
-
[16]
Open R1 Multimodal
EvolvingLMMs Lab. Open R1 Multimodal. https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025.
-
[17]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
-
[18]
Mini-gemini: Mining the potential of multi-modality vision language models
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
-
[19]
Open-vocabulary semantic segmentation with mask-adapted clip
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023.
-
[20]
Let’s verify step by step
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
-
[21]
Refinenet: Multi-path refinement networks for high-resolution semantic segmentation
Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1925–1934, 2017.
-
[22]
GRES: Generalized referring expression segmentation
Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In CVPR, 2023.
-
[23]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
-
[24]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024.
-
[25]
Fully convolutional networks for semantic segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-
[26]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- [27]
-
[28]
Perceptiongpt: Effectively fusing visual perception into llm
Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27124–27133, 2024.
-
[29]
Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020.
-
[30]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
-
[31]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
-
[32]
Pixellm: Pixel reasoning with large multimodal model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.
-
[33]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
-
[34]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
-
[35]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
-
[36]
R1-V
R1-V Team. R1-V. https://github.com/Deep-Agent/R1-V?tab=readme-ov-file, 2025.
-
[37]
Solving olympiad geometry without human demonstrations
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.
-
[38]
Solving math word problems with process- and outcome-based feedback
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
-
[39]
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023.
-
[40]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
-
[41]
Lisa++: An improved baseline for reasoning segmentation with large language model
Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023.
-
[42]
Lavt: Language-aware vision transformer for referring image segmentation
Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022.
-
[43]
Modeling context in referring expressions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016.
-
[44]
Pyramid scene parsing network
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
-
[45]
Lyra: An efficient and speech-centric framework for omni-cognition
Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, et al. Lyra: An efficient and speech-centric framework for omni-cognition. arXiv preprint arXiv:2412.09501, 2024.