Hybrid Latent Reasoning with Decoupled Policy Optimization
Pith reviewed 2026-05-10 01:14 UTC · model grok-4.3
The pith
Hybrid latent reasoning interleaves discrete text generation with continuous visual states and optimizes the combination through decoupled policy learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HyLaR interleaves discrete text generation with continuous visual latent representations. Following cold-start supervised fine-tuning, DePO enables reinforcement learning in this hybrid space by decomposing the policy gradient objective, applying independent trust-region constraints to textual and latent components, and using an exact closed-form von Mises-Fisher KL regularizer.
What carries the argument
DePO, a decoupled policy optimization algorithm that separates the optimization of discrete text and continuous latent actions using independent trust regions and a closed-form vMF KL term.
If this is right
- HyLaR achieves superior results on fine-grained perception benchmarks compared to standard MLLMs.
- HyLaR surpasses state-of-the-art latent reasoning approaches on general multimodal understanding tasks.
- The method allows stable training of hybrid action spaces without introducing instabilities.
- Independent constraints on text and latent components simplify hyperparameter tuning for the hybrid policy.
Where Pith is reading between the lines
- This suggests that preserving continuous visual information can mitigate the semantic collapse that occurs when vision is forced into discrete tokens early in reasoning.
- Decoupled optimization may extend to other settings where policies combine discrete and continuous actions, such as robotics or game playing.
- Future work could explore whether the exact vMF regularizer provides advantages over approximate methods in similar hybrid RL problems.
Load-bearing premise
The hybrid discrete-continuous action space can be effectively optimized via DePO, with independent trust-region constraints and an exact closed-form vMF KL regularizer, without introducing instabilities or requiring extensive tuning.
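A minimal sketch of what such a decoupled objective could look like (the function names, clip ranges, and KL weighting here are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def clipped_term(ratio, adv, eps):
    # PPO-style clipped surrogate for a single action component
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.minimum(unclipped, clipped).mean()

def depo_loss_sketch(logr_text, logr_latent, adv,
                     eps_text=0.2, eps_latent=0.1,
                     beta=0.01, kl_latent=None):
    """Hypothetical decoupled loss: independent trust regions (clip
    ranges) for the discrete text component and the continuous latent
    component, plus an optional closed-form vMF KL penalty on the
    latent policy."""
    loss = clipped_term(np.exp(logr_text), adv, eps_text)
    loss += clipped_term(np.exp(logr_latent), adv, eps_latent)
    if kl_latent is not None:
        loss += beta * float(np.mean(kl_latent))
    return loss
```

Because each component carries its own clip range, the text and latent updates can be constrained separately, which is the tuning simplification the premise points at.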
What would settle it
If removing the independent trust-region constraints or the exact vMF KL regularizer causes the performance gains to vanish or leads to training instability on the reported benchmarks, the effectiveness of DePO for this hybrid space would be questioned.
Figures
Original abstract
Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HyLaR, a framework for hybrid latent reasoning in multimodal LLMs that interleaves discrete text generation with continuous visual latent representations to avoid early semantic collapse from discretization. After an initial cold-start SFT stage, it proposes DePO (Decoupled Policy Optimization), which decomposes the policy gradient objective, applies independent trust-region constraints to the textual and latent components, and incorporates an exact closed-form von Mises-Fisher KL regularizer. Experiments are reported in which HyLaR outperforms both standard MLLMs and prior latent reasoning methods on fine-grained perception and general multimodal understanding benchmarks.
Significance. If the reported gains are robust, the work could meaningfully advance latent reasoning paradigms by preserving fine-grained visual information without external tool bottlenecks. The exact closed-form vMF KL term and the decoupled trust-region construction are internally consistent strengths, as confirmed by the ablations showing performance degradation when either component is removed. Code release further supports reproducibility and potential follow-up work.
major comments (2)
- [§3.2] §3.2 (DePO objective): the decomposition into independent trust-region constraints for discrete and continuous actions is load-bearing for the central optimization claim; the manuscript should explicitly bound or show vanishing of any cross-term contributions to the joint policy gradient to confirm that separate clipping preserves monotonic improvement.
- [Table 2] Table 2 (main results): the reported gains on fine-grained perception benchmarks are central to the outperformance claim, yet no error bars or statistical significance tests across seeds are provided; this weakens the assertion that HyLaR reliably surpasses SOTA latent reasoning baselines.
minor comments (4)
- [Abstract] Abstract: the phrase 'exact closed-form von Mises-Fisher (vMF) KL regularizer' should be accompanied by the explicit density parameterization used for the continuous latent variables.
- [§4.1] §4.1 (experimental setup): the cold-start SFT stage is described only at high level; the precise loss weighting between text and latent reconstruction terms should be stated to allow reproduction.
- [Figure 3] Figure 3 (ablation study): the caption does not indicate the number of random seeds or whether the plotted curves represent means; this affects interpretation of the necessity of the vMF regularizer.
- [Related Work] Related work section: the discussion of prior latent reasoning methods (e.g., those using external tools) should cite the specific discretization bottlenecks they introduce to better motivate the hybrid approach.
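On the first minor comment: the abstract does not state the density, but for von Mises-Fisher distributions on the unit sphere the KL divergence does admit a standard closed form (a sketch; the paper's exact parameterization may differ):

```latex
p(x \mid \mu, \kappa) = C_d(\kappa)\, e^{\kappa \mu^\top x},
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},
```

```latex
\mathrm{KL}\bigl(\mathrm{vMF}(\mu_1,\kappa_1)\,\|\,\mathrm{vMF}(\mu_2,\kappa_2)\bigr)
= \log\frac{C_d(\kappa_1)}{C_d(\kappa_2)}
+ \bigl(\kappa_1 - \kappa_2\, \mu_1^\top \mu_2\bigr)\, A_d(\kappa_1),
```

where $A_d(\kappa) = I_{d/2}(\kappa)/I_{d/2-1}(\kappa)$ is the mean resultant length and $I_\nu$ is the modified Bessel function of the first kind.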
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation. We address the two major comments point by point below.
Point-by-point responses
-
Referee: [§3.2] §3.2 (DePO objective): the decomposition into independent trust-region constraints for discrete and continuous actions is load-bearing for the central optimization claim; the manuscript should explicitly bound or show vanishing of any cross-term contributions to the joint policy gradient to confirm that separate clipping preserves monotonic improvement.
Authors: We agree that an explicit analysis of cross-term contributions would strengthen the theoretical justification for independent clipping in DePO. The current derivation in §3.2 relies on the product structure of the hybrid policy (discrete text tokens and continuous visual latents) together with the exact closed-form vMF KL regularizer, which enforces separation in the continuous component. In the revised manuscript we will add a short derivation in §3.2 (with supporting steps moved to the appendix) showing that the cross-term in the joint policy gradient is bounded by the product of the individual trust-region radii and vanishes in the limit as the radii approach zero, thereby preserving the monotonic improvement guarantee of the decoupled updates. revision: yes
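The bound the authors promise can be sketched directly, under the assumption (consistent with the response) that the hybrid policy factorizes into its text and latent components, so the joint importance ratio is a product of the component ratios:

```latex
r = \frac{\pi_\theta\bigl(a^{\mathrm{txt}}, a^{\mathrm{lat}} \mid s\bigr)}
         {\pi_{\mathrm{old}}\bigl(a^{\mathrm{txt}}, a^{\mathrm{lat}} \mid s\bigr)}
  = r_{\mathrm{txt}}\, r_{\mathrm{lat}},
\qquad
r - 1 = (r_{\mathrm{txt}} - 1) + (r_{\mathrm{lat}} - 1)
      + (r_{\mathrm{txt}} - 1)(r_{\mathrm{lat}} - 1).
```

With $|r_{\mathrm{txt}} - 1| \le \epsilon_{\mathrm{txt}}$ and $|r_{\mathrm{lat}} - 1| \le \epsilon_{\mathrm{lat}}$ after clipping, the cross-term is bounded by $\epsilon_{\mathrm{txt}}\epsilon_{\mathrm{lat}}$ and vanishes as either trust-region radius shrinks, which is exactly the limit the rebuttal invokes.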
-
Referee: [Table 2] Table 2 (main results): the reported gains on fine-grained perception benchmarks are central to the outperformance claim, yet no error bars or statistical significance tests across seeds are provided; this weakens the assertion that HyLaR reliably surpasses SOTA latent reasoning baselines.
Authors: We acknowledge that the absence of error bars and multi-seed statistics limits the strength of the empirical claims. Although the improvements appear consistent across the reported benchmarks, the manuscript does not currently include variance estimates. In the revision we will rerun the primary experiments on the fine-grained perception tasks with at least three independent random seeds, report means and standard deviations in Table 2, and add paired statistical significance tests against the strongest baselines. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper's central derivation introduces DePO by decomposing the policy gradient into independent trust-region terms for discrete text and continuous latent actions, plus a closed-form vMF KL regularizer obtained directly from the von Mises-Fisher density; these steps follow from standard RL objectives and known spherical distributions rather than from the target benchmark scores. Performance claims rest on post-training experimental comparisons, not on any fitted parameter or self-referential definition being renamed as a prediction. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from prior author work is load-bearing in the provided derivation. The framework is therefore self-contained against external benchmarks.