DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation
Pith reviewed 2026-05-20 20:39 UTC · model grok-4.3
The pith
Answer divergence between teacher and student identifies valuable prompts for multimodal distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Distillation minimizes distributional divergence, making a prompt valuable only when it reveals a capability gap via non-zero answer divergence. Up to 69% of prompts in typical chart and document datasets are zero-delta and thus ineffective. DeltaPrompts addresses this by using a staged pipeline to create 200,000 synthetic high-divergence problems from existing seeds, which then deliver up to 15% relative gains on ten reasoning benchmarks even atop strong models.
What carries the argument
The answer divergence metric Δ that quantifies differences in the answer distributions induced by teacher and student models on the same prompt.
If this is right
- Performance gains appear in on-policy distillation using the original teacher-student pair.
- The dataset transfers effectively to entirely new model families.
- Off-policy fine-tuning of models that lack reasoning ability also benefits.
- Improvement holds across chart, document, and perception-centric tasks.
Where Pith is reading between the lines
- Measuring divergence this way could guide data curation in supervised fine-tuning beyond distillation settings.
- An iterative version might regenerate prompts after each training round to chase remaining gaps.
- The same principle may apply when distilling capabilities other than reasoning, such as perception or generation.
Load-bearing premise
Non-zero divergence in answers between teacher and student points to a real capability gap whose correction through training produces broad generalization rather than narrow memorization.
What would settle it
Compare the downstream benchmark scores of students trained exclusively on zero-divergence prompts versus those trained on high-divergence prompts generated from the same seed data.
Figures
read the original abstract
Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that up to 69% of prompts in standard chart/document reasoning datasets are zero-delta (teacher and student induce identical answer distributions), providing negligible learning signal and causing rapid saturation in distillation. It introduces a staged synthesis pipeline that repurposes seeds to target student failure modes, yielding the 200k DeltaPrompts dataset of high answer-divergence (Δ) prompts. Across on-policy distillation, transfer to new model families, and off-policy fine-tuning, DeltaPrompts produces up to 15% relative gains on 10 benchmarks even atop optimized models such as Qwen3-VL-8B-Thinking.
Significance. If the gains prove attributable to high-Δ selection rather than generic synthetic-data effects, the work offers a practical lever for improving distillation efficiency in VLMs by focusing compute on prompts that expose genuine capability gaps. The multi-setting evaluation (on-policy, transfer, off-policy) strengthens potential applicability, though the result remains conditional on establishing causality of the Δ metric.
major comments (2)
- [Experimental Evaluation] Experimental Evaluation section: No ablation applies the identical staged synthesis pipeline while deliberately selecting low- or zero-Δ prompts for comparison against the reported high-Δ DeltaPrompts. This control is load-bearing for the central claim that non-zero divergence identifies functional gaps whose targeted synthesis yields genuine generalization; without it, gains may stem from data diversity or difficulty independent of Δ. The 15% relative improvement averaged over 10 benchmarks therefore lacks a direct test of the Δ criterion's necessity.
- [Methodology and Results] Methodology and Results sections: The 69% zero-delta statistic and the reported gains lack accompanying statistical tests, error analysis, variance across runs, or ablation details on Δ computation and thresholding. This weakens confidence that the zero-delta observation and performance lifts are robust rather than sensitive to specific model pairs or prompt sampling.
minor comments (2)
- [Abstract and Introduction] The abstract and introduction would benefit from an explicit formal definition or small example of answer divergence Δ early on, rather than deferring all details to later sections.
- [Figures and Tables] Figure and table captions could more clearly indicate whether error bars or confidence intervals are shown for the benchmark averages.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the causal claims around the Δ metric. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
read point-by-point responses
-
Referee: [Experimental Evaluation] Experimental Evaluation section: No ablation applies the identical staged synthesis pipeline while deliberately selecting low- or zero-Δ prompts for comparison against the reported high-Δ DeltaPrompts. This control is load-bearing for the central claim that non-zero divergence identifies functional gaps whose targeted synthesis yields genuine generalization; without it, gains may stem from data diversity or difficulty independent of Δ. The 15% relative improvement averaged over 10 benchmarks therefore lacks a direct test of the Δ criterion's necessity.
Authors: We agree that a controlled ablation using the identical synthesis pipeline but deliberately selecting low- or zero-Δ prompts is necessary to isolate the contribution of the Δ criterion from other factors such as data diversity or difficulty. The original manuscript compares DeltaPrompts primarily against standard off-the-shelf datasets rather than a low-Δ counterpart generated by the same pipeline. To address this directly, we have conducted the requested control experiment and will include the results in the revised Experimental Evaluation section. The low-Δ variant shows substantially smaller gains (roughly 4-6% relative improvement) compared to the high-Δ DeltaPrompts, providing evidence that the performance lifts are tied to targeting answer divergence rather than generic synthetic data properties. revision: yes
-
Referee: [Methodology and Results] Methodology and Results sections: The 69% zero-delta statistic and the reported gains lack accompanying statistical tests, error analysis, variance across runs, or ablation details on Δ computation and thresholding. This weakens confidence that the zero-delta observation and performance lifts are robust rather than sensitive to specific model pairs or prompt sampling.
Authors: We acknowledge that additional statistical rigor and implementation details would improve confidence in the robustness of the findings. In the revised manuscript we will add paired t-tests with p-values for the benchmark improvements, report standard deviations across three independent training runs with different seeds, and include an ablation on Δ thresholding (showing consistent trends for thresholds above 0.05). The 69% zero-delta figure will be accompanied by variance estimates across multiple teacher-student pairs. These elements will be incorporated into the Methodology and Results sections. revision: yes
Circularity Check
No circularity: derivation rests on external benchmarks and independent evaluation
full rationale
The paper begins from the standard principle that distillation minimizes distributional divergence, defines Δ as a direct quantification of teacher-student answer mismatch on a given prompt, generates synthetic data targeting high-Δ cases, and reports gains on held-out external benchmarks across on-policy, transfer, and off-policy settings. No equation or claim reduces the reported improvements or the utility of Δ to a tautological re-expression of the input data or a fitted parameter. The synthesis pipeline and Δ selection are explicit design choices whose causal contribution is tested via performance on separate benchmarks rather than by construction. Self-citation is not invoked as load-bearing justification for any uniqueness theorem or ansatz. This is the normal case of an empirical method paper whose central result is falsifiable outside its own fitted quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student.
invented entities (2)
-
zero-delta prompt
no independent evidence
-
DeltaPrompts dataset
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We quantify this gap through answer divergence (Δ), demonstrating that non-zero divergence is critical for effective scaling... L_distill(θ) = E_{x∼D} [D(π_T(·|x) || π_θ(·|x))]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
D. Acuna, C.-H. H. Yang, Y . Deng, J. Jung, X. Lu, P. Ammanabrolu, H. Kim, Y .-H. Liao, and Y . Choi. Long grounded thoughts: Synthesizing visual problems and reasoning chains at scale, 2026
work page 2026
-
[2]
R. Agarwal, N. Vieillard, Y . Zhou, P. Stanczyk, S. Ramos, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes, 2024
work page 2024
-
[3]
E. Agustsson and R. Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017
work page 2017
-
[4]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...
work page 2025
-
[5]
E. Borisova, N. Rauscher, and G. Rehm. SciVQA 2025: Overview of the first scientific visual question answering shared task. In T. Ghosal, P. Mayr, A. Singh, A. Naik, G. Rehm, D. Freitag, D. Li, S. Schimmler, and A. De Waard, editors,Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 182–210, Vienna, Austria, July 2025. As...
work page 2025
-
[6]
J.-S. Byun, J. Chun, J. Kil, and A. Perrault. Ares: Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse ai feedback, 2024
work page 2024
-
[7]
S. Chen, Y . Guo, Z. Su, Y . Li, Y . Wu, J. Chen, J. Chen, W. Wang, X. Qu, and Y . Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning, 2026
work page 2026
-
[8]
T. Chu, Y . Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V . Le, S. Levine, and Y . Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025
work page 2025
-
[9]
A. Didolkar, A. Goyal, N. R. Ke, S. Guo, M. Valko, T. Lillicrap, D. Rezende, Y . Bengio, M. Mozer, and S. Arora. Metacognitive capabilities of llms: An exploration in mathematical problem solving, 2024
work page 2024
-
[10]
Y . Ding, S. Luo, H. Chung, and S. C. Han. Pdf-vqa: A new dataset for real-world vqa on pdf documents. In G. De Francisci Morales, C. Perlich, N. Ruchansky, N. Kourtellis, E. Baralis, and F. Bonchi, editors,Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, pages 585–601, Cham, 2023. Springer Nature Switzerland
work page 2023
-
[11]
H. Duan. RealWorldQA, What’s New? https://huggingface.co/blog/KennyUTC/realworldqa, 2024
work page 2024
-
[12]
H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024
work page 2024
-
[13]
Y . Gu, L. Dong, F. Wei, and M. Huang. Minillm: On-policy distillation of large language models, 2026
work page 2026
-
[14]
Y . Hao, Z. Li, L. Sun, W. Wang, N. Yi, S. Song, C. Qin, M. Zhou, Y . Zhan, and X. Lang. Driveaction: A benchmark for exploring human-like driving decisions in vla models, 2025
work page 2025
-
[15]
C. He, Y . Chen, C. Xiao, X. Han, and L. Wen. Student-in-the-loop chain-of-thought distillation via generation-time selection, 2026
work page 2026
- [16]
- [17]
- [18]
-
[19]
J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, and A. Krause. Reinforcement learning via self-distillation, 2026
work page 2026
-
[20]
T. Jain, C. Lennan, Z. John, and D. Tran. Imagededup. https://github.com/idealo/ imagededup, 2019
work page 2019
- [21]
-
[22]
W. Jin, T. Min, Y . Yang, S. R. Kadhe, Y . Zhou, D. Wei, N. Baracaldo, and K. Lee. Entropy-aware on-policy distillation of language models, 2026
work page 2026
-
[23]
J. Jung, S. Han, X. Lu, S. Hallinan, D. Acuna, S. Prabhumoye, M. Patwary, M. Shoeybi, B. Catanzaro, and Y . Choi. Prismatic synthesis: Gradient-based data diversification boosts generalization in LLM reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[24]
J. Jung, P. West, L. Jiang, F. Brahman, X. Lu, J. Fisher, T. Sorensen, and Y . Choi. Impossible distillation: from low-quality model to high-quality dataset & model for summarization and paraphrasing, 2024
work page 2024
-
[25]
S. Jung, S. Yoon, D. Kim, and H. Lee. Todi: Token-wise distillation via fine-grained divergence control, 2025
work page 2025
-
[26]
S. Kaur, S. Park, A. Goyal, and S. Arora. Instruct-skillmix: A powerful pipeline for llm instruction tuning, 2024
work page 2024
-
[27]
J. Kim, X. Luo, M. Kim, S. Lee, D. Kim, J. Jeon, D. Li, and Y . Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026
work page 2026
-
[28]
Y . Kim and A. M. Rush. Sequence-level knowledge distillation. In J. Su, K. Duh, and X. Car- reras, editors,Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas, Nov. 2016. Association for Computational Lin- guistics
work page 2016
-
[29]
J. Ko, T. Chen, S. Kim, T. Ding, L. Liang, I. Zharkov, and S.-Y . Yun. Distillm-2: A contrastive approach boosts the distillation of llms, 2025
work page 2025
-
[30]
J. Ko, S. Kim, T. Chen, and S.-Y . Yun. Distillm: Towards streamlined distillation for large language models, 2024
work page 2024
-
[31]
W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[32]
X. Lai, J. Li, W. Li, T. Liu, T. Li, and H. Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025
work page 2025
-
[33]
B. Li, Y . Ge, Y . Chen, Y . Ge, R. Zhang, and Y . Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension, 2024
work page 2024
-
[34]
L. Li, Y . Lin, S. Ren, P. Li, J. Zhou, and X. Sun. Dynamic knowledge distillation for pre-trained language models, 2021
work page 2021
-
[35]
L. Li, Y . Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 14369–14387, Bangkok, T...
work page 2024
-
[36]
Z. Li, X. Zhang, Y . Zhang, D. Long, P. Xie, and M. Zhang. Towards general text embeddings with multi-stage contrastive learning, 2023
work page 2023
-
[37]
H. Lin, Z. Liu, Y . Zhu, C. Qin, J. Lin, X. Shang, C. He, W. Zhang, and L. Wu. Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026
work page 2026
-
[38]
J. Liu, J. Wu, X. Pan, G. Cheung, S. Ma, and C. Tao. Air: Post-training data selection for reasoning via attention head influence, 2025
work page 2025
- [39]
-
[40]
A. Lu, T. Feng, H. Yuan, W. Li, and Y . Sun. Why does rl generalize better than sft? a data-centric perspective on vlm post-training, 2026
work page 2026
- [41]
-
[42]
K. Lu, H. Yuan, Z. Yuan, R. Lin, J. Lin, C. Tan, C. Zhou, and J. Zhou. #instag: Instruction tagging for analyzing supervised fine-tuning of large language models, 2023
work page 2023
-
[43]
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024
work page 2024
-
[44]
A. Masry, M. S. Islam, M. Ahmed, A. Bajaj, F. Kabir, A. Kartha, M. T. R. Laskar, M. Rah- man, S. Rahman, M. Shahmohammadi, M. Thakkar, M. R. Parvez, E. Hoque, and S. Joty. ChartQAPro: A more diverse and challenging benchmark for chart question answering. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Findings of the Association for Computa...
work page 2025
- [45]
- [46]
-
[47]
Gpt-4o mini: advancing cost-efficient intelligence, 2024
OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024
work page 2024
-
[48]
Huggingface Hub: SamaAI/sama-drives-california
SamaAI. Huggingface Hub: SamaAI/sama-drives-california. https://huggingface.co/datasets/SamaAI/sama-drives-california, 2023
work page 2023
-
[49]
V . Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, 2020
work page 2020
- [50]
-
[51]
W. Shen, J. Pei, Y . Peng, X. Song, Y . Liu, J. Peng, H. Sun, Y . Hao, P. Wang, J. Zhang, and Y . Zhou. Skywork-r1v3 technical report, 2025
work page 2025
-
[52]
I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal. Self-distillation enables continual learning, 2026
work page 2026
-
[53]
HybridFlow: A Flexible and Efficient RLHF Framework
G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y . Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
V . Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y . Wang, Y . Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, H. Li, J. Zhu, J. Chen, J. Xu, J. Xu, J. Chen, J. Lin, J. Chen, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M...
work page 2026
-
[55]
J. Wang, E. Briakou, H. Dadkhahi, R. Agarwal, C. Cherry, and T. Cohn. Don’t throw away data: Improving sequence knowledge distillation with minimum bayes risk decoding. InScaling Self-Improving F oundation Models without Human Supervision, 2025
work page 2025
-
[56]
W. Wang, L. Ding, M. Zeng, X. Zhou, L. Shen, Y . Luo, and D. Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models, 2024
work page 2024
-
[57]
Z. Wang, M. Xia, L. He, H. Chen, Y . Liu, R. Zhu, K. Liang, X. Wu, H. Liu, S. Malladi, A. Chevalier, S. Arora, and D. Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms, 2024
work page 2024
-
[58]
Y . Wen, Z. Li, W. Du, and L. Mou. f-divergence minimization for sequence-level knowledge distillation, 2023
work page 2023
-
[59]
L. Wiedmann, O. Zohar, A. Mahla, X. Wang, R. Li, T. Frere, L. von Werra, A. R. Gosthipaty, and A. Marafioti. Finevision: Open data is all you need, 2025
work page 2025
- [60]
-
[61]
T. Wu, C. Tao, J. Wang, R. Yang, Z. Zhao, and N. Wong. Rethinking Kullback-Leibler divergence in knowledge distillation for large language models. In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 5737–5755, Abu Dhabi, UAE, J...
work page 2025
-
[62]
Y . Xiao, E. Sun, T. Liu, and W. Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts, 2024
work page 2024
-
[63]
Y . Xu, H. Sang, Z. Zhou, R. He, Z. Wang, and A. Geramifard. Tip: Token importance in on-policy distillation, 2026
work page 2026
-
[64]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...
work page 2025
-
[65]
C. Yang, C. Qin, Q. Si, M. Chen, N. Gu, D. Yao, Z. Lin, W. Wang, J. Wang, and N. Duan. Self-distilled rlvr, 2026
work page 2026
-
[66]
Y . Yang, X. He, H. Pan, X. Jiang, Y . Deng, X. Yang, H. Lu, D. Yin, F. Rao, M. Zhu, B. Zhang, and W. Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025
work page 2025
-
[67]
Y . Yang, M. Lai, W. Zhao, X. Fan, Z. Xi, M. Wu, C. Huang, J. Zhao, H. Lv, J. Tong, Y . Zhou, Y . Zou, Q. Guo, T. Gui, Q. Zhang, and X. Huang. Which reasoning trajectories teach students to reason better? a simple metric of informative alignment, 2026
work page 2026
-
[68]
T. Ye, L. Dong, Z. Chi, X. Wu, S. Huang, and F. Wei. Black-box on-policy distillation of large language models, 2026
work page 2026
-
[69]
T. Ye, L. Dong, X. Wu, S. Huang, and F. Wei. On-policy context distillation for language models, 2026
work page 2026
-
[70]
X. Yue, T. Zheng, Y . Ni, Y . Wang, K. Zhang, S. Tong, Y . Sun, B. Yu, G. Zhang, H. Sun, Y . Su, W. Chen, and G. Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025. 13
work page 2025
- [71]
- [72]
- [73]
-
[74]
S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026
work page 2026
-
[75]
Z. Zheng, M. Yang, J. Hong, C. Zhao, G. Xu, L. Yang, C. Shen, and X. Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2026
work page 2026
-
[76]
J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y . Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y . Cao, Y . Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y . He, T. Jiang, J. Luo, Y . Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y . Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...
work page 2025
-
[77]
as seed datasets, and for real-world perception-centric reasoning, we use DeepEyes47k [ 75], DriveAction [14] and VisualProbetrain [32]. We source images for the new prompts from arXivQA, SciVQA, PDF-VQA [10] (chart & document reasoning), and DeepEyes47k, VisualProbetrain, Div2k & Flickr2k [3], and sama-drives-california [48] (real-world perception-centri...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.