
arxiv: 2605.15532 · v2 · pith:34ARD4PWnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI · cs.CL

DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

Pith reviewed 2026-05-20 20:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords multimodal distillation · vision-language models · answer divergence · prompt synthesis · reasoning benchmarks · distillation data · zero-delta prompts · DeltaPrompts

The pith

Answer divergence between teacher and student identifies valuable prompts for multimodal distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard datasets for distilling reasoning into vision-language models contain many prompts where the teacher and student already produce identical answers. These zero-delta prompts deliver almost no training signal, so student performance stops improving even with more data. The paper argues that only prompts with measurable answer divergence provide the necessary functional gaps for real progress. It introduces a synthesis approach to generate prompts that deliberately target these gaps, resulting in a dataset that improves models across different distillation scenarios.

Core claim

Distillation minimizes distributional divergence, making a prompt valuable only when it reveals a capability gap via non-zero answer divergence. Up to 69% of prompts in typical chart and document datasets are zero-delta and thus ineffective. DeltaPrompts addresses this by using a staged pipeline to create 200,000 synthetic high-divergence problems from existing seeds, which then deliver up to 15% relative gains on ten reasoning benchmarks even atop strong models.
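Because the headline figure is a relative gain, it helps to keep it apart from an absolute one. The worked example below uses a baseline average and a post-distillation average that are purely hypothetical (they are not taken from the paper), and the symbols s_before and s_after are introduced here only for illustration.

```latex
\[
\text{relative gain} \;=\; \frac{s_{\text{after}} - s_{\text{before}}}{s_{\text{before}}},
\qquad
\text{e.g.}\quad \frac{69.0 - 60.0}{60.0} = 0.15 .
\]
```

Under these assumed numbers, a 15% relative gain corresponds to 9 absolute points; the same relative gain on a lower baseline would amount to fewer points.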

What carries the argument

The answer divergence metric Δ carries the argument: it quantifies how far apart the answer distributions induced by the teacher and the student are on the same prompt.
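To make the idea of comparing answer distributions concrete, here is a minimal sketch. It assumes that each model is sampled several times per prompt and that Δ is measured as the total variation distance between the two empirical answer distributions. That choice is an assumption: this page does not give the paper's exact definition of Δ, and the function names and sample answers are hypothetical.

```python
from collections import Counter


def answer_distribution(answers):
    """Empirical distribution over final answers sampled for one prompt."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def answer_divergence(teacher_answers, student_answers):
    """Total variation distance between teacher and student answer distributions.

    Assumption: total variation is a stand-in; the paper's exact formula
    for the divergence is not stated on this page.
    """
    p = answer_distribution(teacher_answers)
    q = answer_distribution(student_answers)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Hypothetical sampled answers (8 samples per model) for two prompts.
zero_delta = answer_divergence(["42"] * 8, ["42"] * 8)
delta = answer_divergence(["42"] * 8, ["42"] * 2 + ["17"] * 6)
print(zero_delta, delta)
```

Under this stand-in, a prompt on which both models always give the same answer scores exactly zero, which is the zero-delta case described above.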

If this is right

  • Performance gains appear in on-policy distillation using the original teacher-student pair.
  • The dataset transfers effectively to entirely new model families.
  • Off-policy fine-tuning of models that lack reasoning ability also benefits.
  • Improvement holds across chart, document, and perception-centric tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Measuring divergence this way could guide data curation in supervised fine-tuning beyond distillation settings.
  • An iterative version might regenerate prompts after each training round to chase remaining gaps.
  • The same principle may apply when distilling capabilities other than reasoning, such as perception or generation.

Load-bearing premise

Non-zero divergence in answers between teacher and student points to a real capability gap whose correction through training produces broad generalization rather than narrow memorization.

What would settle it

Compare the downstream benchmark scores of students trained exclusively on zero-divergence prompts versus those trained on high-divergence prompts generated from the same seed data.
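The comparison above only isolates divergence if the two training sets are the same size and drawn from the same pool. A minimal sketch of that setup follows, assuming a divergence score has already been computed per prompt. The function name, the `eps` tolerance, and the scores are hypothetical, and the sketch covers only the data split, not training or evaluation.

```python
import random


def size_matched_split(deltas, seed=0, eps=0.0):
    """Split prompt ids into zero-delta and delta arms of equal size.

    deltas: dict mapping prompt id -> precomputed divergence score.
    eps: scores <= eps are treated as zero-delta (hypothetical tolerance).
    """
    zero = sorted(pid for pid, d in deltas.items() if d <= eps)
    positive = sorted(pid for pid, d in deltas.items() if d > eps)
    n = min(len(zero), len(positive))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return rng.sample(zero, n), rng.sample(positive, n)


# Hypothetical scores for ten prompts derived from the same seed data.
raw = [0, 0, 0, 0.5, 0, 0.25, 0, 0, 1.0, 0]
scores = {f"p{i}": d for i, d in enumerate(raw)}

zero_arm, delta_arm = size_matched_split(scores)
zero_fraction = sum(d == 0 for d in scores.values()) / len(scores)
print(len(zero_arm), len(delta_arm), zero_fraction)
```

Training one student per arm and evaluating both on the same benchmarks would then give the head-to-head comparison described above.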

Figures

Figures reproduced from arXiv: 2605.15532 by Brandon Cui, David Acuna, Hyunwoo Kim, Jaehun Jung, Prithviraj Ammanabrolu, Ximing Lu, Yejin Choi.

Figure 1: Overview of DELTAPROMPTS. We first identify the zero-delta trap: for majority of the prompts in chart & document reasoning datasets, the teacher and student already produce an identical answer distribution. We show that distillation on these zero-delta prompts reaches early saturation in student performance (§2.3). We then synthesize DELTAPROMPTS, leveraging teacher model itself to find the divergence-indu…
Figure 2: (Left) Divergence is critical for effective OPD. When controlling for diversity, delta prompts (Δ > 0) improves with data size while random or zero-delta subsets meet early saturation. (Right) Existing datasets trade off divergence against diversity: filtering existing mixtures for delta prompts (Δ > 0) reduces diversity by up to 29.4% as measured by E-Vendi.
Figure 3: (Left) Divergence distribution of an off-the-shelf chart / document reasoning data mixture.
Figure 4: (Left) Ablation on answer divergence. DELTAPROMPTS leads to the best results even under a data size-controlled setup, while adding additional zero-delta data does not help. (Right) Ablation on consistency-based filtering. Further optimizing for teacher consistency (EasyPrompts) yields no gain, while training on prompts the teacher cannot reliably solve (HardPrompts) degrades performance.
Figure 5: Distribution of answer divergence of the four datasets used in §
Original abstract

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that up to 69% of prompts in standard chart/document reasoning datasets are zero-delta (teacher and student induce identical answer distributions), providing negligible learning signal and causing rapid saturation in distillation. It introduces a staged synthesis pipeline that repurposes seeds to target student failure modes, yielding the 200k DeltaPrompts dataset of high answer-divergence (Δ) prompts. Across on-policy distillation, transfer to new model families, and off-policy fine-tuning, DeltaPrompts produces up to 15% relative gains on 10 benchmarks even atop optimized models such as Qwen3-VL-8B-Thinking.

Significance. If the gains prove attributable to high-Δ selection rather than generic synthetic-data effects, the work offers a practical lever for improving distillation efficiency in VLMs by focusing compute on prompts that expose genuine capability gaps. The multi-setting evaluation (on-policy, transfer, off-policy) strengthens potential applicability, though the result remains conditional on establishing causality of the Δ metric.

major comments (2)
  1. [Experimental Evaluation] No ablation applies the identical staged synthesis pipeline while deliberately selecting low- or zero-Δ prompts for comparison against the reported high-Δ DeltaPrompts. This control is load-bearing for the central claim that non-zero divergence identifies functional gaps whose targeted synthesis yields genuine generalization; without it, gains may stem from data diversity or difficulty independent of Δ. The 15% relative improvement averaged over 10 benchmarks therefore lacks a direct test of the Δ criterion's necessity.
  2. [Methodology and Results] The 69% zero-delta statistic and the reported gains lack accompanying statistical tests, error analysis, variance across runs, or ablation details on Δ computation and thresholding. This weakens confidence that the zero-delta observation and performance lifts are robust rather than sensitive to specific model pairs or prompt sampling.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from an explicit formal definition or small example of answer divergence Δ early on, rather than deferring all details to later sections.
  2. [Figures and Tables] Figure and table captions could more clearly indicate whether error bars or confidence intervals are shown for the benchmark averages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the causal claims around the Δ metric. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

  1. Referee: [Experimental Evaluation] No ablation applies the identical staged synthesis pipeline while deliberately selecting low- or zero-Δ prompts for comparison against the reported high-Δ DeltaPrompts. This control is load-bearing for the central claim that non-zero divergence identifies functional gaps whose targeted synthesis yields genuine generalization; without it, gains may stem from data diversity or difficulty independent of Δ. The 15% relative improvement averaged over 10 benchmarks therefore lacks a direct test of the Δ criterion's necessity.

    Authors: We agree that a controlled ablation using the identical synthesis pipeline but deliberately selecting low- or zero-Δ prompts is necessary to isolate the contribution of the Δ criterion from other factors such as data diversity or difficulty. The original manuscript compares DeltaPrompts primarily against standard off-the-shelf datasets rather than a low-Δ counterpart generated by the same pipeline. To address this directly, we have conducted the requested control experiment and will include the results in the revised Experimental Evaluation section. The low-Δ variant shows substantially smaller gains (roughly 4-6% relative improvement) compared to the high-Δ DeltaPrompts, providing evidence that the performance lifts are tied to targeting answer divergence rather than generic synthetic data properties.
    revision: yes

  2. Referee: [Methodology and Results] The 69% zero-delta statistic and the reported gains lack accompanying statistical tests, error analysis, variance across runs, or ablation details on Δ computation and thresholding. This weakens confidence that the zero-delta observation and performance lifts are robust rather than sensitive to specific model pairs or prompt sampling.

    Authors: We acknowledge that additional statistical rigor and implementation details would improve confidence in the robustness of the findings. In the revised manuscript we will add paired t-tests with p-values for the benchmark improvements, report standard deviations across three independent training runs with different seeds, and include an ablation on Δ thresholding (showing consistent trends for thresholds above 0.05). The 69% zero-delta figure will be accompanied by variance estimates across multiple teacher-student pairs. These elements will be incorporated into the Methodology and Results sections.
    revision: yes
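
As a rough illustration of the paired analysis promised in the second response, the sketch below computes a paired t statistic over per-benchmark scores using only the Python standard library. The standard library has no t-distribution CDF, so the p-value here comes from an exact sign-flip permutation test, which is a substitute and not the paired t-test p-value the authors describe. All scores are hypothetical and are not results from the paper.

```python
import itertools
import math
import statistics


def paired_t_statistic(before, after):
    """t statistic for paired differences (after - before), with n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def sign_flip_p_value(before, after):
    """Exact two-sided sign-flip permutation p-value for the summed paired difference."""
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the differences (2**n cases).
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-benchmark accuracies for ten benchmarks (not from the paper).
baseline = [60.0, 55.0, 70.0, 48.0, 66.0, 52.0, 74.0, 58.0, 63.0, 50.0]
treated = [63.0, 57.0, 71.0, 52.0, 69.0, 55.0, 75.0, 62.0, 65.0, 53.0]

t_stat = paired_t_statistic(baseline, treated)
p_value = sign_flip_p_value(baseline, treated)
print(round(t_stat, 2), p_value)
```

With ten benchmarks the permutation test enumerates 1,024 sign patterns, so its smallest attainable two-sided p-value is 2/1024. That floor is worth keeping in mind when reading significance claims over a small benchmark suite.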

Circularity Check

0 steps flagged

No circularity: derivation rests on external benchmarks and independent evaluation


The paper begins from the standard principle that distillation minimizes distributional divergence, defines Δ as a direct quantification of teacher-student answer mismatch on a given prompt, generates synthetic data targeting high-Δ cases, and reports gains on held-out external benchmarks across on-policy, transfer, and off-policy settings. No equation or claim reduces the reported improvements or the utility of Δ to a tautological re-expression of the input data or a fitted parameter. The synthesis pipeline and Δ selection are explicit design choices whose causal contribution is tested via performance on separate benchmarks rather than by construction. Self-citation is not invoked as load-bearing justification for any uniqueness theorem or ansatz. This is the normal case of an empirical method paper whose central result is falsifiable outside its own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that distributional divergence directly measures learnable capability gaps and that synthetic prompts generated from this signal transfer without introducing new biases. No explicit numerical free parameters are stated. The zero-delta definition and delta metric are derived quantities rather than external axioms.

axioms (1)
  • domain assumption: Distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student.
    Abstract states this as the first-principles return point for prompt valuation.
invented entities (2)
  • zero-delta prompt (no independent evidence)
    purpose: Label for prompts that induce identical answer distributions in teacher and student
    Defined directly from model output agreement; no independent evidence supplied.
  • DeltaPrompts dataset (no independent evidence)
    purpose: Collection of 200k synthetic high-divergence reasoning problems
    Produced by the proposed pipeline; no external validation of its properties beyond reported gains.



