
arxiv: 2606.25432 · v2 · pith:OPEA3CSInew · submitted 2026-06-24 · 💻 cs.LG · cs.AI · cs.CV

Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

Pith reviewed 2026-07-01 06:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords inference efficiency · data curation · vision-language models · brevity · Cost-of-Pass · output length · pretraining data · VLMs

The pith

Training VLMs on curated concise data cuts Cost-of-Pass by 35x at nearly identical accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that output length, not just model size, is a major driver of inference cost and that data curation can induce shorter correct answers. Models trained on the curated MAmmoTH-VL subset learn to answer in fewer tokens, which the authors price as FLOPs per correct answer after using per-model regressions to hold length fixed. This yields a 35-fold Cost-of-Pass reduction versus the most verbose 4B comparator within about one accuracy point, plus a 17.55-point accuracy gain when length is matched. The work shows that generic verbosity adds no accuracy value and that the value of structured reasoning verbosity shrinks with scale.

Core claim

A model trained on concise, correct data learns to answer in fewer tokens and therefore has a lower Cost-of-Pass. On controlled evaluations, the curated 1B-4B models reach 0.41 TFLOPs per correct answer versus 14.58 for the most verbose comparator (Qwen3.5-4B) while staying within roughly one percentage point of its accuracy (0.691 vs 0.704 mean accuracy), and they deliver large matched-length accuracy gains over the uncurated baseline that increase with scale.
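The 35x figure and the accuracy gap can be checked directly from the numbers quoted in the abstract. A minimal arithmetic sketch follows; the variable names are ours, and the figures are copied from the reported results rather than recomputed from raw data.

```python
# Figures quoted in the abstract (not recomputed from raw evaluation data).
curated_tflops_per_correct = 0.41    # curated model
verbose_tflops_per_correct = 14.58   # most verbose 4B comparator (Qwen3.5-4B)
curated_acc, verbose_acc = 0.691, 0.704

# Ratio of FLOPs per correct answer, and the accuracy gap in percentage points.
cost_ratio = verbose_tflops_per_correct / curated_tflops_per_correct
acc_gap_pp = (verbose_acc - curated_acc) * 100

print(f"Cost-of-Pass ratio: {cost_ratio:.1f}x")
print(f"Accuracy gap: {acc_gap_pp:.1f} pp")
```

The ratio comes out slightly above 35, and the accuracy gap is a little over one percentage point, consistent with the abstract's "~1 pp" wording.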

What carries the argument

The VLM curation pipeline applied to the MAmmoTH-VL single-image subset, paired with per-model regression to separate brevity from quality when computing FLOPs per correct answer.
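To make "FLOPs per correct answer" concrete, here is a rough sketch of how such a metric could be computed from per-example evaluation records. It is not the paper's accounting: the `EvalRecord` type, the toy data, and the cost model (about 2 FLOPs per activated parameter per generated token, ignoring prompt, image-encoder, and attention costs) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    output_tokens: int  # tokens generated for one example
    correct: bool       # whether the final answer was graded correct


def tflops_per_correct(records, active_params):
    """FLOPs spent per correct answer, in TFLOPs.

    ASSUMPTION: decode cost is ~2 * active_params FLOPs per generated token.
    Prompt, image-encoder, and attention costs are ignored in this sketch.
    """
    total_flops = sum(2 * active_params * r.output_tokens for r in records)
    n_correct = sum(r.correct for r in records)
    if n_correct == 0:
        return float("inf")  # no correct answers: cost per correct is unbounded
    return total_flops / n_correct / 1e12


# Hypothetical toy data: same accuracy (3 of 4), very different output lengths.
concise = [EvalRecord(20, True), EvalRecord(30, True),
           EvalRecord(25, False), EvalRecord(25, True)]
verbose = [EvalRecord(900, True), EvalRecord(1100, True),
           EvalRecord(1000, False), EvalRecord(1000, True)]

ACTIVE_PARAMS = 4_000_000_000  # hypothetical 4B activated parameters
cost_concise = tflops_per_correct(concise, ACTIVE_PARAMS)
cost_verbose = tflops_per_correct(verbose, ACTIVE_PARAMS)
print(f"concise: {cost_concise:.3f} TFLOPs per correct answer")
print(f"verbose: {cost_verbose:.3f} TFLOPs per correct answer")
print(f"ratio:   {cost_verbose / cost_concise:.1f}x")
```

With accuracy and parameter count held equal, the cost ratio in this toy setup is just the ratio of total generated tokens, which is the sense in which output length acts as its own efficiency lever.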

If this is right

  • Matched-length accuracy improves by 17.55 points over the uncurated baseline and the gain grows from +16.7 pp at 1B to +21.2 pp at 4B.
  • Generic verbosity yields no accuracy benefit at any capability level or scale.
  • The accuracy window where reasoning-structured verbosity still pays for its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B.
  • The concise model solves some examples correctly that the verbose reasoning model misses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the curation effect holds, training pipelines could systematically target output length as an optimizable variable rather than an emergent side-effect of scale.
  • The finding suggests that efficiency comparisons across models should routinely report tokens per correct answer in addition to accuracy and parameter count.
  • The same curation approach might be tested on text-only language models or on multi-turn dialogue tasks to check whether brevity remains advantageous outside single-image VLM settings.

Load-bearing premise

The per-model regression that holds output length fixed truly isolates brevity from quality, and the differences between curated and standard MAmmoTH-VL data are the causal driver of the observed brevity patterns.

What would settle it

An experiment in which models trained on the curated data fail to produce shorter outputs or lower Cost-of-Pass than models trained on the uncurated MAmmoTH-VL data when evaluated on the same 20-task suite.

Original abstract

Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it is precisely the component the standard toolkit leaves untouched. Here, we argue that brevity is the missing inference-efficiency lever, and that pretraining data curation is a practical way to pull it: a model trained on concise, correct data learns to answer in fewer tokens; i.e. it has a lower Cost-of-Pass. We apply our VLM curation pipeline to the MAmmoTH-VL single-image subset, and compare models trained on our curated data, the standard MAmmoTH-VL data, and external open-weight frontier VLMs. On a controlled 20-evaluation set and 14 VLMs at 1B-4B activated parameters, we hold output length fixed with a per-model regression, separating brevity from quality, and price models in FLOPs per correct answer. Curation buys a 35x Cost-of-Pass advantage over the most verbose 4B comparator (Qwen3.5-4B) within $\sim$1 pp of accuracy (0.41 vs 14.58 TFLOPs per correct answer; 0.691 vs 0.704 mean accuracy). Curation also buys a +17.55-percentage-point matched-length accuracy gain over the uncurated baseline that grows with model scale (from +16.7 pp at 1B to +21.2 pp at 4B). This brevity improvement concedes no quality: generic verbosity buys no accuracy at any capability or scale, and the window where reasoning-structured verbosity still earns its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B. Per example, the concise model even reaches correct answers the verbose reasoning model misses, marking reasoning as a distinct curation target rather than something brevity gives up. Inference efficiency in this regime is a tokens-per-correct problem, and brevity is the lever that targets it directly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that curating concise, correct pretraining data for VLMs induces shorter outputs without quality loss, yielding large gains in inference efficiency via lower Cost-of-Pass (FLOPs per correct answer). On a 20-task evaluation set across 14 VLMs (1B–4B activated parameters), a per-model regression holds output length fixed to isolate brevity from quality; this produces a 35× Cost-of-Pass advantage over the most verbose 4B comparator (0.41 vs. 14.58 TFLOPs per correct answer at 0.691 vs. 0.704 mean accuracy) and a +17.55 pp matched-length accuracy gain over the uncurated MAmmoTH-VL baseline that increases with scale.

Significance. If the empirical results hold, the work identifies data curation for concision as a practical, orthogonal lever for inference efficiency that directly targets token count rather than per-token cost. The concrete 35× Cost-of-Pass figure, scale-dependent accuracy gains, and comparisons across capability groups and external frontier models provide a falsifiable, quantitative case for brevity as an efficiency target. The use of a regression control to separate length from quality is a methodological strength that supports the causal attribution to curation.

major comments (2)
  1. [Abstract and experimental setup (regression control)] The per-model regression that holds output length fixed (Abstract; experimental setup) is load-bearing for both the 35× Cost-of-Pass metric and the +17.55 pp matched-length accuracy claim. The manuscript does not specify the functional form, whether scale or capability-group interactions are modeled, or report any diagnostics (R², residuals, or sensitivity to specification). If the length–accuracy relationship is nonlinear or heterogeneous, the isolation of brevity effects fails and the headline numbers become unreliable.
  2. [Results (Cost-of-Pass and matched-length comparisons)] Table or figure reporting the 14-VLM Cost-of-Pass and matched-length accuracy results: the 35× advantage and the claim that “generic verbosity buys no accuracy at any capability or scale” rest on the regression-adjusted numbers. Without the regression specification, fitted coefficients, or cross-validation details, it is impossible to assess whether the control adequately separates brevity from quality or whether omitted variables (dataset differences beyond length) drive the patterns.
minor comments (2)
  1. [Abstract] Abstract and results: reported accuracies and TFLOPs figures lack error bars, standard errors, or confidence intervals, making it difficult to judge whether the ~1 pp accuracy difference or the +17.55 pp gain is statistically distinguishable from noise.
  2. [Data curation pipeline] Dataset section: no summary statistics (token-length distributions, curation criteria, or overlap with standard MAmmoTH-VL) are provided for the curated versus baseline data, hindering assessment of whether brevity is the causal driver.
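The first minor comment asks whether a gap of about one percentage point is distinguishable from noise. One common way to answer that is a paired percentile bootstrap over evaluation examples, sketched below. The per-example outcomes are hypothetical, the function name is ours, and nothing here reflects the paper's actual sample sizes or results.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on shared examples."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A/B pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical outcomes on 100 shared examples (1 = correct):
# 60 both correct, 10 only A correct, 9 only B correct, 21 both wrong.
model_a = [1] * 60 + [1] * 10 + [0] * 9 + [0] * 21   # accuracy 0.70
model_b = [1] * 60 + [0] * 10 + [1] * 9 + [0] * 21   # accuracy 0.69

point = (sum(model_a) - sum(model_b)) / len(model_a)
lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"accuracy difference: {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

In this toy case the interval straddles zero, so a one-point gap on 100 examples would not be separable from noise; whether that holds for the paper depends on the real per-example data and evaluation-set sizes.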

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the importance of transparency in our regression analysis. The comments correctly identify that additional methodological details are needed to fully support the Cost-of-Pass and matched-length accuracy claims. We will revise the manuscript to address these points.

Point-by-point responses
  1. Referee: [Abstract and experimental setup (regression control)] The per-model regression that holds output length fixed (Abstract; experimental setup) is load-bearing for both the 35× Cost-of-Pass metric and the +17.55 pp matched-length accuracy claim. The manuscript does not specify the functional form, whether scale or capability-group interactions are modeled, or report any diagnostics (R², residuals, or sensitivity to specification). If the length–accuracy relationship is nonlinear or heterogeneous, the isolation of brevity effects fails and the headline numbers become unreliable.

    Authors: We agree that the regression details are essential for validating the isolation of brevity effects. The current manuscript provides only a high-level description. In revision, we will add a dedicated subsection in the experimental setup that specifies: the functional form (ordinary least-squares linear regression of accuracy on output length, estimated separately for each of the 14 models), inclusion of scale and capability-group interactions (we will report both baseline and interacted specifications), and full diagnostics (R², residual plots, and sensitivity to quadratic terms or alternative controls). This will allow direct assessment of whether the length–accuracy relationship is adequately captured.

    Revision: yes

  2. Referee: [Results (Cost-of-Pass and matched-length comparisons)] Table or figure reporting the 14-VLM Cost-of-Pass and matched-length accuracy results: the 35× advantage and the claim that “generic verbosity buys no accuracy at any capability or scale” rest on the regression-adjusted numbers. Without the regression specification, fitted coefficients, or cross-validation details, it is impossible to assess whether the control adequately separates brevity from quality or whether omitted variables (dataset differences beyond length) drive the patterns.

    Authors: We acknowledge that the results section relies on regression-adjusted quantities without presenting the underlying model details. We will introduce a new appendix table (or expanded main-text table) that reports, for each model, the fitted coefficients (intercept and slope), R², and any cross-validation or robustness metrics. We will also discuss how per-model estimation mitigates omitted-variable concerns arising from capability differences, while noting that dataset differences beyond length are controlled by the shared evaluation set. These additions will make the 35× Cost-of-Pass and +17.55 pp claims fully auditable.

    Revision: yes
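The simulated rebuttal describes a per-model ordinary least-squares fit of accuracy on output length. A minimal sketch of what that could look like is given below, including the R² diagnostic the referee asks for. The unit of observation (per-task mean length and accuracy), the reference length, and all data are assumptions for illustration; the manuscript's actual specification is not given in the text above.

```python
def fit_ols(x, y):
    """Least-squares fit y = intercept + slope * x; returns (intercept, slope, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


def accuracy_at_length(intercept, slope, ref_length):
    """Predicted accuracy for one model at a shared reference output length."""
    return intercept + slope * ref_length


# Hypothetical per-task (mean output tokens, accuracy) pairs for two models.
per_model = {
    "curated":   ([20, 40, 60, 80],     [0.60, 0.64, 0.68, 0.72]),
    "uncurated": ([100, 200, 300, 400], [0.40, 0.45, 0.50, 0.55]),
}
REF_LENGTH = 100  # hypothetical shared length at which models are compared

matched = {}
for name, (lengths, accs) in per_model.items():
    intercept, slope, r2 = fit_ols(lengths, accs)
    matched[name] = accuracy_at_length(intercept, slope, REF_LENGTH)
    print(f"{name}: slope={slope:.4f}, R2={r2:.3f}, "
          f"acc@{REF_LENGTH}={matched[name]:.3f}")

gap_pp = (matched["curated"] - matched["uncurated"]) * 100
print(f"matched-length gap: {gap_pp:+.1f} pp")
```

Note that the curated model's prediction at 100 tokens lies outside its observed length range here, so it is an extrapolation. That is one reason the referee's request for functional-form checks and residual diagnostics matters: a linear fit can look clean within range and still mislead at the matched length.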

Circularity Check

0 steps flagged

No circularity; empirical comparisons with statistical control

Full rationale

The paper's central claims rest on training separate models on curated versus standard MAmmoTH-VL data, then comparing accuracy and efficiency metrics across 14 VLMs. The per-model regression is used only as a post-hoc statistical control to match output length when reporting accuracy differences and Cost-of-Pass; it does not define any target quantity in terms of itself or rename a fitted parameter as an independent prediction. No equations, self-citations, or uniqueness theorems appear in the provided text that reduce the reported gains to inputs by construction. This is the normal case of an empirical study whose results are falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the curation pipeline and the regression control; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: The per-model regression accurately separates brevity effects from quality differences across models.
    Abstract states that output length is held fixed with a per-model regression to separate brevity from quality.


