Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation
Pith reviewed 2026-07-01 06:28 UTC · model grok-4.3
The pith
Training VLMs on curated concise data cuts Cost-of-Pass by 35x at nearly identical accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A model trained on concise, correct data learns to answer in fewer tokens and therefore has a lower Cost-of-Pass. On controlled evaluations, the curated 1B-4B models reach 0.41 TFLOPs per correct answer versus 14.58 for the most verbose 4B comparator (Qwen3.5-4B), while staying within roughly one percentage point of its accuracy (0.691 vs. 0.704 mean accuracy). They also deliver large matched-length accuracy gains over the uncurated baseline, and those gains increase with scale.
What carries the argument
The VLM curation pipeline applied to the MAmmoTH-VL single-image subset, paired with per-model regression to separate brevity from quality when computing FLOPs per correct answer.
If this is right
- Matched-length accuracy improves by 17.55 percentage points over the uncurated baseline, and the gain grows from +16.7 pp at 1B to +21.2 pp at 4B.
- Generic verbosity yields no accuracy benefit at any capability level or scale.
- The accuracy window where reasoning-structured verbosity still pays for its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B.
- The concise model solves some examples correctly that the verbose reasoning model misses.
Where Pith is reading between the lines
- If the curation effect holds, training pipelines could systematically target output length as an optimizable variable rather than an emergent side-effect of scale.
- The finding suggests that efficiency comparisons across models should routinely report tokens per correct answer in addition to accuracy and parameter count.
- The same curation approach might be tested on text-only language models or on multi-turn dialogue tasks to check whether brevity remains advantageous outside single-image VLM settings.
Load-bearing premise
The per-model regression that holds output length fixed truly isolates brevity from quality, and the differences between curated and standard MAmmoTH-VL data are the causal driver of the observed brevity patterns.
What would settle it
An experiment in which models trained on the curated data fail to produce shorter outputs or lower Cost-of-Pass than models trained on the uncurated MAmmoTH-VL data when evaluated on the same 20-task suite.
Original abstract
Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it is precisely the component the standard toolkit leaves untouched. Here, we argue that brevity is the missing inference-efficiency lever, and that pretraining data curation is a practical way to pull it: a model trained on concise, correct data learns to answer in fewer tokens; i.e. it has a lower Cost-of-Pass. We apply our VLM curation pipeline to the MAmmoTH-VL single-image subset, and compare models trained on our curated data, the standard MAmmoTH-VL data, and external open-weight frontier VLMs. On a controlled 20-evaluation set and 14 VLMs at 1B-4B activated parameters, we hold output length fixed with a per-model regression, separating brevity from quality, and price models in FLOPs per correct answer. Curation buys a 35x Cost-of-Pass advantage over the most verbose 4B comparator (Qwen3.5-4B) within $\sim$1 pp of accuracy (0.41 vs 14.58 TFLOPs per correct answer; 0.691 vs 0.704 mean accuracy). Curation also buys a +17.55-percentage-point matched-length accuracy gain over the uncurated baseline that grows with model scale (from +16.7 pp at 1B to +21.2 pp at 4B). This brevity improvement concedes no quality: generic verbosity buys no accuracy at any capability or scale, and the window where reasoning-structured verbosity still earns its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B. Per example, the concise model even reaches correct answers the verbose reasoning model misses, marking reasoning as a distinct curation target rather than something brevity gives up. Inference efficiency in this regime is a tokens-per-correct problem, and brevity is the lever that targets it directly.
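The headline figures in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the rule of roughly 2 FLOPs per active parameter per generated token is a common approximation that the abstract does not confirm as the paper's own accounting. The only values taken from the source are the reported Cost-of-Pass and accuracy numbers.

```python
# Illustrative sketch; helper names are hypothetical and the 2 * params
# FLOPs-per-token rule is an assumed approximation, not the paper's stated method.

def flops_per_answer(active_params: float, mean_output_tokens: float) -> float:
    """Approximate decode FLOPs for one answer (about 2 FLOPs per active parameter per token)."""
    return 2.0 * active_params * mean_output_tokens


def cost_of_pass(flops_per_ans: float, accuracy: float) -> float:
    """Expected FLOPs spent per correct answer: mean FLOPs per answer divided by accuracy."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return flops_per_ans / accuracy


# Figures reported in the abstract (TFLOPs per correct answer, mean accuracy).
curated_cop, verbose_cop = 0.41, 14.58
curated_acc, verbose_acc = 0.691, 0.704

ratio = verbose_cop / curated_cop
gap_pp = (verbose_acc - curated_acc) * 100

print(f"Cost-of-Pass ratio: {ratio:.1f}x")
print(f"Accuracy gap: {gap_pp:.1f} pp")
```

Dividing mean FLOPs per answer by accuracy is equivalent to dividing total FLOPs by the number of correct answers, which is why a model can lower its Cost-of-Pass either by answering correctly more often or by spending fewer tokens per answer.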
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that curating concise, correct pretraining data for VLMs induces shorter outputs without quality loss, yielding large gains in inference efficiency via lower Cost-of-Pass (FLOPs per correct answer). On a 20-task evaluation set across 14 VLMs (1B–4B activated parameters), a per-model regression holds output length fixed to isolate brevity from quality; this produces a 35× Cost-of-Pass advantage over the most verbose 4B comparator (0.41 vs. 14.58 TFLOPs per correct answer at 0.691 vs. 0.704 mean accuracy) and a +17.55 pp matched-length accuracy gain over the uncurated MAmmoTH-VL baseline that increases with scale.
Significance. If the empirical results hold, the work identifies data curation for concision as a practical, orthogonal lever for inference efficiency that directly targets token count rather than per-token cost. The concrete 35× Cost-of-Pass figure, scale-dependent accuracy gains, and comparisons across capability groups and external frontier models provide a falsifiable, quantitative case for brevity as an efficiency target. The use of a regression control to separate length from quality is a methodological strength that supports the causal attribution to curation.
major comments (2)
- [Abstract and experimental setup (regression control)] The per-model regression that holds output length fixed (Abstract; experimental setup) is load-bearing for both the 35× Cost-of-Pass metric and the +17.55 pp matched-length accuracy claim. The manuscript does not specify the functional form, whether scale or capability-group interactions are modeled, or report any diagnostics (R², residuals, or sensitivity to specification). If the length–accuracy relationship is nonlinear or heterogeneous, the isolation of brevity effects fails and the headline numbers become unreliable.
- [Results (Cost-of-Pass and matched-length comparisons)] Table or figure reporting the 14-VLM Cost-of-Pass and matched-length accuracy results: the 35× advantage and the claim that “generic verbosity buys no accuracy at any capability or scale” rest on the regression-adjusted numbers. Without the regression specification, fitted coefficients, or cross-validation details, it is impossible to assess whether the control adequately separates brevity from quality or whether omitted variables (dataset differences beyond length) drive the patterns.
minor comments (2)
- [Abstract] Abstract and results: reported accuracies and TFLOPs figures lack error bars, standard errors, or confidence intervals, making it difficult to judge whether the ~1 pp accuracy difference or the +17.55 pp gain is statistically distinguishable from noise.
- [Data curation pipeline] Dataset section: no summary statistics (token-length distributions, curation criteria, or overlap with standard MAmmoTH-VL) are provided for the curated versus baseline data, hindering assessment of whether brevity is the causal driver.
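One way to supply the uncertainty estimates requested in the first minor comment is a paired bootstrap over evaluation tasks. The sketch below is an assumption about how such an interval could be computed, not a description of anything in the manuscript; the function name and the per-task accuracies are synthetic and hypothetical.

```python
import numpy as np


def paired_bootstrap_ci(acc_a, acc_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task accuracy difference (a - b)."""
    a = np.asarray(acc_a, dtype=float)
    b = np.asarray(acc_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models must be scored on the same tasks")
    diffs = a - b
    rng = np.random.default_rng(seed)
    # Resample task indices with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)


# Synthetic per-task accuracies for a 20-task suite (not the paper's data).
rng = np.random.default_rng(1)
verbose = rng.uniform(0.5, 0.9, size=20)
concise = verbose - rng.normal(0.01, 0.03, size=20)

mean_diff, lo, hi = paired_bootstrap_ci(verbose, concise)
print(f"mean gap: {mean_diff * 100:.2f} pp, 95% CI: [{lo * 100:.2f}, {hi * 100:.2f}] pp")
```

If the interval for a gap of about 1 pp comfortably spans zero, the "nearly identical accuracy" reading is supported; a similar interval around the +17.55 pp gain would show whether it is clearly separated from noise.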
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting the importance of transparency in our regression analysis. The comments correctly identify that additional methodological details are needed to fully support the Cost-of-Pass and matched-length accuracy claims. We will revise the manuscript to address these points.
- Referee: [Abstract and experimental setup (regression control)] The per-model regression that holds output length fixed (Abstract; experimental setup) is load-bearing for both the 35× Cost-of-Pass metric and the +17.55 pp matched-length accuracy claim. The manuscript does not specify the functional form, whether scale or capability-group interactions are modeled, or report any diagnostics (R², residuals, or sensitivity to specification). If the length–accuracy relationship is nonlinear or heterogeneous, the isolation of brevity effects fails and the headline numbers become unreliable.
  Authors: We agree that the regression details are essential for validating the isolation of brevity effects. The current manuscript provides only a high-level description. In revision, we will add a dedicated subsection in the experimental setup that specifies: the functional form (ordinary least-squares linear regression of accuracy on output length, estimated separately for each of the 14 models), inclusion of scale and capability-group interactions (we will report both baseline and interacted specifications), and full diagnostics (R², residual plots, and sensitivity to quadratic terms or alternative controls). This will allow direct assessment of whether the length–accuracy relationship is adequately captured.
  Revision: yes
- Referee: [Results (Cost-of-Pass and matched-length comparisons)] Table or figure reporting the 14-VLM Cost-of-Pass and matched-length accuracy results: the 35× advantage and the claim that “generic verbosity buys no accuracy at any capability or scale” rest on the regression-adjusted numbers. Without the regression specification, fitted coefficients, or cross-validation details, it is impossible to assess whether the control adequately separates brevity from quality or whether omitted variables (dataset differences beyond length) drive the patterns.
  Authors: We acknowledge that the results section relies on regression-adjusted quantities without presenting the underlying model details. We will introduce a new appendix table (or expanded main-text table) that reports, for each model: the regression intercept and slope, R², fitted coefficients, and any cross-validation or robustness metrics. We will also discuss how per-model estimation mitigates omitted-variable concerns arising from capability differences, while noting that dataset differences beyond length are controlled by the shared evaluation set. These additions will make the 35× Cost-of-Pass and +17.55 pp claims fully auditable.
  Revision: yes
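The functional form named in the first response (ordinary least-squares regression of accuracy on output length, fitted separately per model) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data are synthetic, and the choice of 300 tokens as the matched length is arbitrary. It is not the authors' specification, which the manuscript does not yet report.

```python
import numpy as np


def fit_length_control(lengths, accuracies):
    """OLS fit of accuracy on output length for one model; returns (intercept, slope, r2)."""
    x = np.asarray(lengths, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(x, y, deg=1)
    pred = intercept + slope * x
    ss_res = float(np.sum((y - pred) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return float(intercept), float(slope), r2


def accuracy_at_length(intercept, slope, target_length):
    """Predicted accuracy at a common (matched) output length."""
    return intercept + slope * target_length


# Synthetic per-task (mean output tokens, accuracy) pairs for two hypothetical models.
concise_len, concise_acc = [50, 100, 150, 200], [0.70, 0.71, 0.72, 0.73]
verbose_len, verbose_acc = [400, 600, 800, 1000], [0.60, 0.61, 0.62, 0.63]

c_int, c_slope, c_r2 = fit_length_control(concise_len, concise_acc)
v_int, v_slope, v_r2 = fit_length_control(verbose_len, verbose_acc)

matched = 300  # arbitrary common length, outside both observed ranges
gap = accuracy_at_length(c_int, c_slope, matched) - accuracy_at_length(v_int, v_slope, matched)
print(f"R^2: concise={c_r2:.3f}, verbose={v_r2:.3f}")
print(f"matched-length accuracy gap at {matched} tokens: {gap * 100:.1f} pp")
```

In this toy setup the matched length falls outside the range either model actually produces, so the comparison is an extrapolation of each fitted line. That is the situation where the referee's concern about nonlinearity matters most, and where reported R² values and quadratic-term sensitivity checks would be most informative.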
Circularity Check
No circularity; empirical comparisons with statistical control
The paper's central claims rest on training separate models on curated versus standard MAmmoTH-VL data, then comparing accuracy and efficiency metrics across 14 VLMs. The per-model regression is used only as a post-hoc statistical control to match output length when reporting accuracy differences and Cost-of-Pass; it does not define any target quantity in terms of itself or rename a fitted parameter as an independent prediction. No equations, self-citations, or uniqueness theorems appear in the provided text that reduce the reported gains to inputs by construction. This is the normal case of an empirical study whose results are falsifiable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the per-model regression accurately separates brevity effects from quality differences across models.