Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

DatologyAI: Matthew L. Leavitt, Siddharth Joshi, Haoli Yin, Rishabh Adiga, Haakon Mongstad, Alvin Deng, David Schwab, Bogdan Gaza, Ari Morcos

arXiv: 2606.25432 · v2 · pith:OPEA3CSInew · submitted 2026-06-24 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-07-01 06:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords inference efficiency · data curation · vision-language models · brevity · Cost-of-Pass · output length · pretraining data · VLMs

The pith

Training VLMs on curated concise data cuts Cost-of-Pass by 35x at nearly identical accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that output length, not just model size, is a major driver of inference cost and that data curation can induce shorter correct answers. Models trained on the curated MAmmoTH-VL single-image subset learn to answer in fewer tokens. The authors price each model in FLOPs per correct answer (Cost-of-Pass) and use per-model regressions to hold output length fixed, which separates brevity from quality. The result is a 35-fold Cost-of-Pass reduction versus the most verbose 4B comparator (Qwen3.5-4B) at roughly one accuracy point lower (0.691 vs 0.704 mean accuracy), plus a 17.55-point matched-length accuracy gain over the uncurated baseline. The paper also reports that generic verbosity adds no accuracy at any capability level or scale, and that the benefit of reasoning-structured verbosity shrinks with scale.

Core claim

A model trained on concise, correct data learns to answer in fewer tokens and therefore has a lower Cost-of-Pass. On a controlled 20-evaluation set spanning 14 VLMs at 1B-4B activated parameters, curation reaches 0.41 TFLOPs per correct answer versus 14.58 for the most verbose 4B comparator (Qwen3.5-4B), at 0.691 versus 0.704 mean accuracy (a gap of about 1.3 pp). It also delivers a +17.55 pp matched-length accuracy gain over the uncurated baseline, growing from +16.7 pp at 1B to +21.2 pp at 4B.
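The headline ratio can be checked with simple arithmetic. The snippet below only recomputes the figures quoted in the abstract; it measures nothing itself, and the variable names are illustrative.

```python
# Figures quoted in the paper's abstract (not measured by this snippet).
curated_tflops_per_correct = 0.41
verbose_tflops_per_correct = 14.58   # most verbose 4B comparator (Qwen3.5-4B)
curated_acc, verbose_acc = 0.691, 0.704

# Cost-of-Pass ratio: how many times more compute the verbose model
# spends per correct answer.
cost_ratio = verbose_tflops_per_correct / curated_tflops_per_correct

# Accuracy gap in percentage points.
acc_gap_pp = (verbose_acc - curated_acc) * 100

print(f"Cost-of-Pass ratio: {cost_ratio:.1f}x")   # Cost-of-Pass ratio: 35.6x
print(f"Accuracy gap: {acc_gap_pp:.1f} pp")       # Accuracy gap: 1.3 pp
```

The ratio comes out at about 35.6, consistent with the reported 35x, and the accuracy gap is about 1.3 pp, which the abstract rounds to roughly 1 pp.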

What carries the argument

The VLM curation pipeline applied to the MAmmoTH-VL single-image subset, paired with per-model regression to separate brevity from quality when computing FLOPs per correct answer.

If this is right

Matched-length accuracy improves by 17.55 points over the uncurated baseline and the gain grows from +16.7 pp at 1B to +21.2 pp at 4B.
Generic verbosity yields no accuracy benefit at any capability level or scale.
The accuracy window where reasoning-structured verbosity still pays for its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B.
The concise model solves some examples correctly that the verbose reasoning model misses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the curation effect holds, training pipelines could systematically target output length as an optimizable variable rather than an emergent side-effect of scale.
The finding suggests that efficiency comparisons across models should routinely report tokens per correct answer in addition to accuracy and parameter count.
The same curation approach might be tested on text-only language models or on multi-turn dialogue tasks to check whether brevity remains advantageous outside single-image VLM settings.
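A minimal sketch of what reporting tokens per correct answer alongside accuracy and parameter count could look like. The `EvalSummary` type, both function names, and all numbers are hypothetical. The FLOPs estimate assumes roughly 2 FLOPs per activated parameter per generated token and ignores prompt and image prefill, so it is not the paper's accounting.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Hypothetical per-model summary over an evaluation suite."""
    name: str
    active_params: float        # activated parameters
    mean_output_tokens: float   # mean generated tokens per example
    accuracy: float             # fraction correct, in (0, 1]


def tokens_per_correct(s: EvalSummary) -> float:
    """Generated tokens spent per correct answer."""
    return s.mean_output_tokens / s.accuracy


def approx_tflops_per_correct(s: EvalSummary) -> float:
    """Rough decode-only estimate.

    Assumption: ~2 FLOPs per activated parameter per generated token;
    prefill and attention-over-context costs are ignored.
    """
    flops_per_example = 2.0 * s.active_params * s.mean_output_tokens
    return flops_per_example / s.accuracy / 1e12


# Made-up models, for illustration only.
models = [
    EvalSummary("concise-2B", 2e9, 50.0, 0.60),
    EvalSummary("verbose-2B", 2e9, 1000.0, 0.62),
]
for m in models:
    print(
        f"{m.name}: acc={m.accuracy:.2f}, "
        f"tokens/correct={tokens_per_correct(m):.0f}, "
        f"approx TFLOPs/correct={approx_tflops_per_correct(m):.2f}"
    )
```

In this toy case the two models differ by two accuracy points, but the verbose one spends roughly twenty times as many tokens per correct answer. An accuracy-only table would hide that gap.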

Load-bearing premise

The per-model regression that holds output length fixed truly isolates brevity from quality, and the differences between curated and standard MAmmoTH-VL data are the causal driver of the observed brevity patterns.
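To make the premise concrete, here is a sketch of one way a per-model length control could work. It assumes an ordinary least-squares line of accuracy on mean output length, with one point per evaluation task. The paper's actual specification is not given in the abstract, so the functional form, the unit of observation, the function names, and the synthetic data are all assumptions.

```python
import numpy as np


def fit_line(lengths, accuracies):
    """OLS fit of accuracy = intercept + slope * length for one model."""
    x = np.asarray(lengths, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    return float(slope), float(intercept)


def accuracy_at(length, slope, intercept):
    """Fitted accuracy at a given output length."""
    return intercept + slope * length


def matched_length_gap(points_a, points_b, ref_length):
    """Fitted accuracy of model A minus model B at a shared reference length."""
    fit_a = fit_line(*points_a)
    fit_b = fit_line(*points_b)
    return accuracy_at(ref_length, *fit_a) - accuracy_at(ref_length, *fit_b)


# Synthetic per-task data for two hypothetical models (not from the paper).
rng = np.random.default_rng(0)
len_a = rng.uniform(20, 120, size=20)     # concise model
len_b = rng.uniform(200, 900, size=20)    # verbose model
acc_a = 0.70 + 0.0001 * len_a + rng.normal(0, 0.02, size=20)
acc_b = 0.50 + 0.0001 * len_b + rng.normal(0, 0.02, size=20)

# Caveat: 150 tokens lies outside both observed length ranges, so each
# fitted value is an extrapolation and depends heavily on the linear form.
gap = matched_length_gap((len_a, acc_a), (len_b, acc_b), ref_length=150.0)
print(f"matched-length gap at 150 tokens: {gap * 100:+.1f} pp")
```

The sketch shows why the premise is load-bearing. When two models rarely produce outputs of similar length, the matched-length comparison leans on the fitted lines rather than on observed data.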

What would settle it

An experiment in which models trained on the curated data fail to produce shorter outputs or lower Cost-of-Pass than models trained on the uncurated MAmmoTH-VL data when evaluated on the same 20-task suite.

Original abstract

Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it is precisely the component the standard toolkit leaves untouched. Here, we argue that brevity is the missing inference-efficiency lever, and that pretraining data curation is a practical way to pull it: a model trained on concise, correct data learns to answer in fewer tokens; i.e. it has a lower Cost-of-Pass. We apply our VLM curation pipeline to the MAmmoTH-VL single-image subset, and compare models trained on our curated data, the standard MAmmoTH-VL data, and external open-weight frontier VLMs. On a controlled 20-evaluation set and 14 VLMs at 1B-4B activated parameters, we hold output length fixed with a per-model regression, separating brevity from quality, and price models in FLOPs per correct answer. Curation buys a 35x Cost-of-Pass advantage over the most verbose 4B comparator (Qwen3.5-4B) within $\sim$1 pp of accuracy (0.41 vs 14.58 TFLOPs per correct answer; 0.691 vs 0.704 mean accuracy). Curation also buys a +17.55-percentage-point matched-length accuracy gain over the uncurated baseline that grows with model scale (from +16.7 pp at 1B to +21.2 pp at 4B). This brevity improvement concedes no quality: generic verbosity buys no accuracy at any capability or scale, and the window where reasoning-structured verbosity still earns its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B. Per example, the concise model even reaches correct answers the verbose reasoning model misses, marking reasoning as a distinct curation target rather than something brevity gives up. Inference efficiency in this regime is a tokens-per-correct problem, and brevity is the lever that targets it directly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Curation for concise answers cuts Cost-of-Pass sharply in small VLMs while holding accuracy, but the per-model regression is the part that needs the closest look.

The paper's central result is that training on data curated for brevity produces VLMs that reach correct answers in far fewer tokens than either the uncurated baseline or most open models of similar size. On their 20-task set the curated 4B model shows roughly 35 times lower FLOPs per correct answer than Qwen3.5-4B at nearly identical accuracy, and it also beats the uncurated MAmmoTH-VL run by 17+ points when length is controlled.

What stands out is the direct attack on output length as an efficiency lever rather than another round of distillation or quantization. The matched-length accuracy comparison and the observation that generic verbosity adds nothing at any scale are both useful. The scale trend (gains from curation grow from 1B to 4B) and the note that the concise model sometimes solves examples the verbose one misses are concrete enough to be worth testing.

The soft spot is the per-model regression used to hold output length fixed. If the length-accuracy relationship is nonlinear or interacts with capability group, the separation of brevity from quality weakens and the headline Cost-of-Pass numbers become less reliable. The abstract gives the 35x and +17.55 pp figures but the full paper needs to show the regression specification, any robustness checks, and basic dataset statistics before the causal claim about curation is fully convincing.

This is aimed at researchers and engineers working on inference cost for 1-4B VLMs who already care about data curation. It is worth a serious referee because the empirical setup is straightforward to replicate and the practical question it raises is real, even if the regression control will probably need tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that curating concise, correct pretraining data for VLMs induces shorter outputs without quality loss, yielding large gains in inference efficiency via lower Cost-of-Pass (FLOPs per correct answer). On a 20-task evaluation set across 14 VLMs (1B–4B activated parameters), a per-model regression holds output length fixed to isolate brevity from quality; this produces a 35× Cost-of-Pass advantage over the most verbose 4B comparator (0.41 vs. 14.58 TFLOPs per correct answer at 0.691 vs. 0.704 mean accuracy) and a +17.55 pp matched-length accuracy gain over the uncurated MAmmoTH-VL baseline that increases with scale.

Significance. If the empirical results hold, the work identifies data curation for concision as a practical, orthogonal lever for inference efficiency that directly targets token count rather than per-token cost. The concrete 35× Cost-of-Pass figure, scale-dependent accuracy gains, and comparisons across capability groups and external frontier models provide a falsifiable, quantitative case for brevity as an efficiency target. The use of a regression control to separate length from quality is a methodological strength that supports the causal attribution to curation.

major comments (2)

[Abstract and experimental setup (regression control)] The per-model regression that holds output length fixed (Abstract; experimental setup) is load-bearing for both the 35× Cost-of-Pass metric and the +17.55 pp matched-length accuracy claim. The manuscript does not specify the functional form, whether scale or capability-group interactions are modeled, or report any diagnostics (R², residuals, or sensitivity to specification). If the length–accuracy relationship is nonlinear or heterogeneous, the isolation of brevity effects fails and the headline numbers become unreliable.
[Results (Cost-of-Pass and matched-length comparisons)] Table or figure reporting the 14-VLM Cost-of-Pass and matched-length accuracy results: the 35× advantage and the claim that “generic verbosity buys no accuracy at any capability or scale” rest on the regression-adjusted numbers. Without the regression specification, fitted coefficients, or cross-validation details, it is impossible to assess whether the control adequately separates brevity from quality or whether omitted variables (dataset differences beyond length) drive the patterns.

minor comments (2)

[Abstract] Abstract and results: reported accuracies and TFLOPs figures lack error bars, standard errors, or confidence intervals, making it difficult to judge whether the ~1 pp accuracy difference or the +17.55 pp gain is statistically distinguishable from noise.
[Data curation pipeline] Dataset section: no summary statistics (token-length distributions, curation criteria, or overlap with standard MAmmoTH-VL) are provided for the curated versus baseline data, hindering assessment of whether brevity is the causal driver.
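One way the missing uncertainty estimates could be produced is a paired percentile bootstrap over evaluation tasks. This is a sketch under that assumption, not the authors' procedure; the function name and the per-task scores are hypothetical.

```python
import numpy as np


def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling tasks.

    `a` and `b` are per-task accuracies for two models on the same tasks,
    so each resample draws task indices once and applies them to both.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("a and b must hold scores for the same tasks")
    rng = np.random.default_rng(seed)
    n = a.size
    idx = rng.integers(0, n, size=(n_boot, n))   # resampled task indices
    diffs = (a[idx] - b[idx]).mean(axis=1)       # one mean gap per resample
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(a.mean() - b.mean()), float(lo), float(hi)


# Made-up per-task accuracies on a 20-task suite, for illustration only.
rng = np.random.default_rng(1)
verbose = np.clip(rng.normal(0.70, 0.10, size=20), 0, 1)
concise = np.clip(verbose - 0.013 + rng.normal(0, 0.03, size=20), 0, 1)

point, lo, hi = bootstrap_mean_diff_ci(verbose, concise)
print(f"gap = {point * 100:.1f} pp, 95% CI [{lo * 100:.1f}, {hi * 100:.1f}] pp")
```

If the interval for a gap of about 1 pp includes zero, the two models cannot be distinguished on accuracy at this sample size. That is the question the first minor comment raises.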

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the importance of transparency in our regression analysis. The comments correctly identify that additional methodological details are needed to fully support the Cost-of-Pass and matched-length accuracy claims. We will revise the manuscript to address these points.

Referee: [Abstract and experimental setup (regression control)] The per-model regression that holds output length fixed (Abstract; experimental setup) is load-bearing for both the 35× Cost-of-Pass metric and the +17.55 pp matched-length accuracy claim. The manuscript does not specify the functional form, whether scale or capability-group interactions are modeled, or report any diagnostics (R², residuals, or sensitivity to specification). If the length–accuracy relationship is nonlinear or heterogeneous, the isolation of brevity effects fails and the headline numbers become unreliable.

Authors: We agree that the regression details are essential for validating the isolation of brevity effects. The current manuscript provides only a high-level description. In revision, we will add a dedicated subsection in the experimental setup that specifies: the functional form (ordinary least-squares linear regression of accuracy on output length, estimated separately for each of the 14 models), inclusion of scale and capability-group interactions (we will report both baseline and interacted specifications), and full diagnostics (R², residual plots, and sensitivity to quadratic terms or alternative controls). This will allow direct assessment of whether the length–accuracy relationship is adequately captured. Revision: yes.
Referee: [Results (Cost-of-Pass and matched-length comparisons)] Table or figure reporting the 14-VLM Cost-of-Pass and matched-length accuracy results: the 35× advantage and the claim that “generic verbosity buys no accuracy at any capability or scale” rest on the regression-adjusted numbers. Without the regression specification, fitted coefficients, or cross-validation details, it is impossible to assess whether the control adequately separates brevity from quality or whether omitted variables (dataset differences beyond length) drive the patterns.

Authors: We acknowledge that the results section relies on regression-adjusted quantities without presenting the underlying model details. We will introduce a new appendix table (or expanded main-text table) that reports, for each model, the fitted regression intercept and slope, R², and any cross-validation or robustness metrics. We will also discuss how per-model estimation mitigates omitted-variable concerns arising from capability differences, while noting that dataset differences beyond length are controlled by the shared evaluation set. These additions will make the 35× Cost-of-Pass and +17.55 pp claims fully auditable. Revision: yes.
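The diagnostics promised in the responses could be as simple as comparing a linear and a quadratic fit for each model. The sketch below assumes per-task (length, accuracy) points; the function names and data are hypothetical. In-sample R² cannot decrease when a term is added, so the size of the improvement matters, not its sign.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination for fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def compare_specs(lengths, accuracies):
    """In-sample R² of linear vs quadratic fits of accuracy on length."""
    x = np.asarray(lengths, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    out = {}
    for name, deg in (("linear", 1), ("quadratic", 2)):
        coeffs = np.polyfit(x, y, deg)
        out[name] = r_squared(y, np.polyval(coeffs, x))
    return out


# Synthetic example with a saturating length-accuracy curve (illustrative).
rng = np.random.default_rng(0)
lengths = rng.uniform(20, 800, size=20)
accs = 0.75 - 0.3 * np.exp(-lengths / 150.0) + rng.normal(0, 0.01, size=20)

scores = compare_specs(lengths, accs)
print({k: round(v, 3) for k, v in scores.items()})
```

A large jump in R² from the linear to the quadratic fit would suggest that a straight line misrepresents the length-accuracy relationship for that model. A held-out or adjusted comparison would be the more careful check.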

Circularity Check

0 steps flagged

No circularity; empirical comparisons with statistical control

The paper's central claims rest on training separate models on curated versus standard MAmmoTH-VL data, then comparing accuracy and efficiency metrics across 14 VLMs. The per-model regression is used only as a post-hoc statistical control to match output length when reporting accuracy differences and Cost-of-Pass; it does not define any target quantity in terms of itself or rename a fitted parameter as an independent prediction. No equations, self-citations, or uniqueness theorems appear in the provided text that reduce the reported gains to inputs by construction. This is the normal case of an empirical study whose results are falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the curation pipeline and the regression control; no free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: The per-model regression accurately separates brevity effects from quality differences across models.
The abstract states that output length is held fixed with a per-model regression to separate brevity from quality.

pith-pipeline@v0.9.1-grok · 5965 in / 1285 out tokens · 32209 ms · 2026-07-01T06:28:05.026510+00:00 · methodology

