Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
Pith reviewed 2026-05-19 20:42 UTC · model grok-4.3
The pith
Large reasoning models show the same alignment with human cognitive cost regardless of the effort budget set at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant with Bayes Factors leaning toward the null and mean alignment numerically near-identical across conditions. The effort parameter sets an upper budget on generation length rather than driving real-time allocation, indicating the allocation policy is crystallized at training time. In arithmetic tasks, token allocation tracks fine-grained, format-dependent human difficulty patterns, improving with model scale.
What carries the argument
The effort parameter acting as a ceiling on generation length, which does not modulate the cognitive cost alignment.
If this is right
- Alignment between model generation lengths and human reaction times stays the same at all effort levels.
- The reasoning allocation strategy is determined during training rather than adjusted during use.
- Model scale improves how well token use matches human difficulty in arithmetic tasks.
- LRM problem-solving follows a compiled pattern stable under inference changes.
Where Pith is reading between the lines
- If the effort setting only limits length, very high complexity tasks may suffer from premature stopping at low effort settings.
- Training procedures could be optimized specifically to strengthen this alignment since inference tweaks have little impact.
- Similar invariance might be tested in non-reasoning tasks or with different model families to see how general the finding is.
Load-bearing premise
That different effort levels actually change the model's reasoning allocation process instead of simply capping how much it can output.
What would settle it
Demonstrating a change in alignment when effort is varied in a setup where the parameter is shown to control dynamic reasoning steps rather than output length.
Original abstract
Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.
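The cost measure behind this alignment is the length of the reasoning trace; a passage quoted elsewhere on this page describes it as the token count within the `<think>` delimiters. A minimal sketch of that count follows, assuming the trace is wrapped in literal `<think>...</think>` tags and using whitespace splitting as a stand-in for the model's real tokenizer. The function name `reasoning_token_count` is hypothetical.

```python
import re

# Non-greedy match so several separate <think> spans are counted individually.
THINK_SPAN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def reasoning_token_count(output: str) -> int:
    """Count whitespace-separated tokens inside all <think>...</think> spans.

    Whitespace splitting is an assumption made for illustration; a real
    pipeline would use the model's own tokenizer.
    """
    spans = THINK_SPAN.findall(output)
    return sum(len(span.split()) for span in spans)


example = "<think>7 times 8 is 56, plus 4 is 60</think>The answer is 60."
print(reasoning_token_count(example))  # 9
```

Text outside the delimiters, such as the final answer, does not contribute to the count.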
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that cognitive-cost alignment between humans and Large Reasoning Models (LRMs) is invariant to inference-time effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within- and cross-task alignments remain stable, with Bayes factors favoring the null and numerically near-identical means. A manipulation check shows the effort parameter functions only as an upper bound on generation length rather than a dynamic allocator; the authors interpret this as evidence that the allocation policy is crystallized at training time. Token-allocation patterns also track fine-grained human difficulty in arithmetic contrasts, with scale improving the match.
Significance. If the central invariance result holds under a stronger manipulation, the work supplies statistical support (Bayes factors, numerical invariance) for distinguishing training-time compiled policies from online modulation in LRMs. The focus on format-dependent difficulty tracking and the explicit manipulation check are positive features that could help adjudicate debates on whether CoT length reflects genuine computational structure.
major comments (1)
- Abstract / Manipulation Check: The central claim that invariance demonstrates a training-time (compiled) achievement rather than an online policy rests on the effort levels constituting a valid perturbation of real-time reasoning allocation. The reported manipulation check indicates the parameter only imposes an upper budget on output length. This makes the observed null result expected under a length-capped regime and limits the design's ability to distinguish a policy that simply ignores this control from one fixed at training. Evidence that higher-effort settings produce measurably deeper traces, additional intermediate steps, branching differences, or accuracy gains beyond the length cap is needed to support the crystallized-vs-online distinction.
minor comments (2)
- Methods: Provide explicit definitions of the six reasoning tasks, data-exclusion rules, and error-handling procedures so that the Bayes-factor calculations and alignment metrics can be independently verified.
- Results: Report effect sizes alongside Bayes factors and clarify whether the numerical invariance holds after correcting for multiple comparisons across within-task and cross-task analyses.
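One way to report an effect size next to a Bayes factor is sketched below, using the BIC approximation BF01 ≈ exp((BIC_alt − BIC_null) / 2) for a one-way comparison of alignment scores across effort levels, together with eta-squared. This is an assumed analysis for illustration only: the page does not state how the authors computed their Bayes factors, the scores are hypothetical toy values, and treating the observations as independent ignores the repeated-measures structure across tasks.

```python
import math


def bf01_and_eta_squared(groups):
    """BIC-approximate Bayes factor for the null, plus eta-squared.

    groups: dict mapping effort level -> list of alignment scores
            (for example, one correlation per task).
    """
    values = [v for g in groups.values() for v in g]
    n = len(values)
    grand_mean = sum(values) / n
    # Null model: one shared mean. Alternative: one mean per effort level.
    sse_null = sum((v - grand_mean) ** 2 for v in values)
    sse_alt = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    k_null, k_alt = 1, len(groups)  # number of mean parameters in each model
    bic_null = n * math.log(sse_null / n) + k_null * math.log(n)
    bic_alt = n * math.log(sse_alt / n) + k_alt * math.log(n)
    bf01 = math.exp((bic_alt - bic_null) / 2)  # values > 1 favor the null
    eta_sq = 1 - sse_alt / sse_null  # share of variance explained by effort
    return bf01, eta_sq


# Hypothetical toy alignment scores (NOT from the paper): six tasks per level.
toy_alignment = {
    "low": [0.50, 0.60, 0.42, 0.55, 0.48, 0.58],
    "medium": [0.51, 0.59, 0.43, 0.56, 0.47, 0.57],
    "high": [0.50, 0.61, 0.43, 0.54, 0.49, 0.58],
}

bf01, eta_sq = bf01_and_eta_squared(toy_alignment)
print(f"BF01 = {bf01:.1f}, eta^2 = {eta_sq:.4f}")
```

Reporting both numbers lets a reader see whether support for the null comes with a negligible effect, or whether a moderate effect is being masked by a small sample.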
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address the major comment below.
Referee: Abstract / Manipulation Check: The central claim that invariance demonstrates a training-time (compiled) achievement rather than an online policy rests on the effort levels constituting a valid perturbation of real-time reasoning allocation. The reported manipulation check indicates the parameter only imposes an upper budget on output length. This makes the observed null result expected under a length-capped regime and limits the design's ability to distinguish a policy that simply ignores this control from one fixed at training. Evidence that higher-effort settings produce measurably deeper traces, additional intermediate steps, branching differences, or accuracy gains beyond the length cap is needed to support the crystallized-vs-online distinction.
Authors: We agree that the manipulation check establishes the effort parameter as an upper bound on generation length rather than a real-time allocator of reasoning depth. This is precisely why the invariance result is informative for our interpretation. If the model maintained an online policy capable of modulating reasoning effort in response to the control signal, we would expect at least some adjustment in token-allocation patterns, alignment strength, or performance when additional budget is made available, even if capped. The fact that alignment remains stable, token usage continues to track fine-grained human difficulty contrasts, and no systematic changes in reasoning structure appear across settings indicates that the core allocation policy does not respond to the inference-time cue. The model does process the parameter (by respecting the length limit), which weakens the alternative that it simply ignores the control entirely. We acknowledge that stronger evidence of deeper traces or branching under higher settings would further bolster the distinction, but such evidence is not observed precisely because the manipulation does not elicit online adjustment. In revision we will expand the discussion to clarify these limits and the evidential basis for the compiled-policy account.
Revision: partial
Circularity Check
No significant circularity; empirical statistical comparisons are self-contained
The paper reports an empirical study measuring alignment of cognitive costs between humans and LRMs across effort levels, tasks, and model scales. Central results rest on observed data processed via Bayes factors and mean alignment statistics, with a manipulation check based on measured generation lengths. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work appear in the abstract or described chain. The invariance claim follows directly from statistical tests on independent measurements rather than any renaming or circular reduction, satisfying the criteria for a non-circular empirical analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Bayes factors can be interpreted as evidence for the null hypothesis of invariance
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "token count within the <think> delimiters, serving as the proxy for inference-time computational cost"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Effort Invariance null (H3) predicts that alignment will remain stable across effort conditions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.