
arxiv: 2605.16938 · v1 · pith:J3EEN5IPnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI · q-bio.NC

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Pith reviewed 2026-05-19 20:42 UTC · model grok-4.3

classification: cs.CL · cs.AI · q-bio.NC
keywords: large reasoning models · cognitive cost alignment · chain-of-thought · inference-time effort · human reaction times · training-time policy · reasoning budget

The pith

Large reasoning models keep the same alignment with human cognitive cost regardless of the effort budget set at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors examine whether adjusting the inference-time reasoning effort level changes how well the length of model-generated reasoning traces matches human reaction times on the same tasks. They find that the alignment holds steady across three effort levels, two model sizes, and six tasks, with Bayes factors leaning toward no difference. A manipulation check shows that the effort setting mainly caps the maximum length of the output rather than steering how the model allocates reasoning in the moment. This suggests the alignment is fixed during training rather than adjustable on the fly. If correct, attempts to tune model reasoning dynamically through the effort setting may not affect the underlying cost matching.

Core claim

Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant with Bayes Factors leaning toward the null and mean alignment numerically near-identical across conditions. The effort parameter sets an upper budget on generation length rather than driving real-time allocation, indicating the allocation policy is crystallized at training time. In arithmetic tasks, token allocation tracks fine-grained, format-dependent human difficulty patterns, improving with model scale.

What carries the argument

The effort parameter acts as a ceiling on generation length and does not modulate cognitive cost alignment.
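The "ceiling, not dial" reading can be illustrated with a toy model. The sketch below is not the paper's analysis: the token counts, reaction times, and budget values are all hypothetical, and `apply_ceiling` is an illustrative name. It assumes a fixed allocation policy whose output is only truncated by the budget, and shows that the log-log correlation with human reaction times is unchanged whenever the cap does not bind.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def apply_ceiling(intrinsic_tokens, budget):
    """Ceiling model (assumption): the budget only truncates a fixed allocation."""
    return [min(t, budget) for t in intrinsic_tokens]

def alignment(tokens, human_rt):
    """Pearson r between log token counts and log human reaction times."""
    return pearson([math.log(t) for t in tokens],
                   [math.log(rt) for rt in human_rt])

# Hypothetical per-item data, not taken from the paper.
human_rt = [0.8, 1.1, 1.9, 3.2, 5.5, 9.0]    # seconds
intrinsic = [40, 65, 120, 210, 380, 700]     # tokens the fixed policy would emit

# Hypothetical budgets that all exceed the longest intrinsic trace.
budgets = {"low": 1000, "medium": 4000, "high": 16000}
alignment_by_effort = {
    name: alignment(apply_ceiling(intrinsic, b), human_rt)
    for name, b in budgets.items()
}

# A budget that does bind truncates the hard items and degrades alignment.
alignment_binding = alignment(apply_ceiling(intrinsic, 100), human_rt)
```

In this toy setting, invariance across effort levels follows directly from the cap never binding, which is the concern the referee report raises about what the null result can show.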

If this is right

  • Alignment between model generation lengths and human reaction times stays the same at all effort levels.
  • The reasoning allocation strategy is determined during training rather than adjusted during use.
  • Model scale improves how well token use matches human difficulty in arithmetic tasks.
  • LRM problem-solving follows a compiled pattern that is stable under inference-time changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the effort setting only limits length, very high complexity tasks may suffer from premature stopping at low effort settings.
  • Training procedures could be optimized specifically to strengthen this alignment since inference tweaks have little impact.
  • Similar invariance might be tested in non-reasoning tasks or with different model families to see how general the finding is.

Load-bearing premise

The interpretation assumes that varying the effort level actually perturbs the model's reasoning allocation process rather than simply capping how much it can output.

What would settle it

Demonstrating a change in alignment when effort is varied in a setup where the parameter is shown to control dynamic reasoning steps rather than output length.

Figures

Figures reproduced from arXiv: 2605.16938 by Tianhong Wang, Yueqing Hu.

Figure 2. Sample chain-of-thought produced by GPT-OSS
Figure 3. Within-task alignment across reasoning effort conditions. Each panel shows mean reasoning token counts (purple bars, top axis) and accuracy (green bars, bottom axis) for low, medium, and high effort in (A) GPT-OSS-20B and (B) GPT-OSS-120B. Pearson 𝑟 values (log tokens vs. log human RT) are annotated on each bar. All correlations are significant (𝑝 < .01) and remain highly stable across effort levels within…
Figure 4. Cross-task cognitive cost alignment by reasoning effort. Each panel plots task-level mean human RTs (blue, left axis) and model reasoning tokens (red, right axis) on log scales. Both axes are calibrated independently; the close covariation of the two curves reflects alignment. The cross-task Pearson correlation r̄ and its permutation 𝑝-value are inset. (A) Low, (B) medium, and (C) high effort yield virtual…
Figure 5. Bayesian paired comparisons of within-task alignment (𝑟) across effort levels. Each box represents the distribution of Fisher 𝑧-transformed 𝑟 values across 𝑛 = 12 (model × task) observations. Grey lines connect paired observations. Bayes Factors (BF10) from JZS-prior paired 𝑡-tests fall in the inconclusive range (1/3 < BF10 < 1) but lean toward the null in all three contrasts. …
Figure 6. Arithmetic complexity contrasts across reasoning effort conditions. Each row of forest plots contrasts a harder vs. easier condition within four complexity dimensions. Blue points show model token log-means; red points show human RT log-means. (A) low, (B) medium, (C) high effort. Direction counts (harder → more tokens, 𝑡 < 0) are reported in each panel. Human baseline: 8/8 correct directions, 6/8 stati…
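The captions describe alignment as a Pearson correlation between log token counts and log human reaction times, with a permutation p-value for the cross-task case. A minimal sketch of that kind of computation follows. The task-level means are hypothetical, and the choice of an exact, one-sided permutation test over all orderings is an assumption; the source does not say how the paper's permutation test was run.

```python
import math
from itertools import permutations

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def exact_permutation_p(xs, ys):
    """One-sided exact p-value: share of all orderings of ys whose
    correlation with xs is at least as large as the observed one."""
    observed = pearson(xs, ys)
    hits = 0
    total = 0
    for perm in permutations(ys):
        total += 1
        # Small tolerance guards against floating-point ties.
        if pearson(xs, perm) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical task-level means for six tasks (not the paper's data).
task_mean_rt = [0.9, 1.4, 2.3, 4.1, 6.8, 11.5]    # seconds
task_mean_tokens = [55, 90, 150, 310, 520, 980]   # reasoning tokens

log_rt = [math.log(v) for v in task_mean_rt]
log_tokens = [math.log(v) for v in task_mean_tokens]

r_cross = pearson(log_rt, log_tokens)
p_cross = exact_permutation_p(log_rt, log_tokens)
```

With six tasks there are only 720 orderings, so full enumeration is cheap and the smallest attainable one-sided p-value is 1/720. For larger task sets a random sample of permutations would be the usual substitute.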
Original abstract

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that cognitive-cost alignment between humans and Large Reasoning Models (LRMs) is invariant to inference-time effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within- and cross-task alignments remain stable, with Bayes factors leaning toward the null and numerically near-identical means. A manipulation check shows the effort parameter functions only as an upper bound on generation length rather than a dynamic allocator; the authors interpret this as evidence that the allocation policy is crystallized at training time. Token-allocation patterns also track fine-grained human difficulty in arithmetic contrasts, with scale improving the match.

Significance. If the central invariance result holds under a stronger manipulation, the work supplies statistical support (Bayes factors, numerical invariance) for distinguishing training-time compiled policies from online modulation in LRMs. The focus on format-dependent difficulty tracking and the explicit manipulation check are positive features that could help adjudicate debates on whether CoT length reflects genuine computational structure.

major comments (1)
  1. Abstract / Manipulation Check: The central claim that invariance demonstrates a training-time (compiled) achievement rather than an online policy rests on the effort levels constituting a valid perturbation of real-time reasoning allocation. The reported manipulation check indicates the parameter only imposes an upper budget on output length. This makes the observed null result expected under a length-capped regime and limits the design's ability to distinguish a policy that simply ignores this control from one fixed at training. Evidence that higher-effort settings produce measurably deeper traces, additional intermediate steps, branching differences, or accuracy gains beyond the length cap is needed to support the crystallized-vs-online distinction.
minor comments (2)
  1. Methods: Provide explicit definitions of the six reasoning tasks, data-exclusion rules, and error-handling procedures so that the Bayes-factor calculations and alignment metrics can be independently verified.
  2. Results: Report effect sizes alongside Bayes factors and clarify whether the numerical invariance holds after correcting for multiple comparisons across within-task and cross-task analyses.
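One way to act on the second minor comment is to report a standardized paired effect size next to each Bayes factor. The sketch below assumes Cohen's d_z on Fisher z-transformed correlations as the effect size; that choice, the function names, and the sample values are illustrative and do not come from the paper.

```python
import math
import statistics

def fisher_z(r):
    """Fisher z-transform of a correlation coefficient (-1 < r < 1)."""
    return math.atanh(r)

def paired_dz(a, b):
    """Cohen's d_z for paired samples: mean difference over the
    sample standard deviation of the differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Hypothetical within-task alignment values for the same model-task pairs
# under two effort levels (not the paper's data).
r_low = [0.62, 0.55, 0.71, 0.48]
r_high = [0.60, 0.57, 0.70, 0.50]

z_low = [fisher_z(r) for r in r_low]
z_high = [fisher_z(r) for r in r_high]

effect = paired_dz(z_low, z_high)
```

A d_z near zero alongside a Bayes factor below 1 would make the "numerically near-identical" claim easier to evaluate than the Bayes factor alone.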

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address the major comment below.

Point-by-point responses
  1. Referee: Abstract / Manipulation Check: The central claim that invariance demonstrates a training-time (compiled) achievement rather than an online policy rests on the effort levels constituting a valid perturbation of real-time reasoning allocation. The reported manipulation check indicates the parameter only imposes an upper budget on output length. This makes the observed null result expected under a length-capped regime and limits the design's ability to distinguish a policy that simply ignores this control from one fixed at training. Evidence that higher-effort settings produce measurably deeper traces, additional intermediate steps, branching differences, or accuracy gains beyond the length cap is needed to support the crystallized-vs-online distinction.

    Authors: We agree that the manipulation check establishes the effort parameter as an upper bound on generation length rather than a real-time allocator of reasoning depth. This is precisely why the invariance result is informative for our interpretation. If the model maintained an online policy capable of modulating reasoning effort in response to the control signal, we would expect at least some adjustment in token-allocation patterns, alignment strength, or performance when additional budget is made available, even if capped. The fact that alignment remains stable, token usage continues to track fine-grained human difficulty contrasts, and no systematic changes in reasoning structure appear across settings indicates that the core allocation policy does not respond to the inference-time cue. The model does process the parameter (by respecting the length limit), which weakens the alternative that it simply ignores the control entirely. We acknowledge that stronger evidence of deeper traces or branching under higher settings would further bolster the distinction, but such evidence is not observed precisely because the manipulation does not elicit online adjustment. In revision we will expand the discussion to clarify these limits and the evidential basis for the compiled-policy account.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical statistical comparisons are self-contained

Full rationale

The paper reports an empirical study measuring alignment of cognitive costs between humans and LRMs across effort levels, tasks, and model scales. Central results rest on observed data processed via Bayes factors and mean alignment statistics, with a manipulation check based on measured generation lengths. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work appear in the abstract or described chain. The invariance claim follows directly from statistical tests on independent measurements rather than any renaming or circular reduction, satisfying the criteria for a non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical interpretation of Bayes factors and empirical measurements of alignment; no free parameters, new axioms, or invented entities are introduced beyond domain-standard assumptions about reaction time and token count as proxies.

axioms (1)
  • [standard math] Bayes factors can be interpreted as evidence for the null hypothesis of invariance.
    Invoked to conclude alignment remains unchanged across effort conditions.
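How much weight this axiom can bear depends on where the Bayes factors fall. The Figure 5 caption places them in an inconclusive range (1/3 < BF10 < 1) that leans toward the null. The helper below sketches that reading; `classify_bf10` is a hypothetical name, the upper cutoff of 3 is the conventional mirror of 1/3 and is assumed here, and the handling of values exactly on a boundary is arbitrary.

```python
def classify_bf10(bf10):
    """Map a Bayes factor BF10 to a coarse evidence label.

    Cutoffs of 1/3 and 3 are conventional and assumed here;
    only the 1/3 < BF10 < 1 band is stated in the source.
    """
    if bf10 <= 0:
        raise ValueError("BF10 must be positive")
    if bf10 < 1 / 3:
        return "evidence for the null"
    if bf10 < 1:
        return "inconclusive, leaning toward the null"
    if bf10 == 1:
        return "no preference"
    if bf10 <= 3:
        return "inconclusive, leaning toward the alternative"
    return "evidence for the alternative"

# Hypothetical Bayes factors for three effort contrasts (not the paper's values).
example_contrasts = {"low vs medium": 0.45, "medium vs high": 0.60, "low vs high": 0.38}
labels = {name: classify_bf10(bf) for name, bf in example_contrasts.items()}
```

Under these cutoffs, values in the reported band count as leaning toward invariance without establishing it, which is why the referee asks for effect sizes as well.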



