Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Tianhong Wang; Yueqing Hu

arxiv: 2605.16938 · v1 · pith:J3EEN5IPnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI · q-bio.NC

Pith reviewed 2026-05-19 20:42 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · q-bio.NC

keywords large reasoning models · cognitive cost alignment · chain-of-thought · inference-time effort · human reaction times · training-time policy · reasoning budget

The pith

Large reasoning models keep the same alignment with human cognitive cost no matter what effort budget is set at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors examine whether adjusting the reasoning effort level at inference time changes how well the length of model-generated reasoning traces matches human reaction times on the same tasks. They find that the alignment holds steady across three effort levels, two model sizes, and six tasks, with Bayes factors that lean toward no difference. A manipulation check on the effort setting reveals that it mainly caps the maximum length of the output rather than guiding how the model allocates reasoning in the moment. This points to the alignment being fixed during training instead of being adjustable on the fly. If correct, attempts to tune model reasoning dynamically through the effort setting may not affect the underlying cost matching.

Core claim

Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant with Bayes Factors leaning toward the null and mean alignment numerically near-identical across conditions. The effort parameter sets an upper budget on generation length rather than driving real-time allocation, indicating the allocation policy is crystallized at training time. In arithmetic tasks, token allocation tracks fine-grained, format-dependent human difficulty patterns, improving with model scale.
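The alignment metric named in the figure captions, Pearson r between log token counts and log human reaction times, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names and the five data points are invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_log_alignment(token_counts, reaction_times_ms):
    """Alignment score: Pearson r between log tokens and log human RT."""
    log_tokens = [math.log(t) for t in token_counts]
    log_rts = [math.log(rt) for rt in reaction_times_ms]
    return pearson_r(log_tokens, log_rts)


# Hypothetical per-item data for one task, invented for illustration only.
tokens = [120, 180, 260, 410, 650]
human_rt_ms = [900, 1100, 1500, 2100, 3200]
r = log_log_alignment(tokens, human_rt_ms)
print(f"within-task alignment r = {r:.3f}")
```

Because both variables are log-transformed, any power-law relation between token count and reaction time appears as a perfect linear correlation, which is one reason to compare on log scales.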

What carries the argument

The effort parameter acting as a ceiling on generation length, which does not modulate the cognitive cost alignment.
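A toy simulation can make the ceiling reading concrete. The sketch below assumes a single fixed policy whose desired trace length depends only on item difficulty, and an effort setting that does nothing except cap the length. Every number in it (the caps, the slopes, the noise levels) is invented for illustration and is not taken from the paper.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(caps, n_items=200, seed=0):
    """Apply different length caps to one fixed allocation policy."""
    rng = random.Random(seed)
    difficulty = [rng.uniform(0.0, 4.0) for _ in range(n_items)]
    # Fixed "trained" policy: desired length depends on difficulty only,
    # never on the effort setting (the assumption being illustrated).
    desired = [50.0 * math.exp(0.8 * d + rng.gauss(0.0, 0.2)) for d in difficulty]
    # Hypothetical human log reaction times driven by the same difficulty.
    log_rt = [6.5 + 0.5 * d + rng.gauss(0.0, 0.15) for d in difficulty]

    results = {}
    for level, cap in caps.items():
        observed = [min(t, cap) for t in desired]  # effort acts as a ceiling
        results[level] = {
            "r": pearson_r([math.log(t) for t in observed], log_rt),
            "share_capped": sum(t >= cap for t in desired) / n_items,
            "max_tokens": max(observed),
        }
    return results


# Hypothetical token budgets per effort level.
CAPS = {"low": 1000.0, "medium": 8000.0, "high": 32000.0}
results = simulate(CAPS)
for level, stats in results.items():
    print(level, round(stats["r"], 3), stats["share_capped"])
```

Under these assumptions the correlation is identical whenever the cap does not bind and shifts only slightly when a small share of items is truncated. The sketch shows what a pure ceiling would predict; it does not show that the real effort parameter behaves this way.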

If this is right

Alignment between model generation lengths and human reaction times stays the same at all effort levels.
The reasoning allocation strategy is determined during training rather than adjusted during use.
Model scale improves how well token use matches human difficulty in arithmetic tasks.
LRM problem-solving follows a compiled pattern stable under inference changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the effort setting only limits length, very high complexity tasks may suffer from premature stopping at low effort settings.
Training procedures could be optimized specifically to strengthen this alignment since inference tweaks have little impact.
Similar invariance might be tested in non-reasoning tasks or with different model families to see how general the finding is.

Load-bearing premise

That different effort levels actually change the model's reasoning allocation process instead of simply capping how much it can output.

What would settle it

Demonstrating a change in alignment when effort is varied in a setup where the parameter is shown to control dynamic reasoning steps rather than output length.

Figures

Figures reproduced from arXiv: 2605.16938 by Tianhong Wang, Yueqing Hu.

**Figure 3.** Within-task alignment across reasoning effort conditions. Each panel shows mean reasoning token counts (purple bars, top axis) and accuracy (green bars, bottom axis) for low, medium, and high effort in (A) GPT-OSS-20B and (B) GPT-OSS-120B. Pearson 𝑟 values (log tokens vs. log human RT) are annotated on each bar. All correlations are significant (𝑝 < .01) and remain highly stable across effort levels within…

**Figure 4.** Cross-task cognitive cost alignment by reasoning effort. Each panel plots task-level mean human RTs (blue, left axis) and model reasoning tokens (red, right axis) on log scales. Both axes are calibrated independently; the close covariation of the two curves reflects alignment. The cross-task Pearson correlation 𝑟¯ and its permutation 𝑝-value are inset. (A) Low, (B) medium, and (C) high effort yield virtual…

**Figure 5.** Bayesian paired comparisons of within-task alignment (𝑟) across effort levels. Each box represents the distribution of Fisher 𝑧-transformed 𝑟 values across 𝑛 = 12 (model × task) observations. Grey lines connect paired observations. Bayes Factors (BF10) from JZS-prior paired 𝑡-tests fall in the inconclusive range (1/3 < BF10 < 1) but lean toward the null in all three contrasts. …

**Figure 6.** Arithmetic complexity contrasts across reasoning effort conditions. Each row of forest plots contrasts a harder vs. easier condition within four complexity dimensions. Blue points show model token log-means; red points show human RT log-means. (A) low, (B) medium, (C) high effort. Direction counts (harder → more tokens, 𝑡 < 0) are reported in each panel. Human baseline: 8/8 correct directions, 6/8 stati…
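The Figure 4 caption mentions a cross-task Pearson correlation with a permutation p-value. With only six tasks, all 720 orderings can be enumerated, so an exact test is feasible. The sketch below assumes a one-sided test for a positive correlation (the page does not say which variant the authors used) and uses invented task-level log means.

```python
import itertools
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exact_permutation_test(xs, ys):
    """Return (observed r, one-sided exact permutation p-value).

    The p-value is the share of all reorderings of ys whose correlation
    with xs is at least as large as the observed one.
    """
    observed = pearson_r(xs, ys)
    at_least_as_large = 0
    total = 0
    for shuffled in itertools.permutations(ys):
        total += 1
        # Small tolerance so the identity ordering always counts itself.
        if pearson_r(xs, shuffled) >= observed - 1e-12:
            at_least_as_large += 1
    return observed, at_least_as_large / total


# Hypothetical task-level log means for six tasks, invented for illustration.
task_log_rt = [6.8, 7.1, 7.5, 7.9, 8.4, 9.0]
task_log_tokens = [4.9, 5.3, 5.6, 6.2, 6.6, 7.3]
r_obs, p_value = exact_permutation_test(task_log_rt, task_log_tokens)
print(f"cross-task r = {r_obs:.3f}, permutation p = {p_value:.4f}")
```

With six tasks the smallest attainable one-sided p-value is 1/720, reached when the two series share the same rank order, as in this invented example. That floor is worth keeping in mind when reading cross-task p-values from so few tasks.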

Original abstract

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Alignment between LRM token use and human times stays flat across effort levels, but the effort setting turns out to be mostly a length cap rather than a test of dynamic allocation.

The main thing to know is that this paper finds the alignment between how long LRMs take to reason and human reaction times stays the same no matter the effort level set at inference time. They conclude this points to the policy being set during training rather than adjusted on the fly.

They do a few things well. The experiment covers two model sizes, three effort levels, and six tasks, with Bayes factors backing up the lack of change in alignment scores. The manipulation check is honest about what the parameter actually controls, which helps ground the interpretation that the allocation is fixed at training time. They also show that token use follows detailed human difficulty patterns in arithmetic, with bigger models doing a better job matching those patterns. These are useful empirical points in the debate over compiled versus online reasoning in AI.

The main soft spot is in how much the design can distinguish the two accounts. If the effort levels only limit maximum generation length without forcing more intermediate steps or different branching, then seeing the same alignment across conditions is what you would expect even for an online system that does not use this particular knob. The abstract does not provide details on whether higher effort produces measurably richer traces or accuracy gains beyond the length limit. That makes the support for a purely training-time achievement a bit thinner than it first appears.

Overall this is for researchers tracking how well LRMs capture human-like reasoning costs and those testing claims about when those alignments form. It brings targeted data to an existing question without overclaiming a new framework. I would bring it to a reading group for the methods discussion and the scale effects. It deserves peer review because the measurements are clear enough to be worth referee input on the stats and the interpretation of the effort manipulation.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that cognitive-cost alignment between humans and Large Reasoning Models (LRMs) is invariant to inference-time effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within- and cross-task alignments remain stable, with Bayes factors leaning toward the null and numerically near-identical means. A manipulation check shows the effort parameter functions as an upper bound on generation length rather than a dynamic allocator; the authors interpret this as evidence that the allocation policy is crystallized at training time. Token-allocation patterns also track fine-grained human difficulty in arithmetic contrasts, with scale improving the match.

Significance. If the central invariance result holds under a stronger manipulation, the work supplies statistical support (Bayes factors, numerical invariance) for distinguishing training-time compiled policies from online modulation in LRMs. The focus on format-dependent difficulty tracking and the explicit manipulation check are positive features that could help adjudicate debates on whether CoT length reflects genuine computational structure.

major comments (1)

Abstract / Manipulation Check: The central claim that invariance demonstrates a training-time (compiled) achievement rather than an online policy rests on the effort levels constituting a valid perturbation of real-time reasoning allocation. The reported manipulation check indicates the parameter only imposes an upper budget on output length. This makes the observed null result expected under a length-capped regime and limits the design's ability to distinguish a policy that simply ignores this control from one fixed at training. Evidence that higher-effort settings produce measurably deeper traces, additional intermediate steps, branching differences, or accuracy gains beyond the length cap is needed to support the crystallized-vs-online distinction.

minor comments (2)

Methods: Provide explicit definitions of the six reasoning tasks, data-exclusion rules, and error-handling procedures so that the Bayes-factor calculations and alignment metrics can be independently verified.
Results: Report effect sizes alongside Bayes factors and clarify whether the numerical invariance holds after correcting for multiple comparisons across within-task and cross-task analyses.
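The effect-size request above can be illustrated with the paired design described in the Figure 5 caption: Fisher z-transform each within-task r, take paired differences across effort levels, and report a standardized effect alongside the t statistic. This is a sketch under stated assumptions, not the authors' analysis: the twelve r values per condition are invented, and Cohen's d_z is one reasonable choice of effect size rather than one the paper is known to report.

```python
import math
import statistics


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient (inverse tanh)."""
    return math.atanh(r)


def paired_summary(r_cond_a, r_cond_b):
    """Paired comparison of correlations on the Fisher z scale.

    Returns the mean difference, Cohen's d_z (mean / SD of differences),
    and the paired t statistic with n - 1 degrees of freedom.
    """
    if len(r_cond_a) != len(r_cond_b):
        raise ValueError("conditions must have the same number of observations")
    diffs = [fisher_z(a) - fisher_z(b) for a, b in zip(r_cond_a, r_cond_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    if sd_diff == 0.0:
        raise ValueError("all paired differences are identical")
    d_z = mean_diff / sd_diff
    return {"n": n, "mean_diff": mean_diff, "d_z": d_z, "t": d_z * math.sqrt(n)}


# Hypothetical within-task r values for 12 (model x task) cells.
r_low = [0.52, 0.61, 0.48, 0.70, 0.66, 0.55, 0.59, 0.63, 0.50, 0.72, 0.68, 0.57]
r_high = [0.50, 0.63, 0.47, 0.71, 0.64, 0.56, 0.60, 0.61, 0.52, 0.70, 0.69, 0.58]
summary = paired_summary(r_low, r_high)
print(summary)
```

Reporting d_z next to each Bayes factor would let readers judge whether "near-identical" means a negligible standardized difference or merely an imprecisely estimated one.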

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address the major comment below.

Referee: Abstract / Manipulation Check: The central claim that invariance demonstrates a training-time (compiled) achievement rather than an online policy rests on the effort levels constituting a valid perturbation of real-time reasoning allocation. The reported manipulation check indicates the parameter only imposes an upper budget on output length. This makes the observed null result expected under a length-capped regime and limits the design's ability to distinguish a policy that simply ignores this control from one fixed at training. Evidence that higher-effort settings produce measurably deeper traces, additional intermediate steps, branching differences, or accuracy gains beyond the length cap is needed to support the crystallized-vs-online distinction.

Authors: We agree that the manipulation check establishes the effort parameter as an upper bound on generation length rather than a real-time allocator of reasoning depth. This is precisely why the invariance result is informative for our interpretation. If the model maintained an online policy capable of modulating reasoning effort in response to the control signal, we would expect at least some adjustment in token-allocation patterns, alignment strength, or performance when additional budget is made available, even if capped. The fact that alignment remains stable, token usage continues to track fine-grained human difficulty contrasts, and no systematic changes in reasoning structure appear across settings indicates that the core allocation policy does not respond to the inference-time cue. The model does process the parameter (by respecting the length limit), which weakens the alternative that it simply ignores the control entirely. We acknowledge that stronger evidence of deeper traces or branching under higher settings would further bolster the distinction, but such evidence is not observed precisely because the manipulation does not elicit online adjustment. In revision we will expand the discussion to clarify these limits and the evidential basis for the compiled-policy account.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical statistical comparisons are self-contained

The paper reports an empirical study measuring alignment of cognitive costs between humans and LRMs across effort levels, tasks, and model scales. Central results rest on observed data processed via Bayes factors and mean alignment statistics, with a manipulation check based on measured generation lengths. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work appear in the abstract or described chain. The invariance claim follows directly from statistical tests on independent measurements rather than any renaming or circular reduction, satisfying the criteria for a non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical interpretation of Bayes factors and empirical measurements of alignment; no free parameters, new axioms, or invented entities are introduced beyond domain-standard assumptions about reaction time and token count as proxies.

axioms (1)

standard math: Bayes factors can be interpreted as evidence for the null hypothesis of invariance
Invoked to conclude alignment remains unchanged across effort conditions.
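To make this axiom concrete, the sketch below computes a JZS Bayes factor for a paired (one-sample) t statistic by numerically integrating over the g prior, using the commonly used integral form for this test. Two things are assumptions and not taken from the paper: the Cauchy prior scale of √2/2 (a frequent software default) and the simple midpoint integration grid.

```python
import math


def jzs_bf10(t, n, r_scale=math.sqrt(2) / 2, grid=20000):
    """JZS Bayes factor BF10 for a one-sample or paired t statistic.

    t        observed t statistic
    n        number of paired observations
    r_scale  scale of the Cauchy prior on effect size (assumed default)
    grid     number of midpoint-rule nodes for the numerical integral
    """
    nu = n - 1
    # Likelihood term under H0 (effect size exactly zero).
    null_term = (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)

    # Marginal likelihood under H1: integrate over g in (0, inf), where
    # effect size | g ~ Normal(0, r_scale^2 * g) and g ~ inverse-chi-square(1).
    # Substituting g = u / (1 - u) maps the range onto u in (0, 1).
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid
        g = u / (1.0 - u)
        jacobian = 1.0 / (1.0 - u) ** 2
        scale = 1.0 + n * g * r_scale ** 2
        likelihood = scale ** -0.5 * (1.0 + t * t / (scale * nu)) ** (-(nu + 1) / 2.0)
        prior = (2.0 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1.0 / (2.0 * g))
        total += likelihood * prior * jacobian
    alt_term = total / grid

    return alt_term / null_term


# Hypothetical t values for n = 12 paired observations.
for t_value in (0.0, 1.0, 5.0):
    print(t_value, round(jzs_bf10(t_value, 12), 3))
```

Values of BF10 below 1 favor the null and values above 1 favor a difference. The Figure 5 caption places the reported factors between 1/3 and 1, which it describes as inconclusive but leaning toward the null, so the axiom carries the invariance claim only weakly.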

pith-pipeline@v0.9.0 · 5707 in / 1125 out tokens · 38335 ms · 2026-05-19T20:42:39.021647+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: token count within the <think> delimiters, serving as the proxy for inference-time computational cost

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · unclear
Paper passage: Effort Invariance null (H3) predicts that alignment will remain stable across effort conditions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

