
arxiv: 2606.07157 · v1 · pith:KOKUW7ESnew · submitted 2026-06-05 · 💻 cs.AI

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Pith reviewed 2026-06-27 22:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI capabilities, no-CoT reasoning, time horizons, frontier models, chain-of-thought, benchmarks, AI safety, task completion

The pith

Frontier AI models without chain-of-thought reasoning now solve tasks that take humans over three minutes at 50 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures frontier models' ability to solve tasks without producing explicit chain-of-thought reasoning by evaluating them on over 30,000 questions across 43 benchmarks in math, coding, puzzles, causality, theory of mind, and strategic reasoning. It defines the 50 percent task-completion time horizon as the human time needed for tasks a model completes at 50 percent success and finds this horizon has doubled roughly every year for the past six years. The latest models reach over three minutes, with a reasoning token horizon above 1,500 tokens. Median projections indicate these horizons could exceed seven minutes by 2028 and 25 minutes by 2030. The work argues that frontier developers should track this metric explicitly because internal reasoning would reduce the effectiveness of oversight that relies on inspecting generated thoughts.

Core claim

The no-CoT 50 percent task-completion time horizon of frontier models has doubled roughly every year over the past six years, with GPT-5.5 reaching over three minutes and a reasoning token horizon exceeding 1,500 tokens; median estimates project horizons above seven minutes by 2028 and 25 minutes by 2030.

What carries the argument

The 50 percent task-completion time horizon (TH), defined as the human time required for tasks a model completes with 50 percent success rate, paired with the 50 percent reasoning token horizon.
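
One way to make this definition concrete is a logistic fit of success against log human time, with the horizon read off where the fitted curve crosses 50 percent. The sketch below is illustrative only: the function names, the Newton solver, and the choice of log base 2 are assumptions rather than the paper's code, and it omits chance correction and the uncertainty model on estimated solve times.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(log2_minutes, outcomes, steps=50):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by Newton's method.

    `outcomes` may be 0/1 results per problem or success rates per time bucket.
    Hypothetical helper for illustration; not taken from the paper.
    """
    a = b = 0.0
    for _ in range(steps):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(log2_minutes, outcomes):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood w.r.t. a
            g_b += (y - p) * x      # gradient of the log-likelihood w.r.t. b
            h_aa += w               # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:             # degenerate data: stop rather than divide
            break
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def horizon_minutes(a, b, p=0.5):
    """Human time (minutes) at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)
```

Calling `horizon_minutes(a, b)` on the fitted coefficients gives the 50 percent horizon; passing a different `p` gives a horizon at a stricter or looser reliability level under the same fitted curve.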

If this is right

  • Models are approaching the ability to complete multi-minute tasks through internal reasoning alone.
  • Safety monitoring that depends on inspecting chain-of-thought outputs will become less reliable as horizons grow.
  • Frontier developers should begin explicit tracking of no-CoT time horizons alongside other capability metrics.
  • Projections indicate no-CoT capabilities will reach tasks requiring 25 minutes of human effort by 2030 if the observed trend continues.
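
The arithmetic behind a constant-doubling projection can be sketched as follows. The 373-day default is the doubling time quoted in the Figure 1 caption; the function names, start date, and starting horizon used with them are illustrative assumptions, and the paper's own median projections come from its fitted trend with confidence intervals rather than from this point calculation.

```python
import math
from datetime import date, timedelta

def project_horizon(th_now_min, as_of, target, doubling_days=373.0):
    """Extrapolate a time horizon assuming a constant doubling time."""
    elapsed_days = (target - as_of).days
    return th_now_min * 2.0 ** (elapsed_days / doubling_days)

def days_until(th_now_min, th_target_min, doubling_days=373.0):
    """Days needed to grow from th_now_min to th_target_min at that rate."""
    return doubling_days * math.log2(th_target_min / th_now_min)
```

For example, `days_until(3.0, 24.0)` returns three doubling periods, since 24 minutes is three doublings above 3 minutes.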

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continued doubling would mean models soon handle problems that take expert humans hours of effort, without any visible intermediate steps.
  • Safety research would need to develop new methods beyond output inspection, such as activation monitoring or behavioral testing on longer tasks.
  • Benchmarks focused on tasks with 5- to 30-minute human completion times would be required to measure further progress accurately.

Load-bearing premise

Performance on the chosen 43 benchmarks at the 50 percent success threshold provides a valid proxy for general no-CoT reasoning ability that can be directly compared to human task-completion times.

What would settle it

A new benchmark suite where current frontier models reach 50 percent success only on tasks that take humans under one minute would falsify both the reported horizons and the doubling trend.

Figures

Figures reproduced from arXiv: 2606.07157 by Alex Serrano, Anders Cairns Woodruff, Ariana Azarbal, Dewi Gould, Elle Najt, Francis Rhys Ward, Harry Mayne, Ida Caspary, Ionut Gabriel Stan, Jason Ross Brown, Jo J. Jiao, Josh Hills, Julian Stastny, Patrick Leask, Ram Potham, Rauno Arike, Ryan Greenblatt, Shubhorup Biswas, Simeon Hellsten, Twm Stone, William L. Anderson.

Figure 1
Figure 1. Figure 1: The length of tasks frontier models complete with 50% reliability without CoT has doubled approximately every 373 days. Frontier models now attain estimated no-CoT THs of over 3 minutes, and our median projection has this exceeding 25 minutes by the end of the decade. Shaded band shows 95% CIs. The right-hand axis shows each model’s reasoning-token horizon. The two anchors are related by a sub-linear power… view at source ↗
Figure 2
Figure 2. Figure 2: No-CoT THs (ours) compared to with-CoT THs (from Kwa et al. [14]). Until the release of GPT-4, with- and without-CoT THs increased at a similar rate. Since GPT-4, with-CoT THs have grown at roughly twice the rate of no-CoT THs. 2 Related work Reasoning models. LLMs can be prompted to generate CoT reasoning, improving benchmark performance [16]. Recent reasoning models are trained with reinforcement learning… view at source ↗
Figure 3
Figure 3. Figure 3: Calibrating model estimates of human solve time. We few-shot prompt Opus 4.7 with a small sample of in-domain human times. Each point is a held-out problem on one of the benchmarks with real human solve times. The shaded region shows calibrated uncertainty bounds. Modelling the uncertainty in estimated times. We quantify the uncertainty in model estimates by fitting a single conditional Gaussian to (log… view at source ↗
Figure 4
Figure 4. Figure 4: Both time and reasoning token horizons have doubled roughly every year historically. Left: THs have doubled every 373 days and currently exceed 3 minutes. Right: Token horizons have doubled every 437 days and now exceed 1,500 tokens. Exponential growth on both anchors ( [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Success rates for tasks with varying human solve times (left) and o3-mini reasoning token counts (right). The bars show aggregated, normalised success-rates with bootstrapped 95% CIs, for a selection of models. We overlay the logistic curves determined by fitting to problem-level success rates. GPT-5.5 has the longest estimated 50% no-CoT problem-completion TH at 3 min. Today’s frontier and the projection.… view at source ↗
Figure 6
Figure 6. Figure 6: Per-benchmark 50% time horizons. Error bars are 95% bootstrapped CIs and are large where fits are unstable. Benchmarks are ordered by average problem time. Why the time and token horizon slopes differ. The two doubling times differ slightly because each Pareto frontier selects a partially different subset of models: GPT-4-turbo and GPT-4o sit on the token frontier but not the time frontier, while Opus-4.6 … view at source ↗
Figure 7
Figure 7. Figure 7: The effect of filler tokens (top) and question repeats (bottom) is variable across benchmarks. We observe that a single question repeat offers performance improvement for N-hop lookup and Scheming (numeric), but that the effect of filler tokens is essentially negligible. 4.2 Robustness of our results The headline doubling time is robust to the choice of time-estimate uncertainty model, leaving out any one … view at source ↗
Figure 8
Figure 8. Figure 8: The no-CoT time-horizon trend is robust to adding longer-form tasks. We recompute the 50% THs after expanding the main short-answer benchmark suite. Left: Adding generation tasks leaves the estimated THs and doubling time broadly unchanged. Right: Adding generation and multi-turn agentic tasks modestly increases THs. 7We exclude benchmarks for which a model’s dynamic range (the difference between its norma… view at source ↗
Figure 9
Figure 9. Figure 9: Per-category no-CoT TH trends show suggestive but noisy domain variation. Doubling times vary across task categories, suggesting possible domain-specific differences in no-CoT capability growth, but these estimates are noisy because several categories contain few benchmarks. 4.3 Open-weight model parameter scaling No-CoT THs increase with model size and layer count ( [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗
Figure 10
Figure 10. Figure 10: Open-weight model no-CoT THs as a function of total parameter and layer count. Doubling the 50% TH requires a 4.2× increase in total parameters and a 1.3× increase in the layer count. The best model, Kimi K2.5, attains a TH of 1.10 minutes (not directly comparable to whole￾suite THs since this ablation uses a 25-benchmark subset; see Section 3.4). The pooled slope masks a substantial dense/MoE difference … view at source ↗
Figure 11
Figure 11. Figure 11: Success rates for tasks with varying human solve times. The bars show aggregated, normalised success-rates with bootstrapped 95% CIs, for a selection of models. We overlay logistic curves determined by fitting to problem-level success rates. [Plot residue: per-model panels (GPT-3, GPT-3.5, …) of chance-corrected success rate (%) on a token axis from 1 to 64k, each annotated with its token horizon.] view at source ↗
Figure 12
Figure 12. Figure 12 [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Change in odds ratio of success/failure with human solve time doubling. We observe that the majority of (model, benchmark) pairs have odds ratio around 2, but that some tasks (Sudoku, N-Hop Lookup) have significantly higher odds ratios. one dot per model, one column per benchmark, columns ordered by median slope. For each pair we fit P(correct | t) = σ [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Change in odds ratio of success/failure with o3-mini token count doubling. In the vast majority of cases the (model, benchmark) odds ratios are at or above 2. [Plot residue: real solve time $t_{\text{real}}$ against estimated solve time $t_{\text{est}}$, both on axes from 1 s to 1 hr, with fitted conditional mean, n = 1075.] view at source ↗
Figure 15
Figure 15. Figure 15: Calibration of time estimator. Grey points are the $n = 1075$ paired $(t_{\text{real}}, t_{\text{est}})$ observations from our in-domain ICL calibration sweep ($t_{\text{est}}$ generated using 50 in-domain anchors, pooled across seeds). The blue line is the fitted conditional mean $E[\log t_{\text{real}} \mid \log t_{\text{est}}] = \alpha + \beta \log t_{\text{est}}$ with $(\alpha, \beta, \sigma) = (-0.28, 1.13, 0.71)$ and overall correlation $\rho = 0.92$; the shaded band is the $\pm 2\sigma$ envelope of the condit… view at source ↗
Figure 16
Figure 16. Figure 16: Calibration of time estimator (split Gaussian). Grey points show in-domain held-out (real, estimated) solve time pairs ($n = 1{,}075$). Solid lines show the conditional mean $E[\log t_{\text{real}} \mid \log t_{\text{est}}] = \alpha_r + \beta_r \log t_{\text{est}}$ of the fitted log-normal noise model in each regime $r \in \{\text{low}, \text{high}\}$ (split at $t_{\text{est}} = 1$ min, dotted horizontal line). Shaded bands show the corresponding $\pm 2\sigma_r$ envelopes: the central ∼95% of the fi… view at source ↗
Figure 17
Figure 17. Figure 17: Canonical trend fitted with a two-regime split-Gaussian time-uncertainty layer. Same models, benchmarks, horizon floor, and exclusion set as the main-text canonical trend; only the noise model on estimated solve times differs (split Gaussian here vs single Gaussian in the main text). The indistinguishable doubling times and CI envelopes indicate that our analysis is insensitive to this particular choice o… view at source ↗
Figure 18
Figure 18. Figure 18: Sensitivity of trend doubling time to the time horizon floor (GPT-2 + GPT-3 included in the fit). GPT-2 and GPT-3 both saturate at near-chance accuracy, so their 50% TH falls close to or below the resolution of our methodology. Any single trend line forces an arbitrary choice of floor at which to clip these horizons during the bootstrap. The doubling time moves monotonically with the floor, demonstrating … view at source ↗
Figure 19
Figure 19. Figure 19: Comparison of TH trend lines using only questions with real human solve time data, versus our full benchmark suite. Using only questions with real human solve times leads to a reduction in the absolute values of THs in most cases, but an increase in the doubling rate. Confidence regions are large in this setting due to a smaller number of benchmarks available with real human solve time. A.7 Effect of long… view at source ↗
Figure 20
Figure 20. Figure 20: Comparison of TH trend lines using only short-answer tasks, and including generation tasks. Doubling times and individual point-estimates are insensitive to the addition of longer-form generation tasks. [Plot residue: x-axis model release date, Jan 2019 to Jan 2028; y-axis o3-mini reasoning-token anchor for 50% success rate, 1 to 4k.] view at source ↗
Figure 21
Figure 21. Figure 21: Comparison of reasoning token horizon trend lines using only short-answer tasks, and including generation tasks. Doubling times and individual point-estimates are insensitive to the addition of longer-form generation tasks. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Comparison of TH trend lines using all tasks including short-answer, generation, and multi-turn agentic. The most significant impact of including the multi-turn agentic tasks is an increase in the point-estimate THs for the latest frontier models which perform very well on these tasks. math and science to 387 days for language and reasoning. These are all highly noisy estimates given the reduction in data… view at source ↗
Figure 23
Figure 23. Figure 23: Exponential, hyperbolic and linear fits to the TH data. Curves are fit to the median per-model TH estimates. reasoning tokens), which clarifies the residual spread: per-benchmark horizons tend to track the underlying time scale of the benchmark itself, with longer-horizon benchmarks (e.g. Codeforces, HCAST, agentic terminal tasks) producing higher per-benchmark horizons than short-answer tasks. A.11 Robus… view at source ↗
Figure 24
Figure 24. Figure 24: 50% THs per-benchmark. One panel per model; bars give the per-benchmark 50% TH with 95% bootstrap CIs, colored by the benchmark’s mean human solve time. Benchmarks where the dynamic range of normalised scores is below 0.3 or where the bootstrap CI exceeds 30 minutes are omitted. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: 50% reasoning token horizons per-benchmark. One panel per model; bars give the per-benchmark 50% reasoning token horizon with 95% bootstrap CIs, colored by the benchmark’s mean o3-mini reasoning token count. Benchmarks where the dynamic range of normalised scores is below 0.3 or where the bootstrap CI exceeds 30 minutes are omitted. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Token-anchor trend with zero-token questions dropped. Pareto frontier fit on the o3-mini reasoning token anchor, with every question whose o3-mini min reasoning token count is exactly 0 excluded from the fit (compare the main-text token-anchor trend, which clips them to a floor of 1 token). [Plot residue: chance-corrected accuracy (%), pooled across benchmarks, against number of filler tokens N from 0 to 1000.] view at source ↗
Figure 27
Figure 27. Figure 27: Effect of filler tokens and question repeats. Aggregated across N benchmarks, we observe no effect of filler tokens, but some moderate improvement with non-zero question repeats for GPT-5.4, and the Opus models. In [PITH_FULL_IMAGE:figures/full_fig_p033_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Per-model log-log scatter of TH vs token horizon. Each dot is one of our frontier models (GPT-2 and GPT-3 excluded); axes are the canonical 50% horizons: o3-mini reasoning tokens on the x-axis and human solve minutes on the y-axis. Whiskers are bootstrap 95% CIs in both axes. The dashed black line is the least-squares fit in log-log space (slope ≈ 0.85, $R^2$ ≈ 0.95). • SWE and cyber [PITH_FULL_IMAGE:figure… view at source ↗
Figure 29
Figure 29. Figure 29: No-CoT performance in N-hop reasoning benchmark as function of N. State of the art models can reliably track data over N = 5 or 6 hops. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Per-bucket success rates and logistic fits for all open-weight models. 41 [PITH_FULL_IMAGE:figures/full_fig_p041_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: No-CoT performance in open-weight models as a function of the active parameter count. Only four models in our panel sit on the active-parameter Pareto frontier, so the fitted slope is sensitive to single-model perturbations and should be interpreted with caution [PITH_FULL_IMAGE:figures/full_fig_p042_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: No-CoT performance in open-weight models as a function of training FLOPs. Pretraining FLOPs are estimated as $C_{\text{pre}} \approx 6 \times N_{\text{active}} \times T$ following Hoffmann et al. [39], with disclosed values used where available. Total-training-FLOP scaling is reported on a 16-model subset whose technical reports or Epoch AI AI Models database [40] entries disclose enough to estimate total compute (most Qwen3.5 and Gemma 4 var… view at source ↗
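
The compute estimate in the Figure 32 caption is a one-line formula. As a hedged illustration, the snippet below evaluates it for a hypothetical model; the function name and the parameter and token counts are made up for the example and do not describe any model in the paper.

```python
def pretraining_flops(n_active_params, n_training_tokens):
    """Approximate pretraining compute as 6 * N_active * T (FLOPs)."""
    return 6.0 * n_active_params * n_training_tokens

# Hypothetical model: 7e9 active parameters trained on 2e12 tokens.
example_flops = pretraining_flops(7e9, 2e12)
```

Because the formula uses active rather than total parameters, a mixture-of-experts model is charged only for the parameters used per token.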
Figure 33
Figure 33. Figure 33: No-CoT performance in open-weight models as a function of their Artificial Analysis Intelligence Index score. 42 [PITH_FULL_IMAGE:figures/full_fig_p042_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: No-CoT performance in open-weight models as a function of the total parameter count, broken down by dense vs. MoE models [PITH_FULL_IMAGE:figures/full_fig_p044_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: No-CoT performance in open-weight models as a function of the active parameter count, broken down by dense vs. MoE models. A.16.7 TH in non-reasoning and hybrid-reasoning models Below, we compare the TH scaling trends across non-reasoning and hybrid-reasoning models (Figure 37). We observe a substantially steeper scaling curve for non-reasoning than for hybrid-reasoning models. We believe that this is un… view at source ↗
Figure 36
Figure 36. Figure 36: No-CoT performance in open-weight models as a function of layer count, broken down by dense vs. MoE models. 44 [PITH_FULL_IMAGE:figures/full_fig_p044_36.png] view at source ↗
Figure 37
Figure 37. Figure 37: No-CoT performance in open-weight models as a function of the total parameter count, broken down by non-reasoning vs. hybrid-reasoning models. A.16.8 TH in text-only vs. vision-language models In [PITH_FULL_IMAGE:figures/full_fig_p045_37.png] view at source ↗
Figure 38
Figure 38. Figure 38: No-CoT performance in open-weight models as a function of the total parameter count, broken down by text-only vs. vision-language models. A.16.9 TH in different model families We analyze per-family no-CoT TH scaling in five families where we evaluate 3+ models: Qwen 3, Qwen 3.5, Llama 3.n, Gemma 3, and Mistral Ministral. We observe very clean within-family scaling behavior in three families: the R2 values… view at source ↗
Figure 39
Figure 39. Figure 39: No-CoT performance in open-weight models from the same model family. In contrast to all other plots in this section, we plot the bootstrap medians rather than point estimates for visual clarity [PITH_FULL_IMAGE:figures/full_fig_p046_39.png] view at source ↗
Figure 40
Figure 40. Figure 40: Effect of structured outputs on benchmarks with high non-compliance for Opus 4.7. B.3 In-context calibration of model time estimates Many tasks in our benchmark rely on model-estimated human solve times. We can improve these estimates by giving the model labelled examples. We test two settings: • In-domain: examples come from the same task (e.g. use 10 labelled Sudoku puzzles to estimate times for held-ou… view at source ↗
Figure 41
Figure 41. Figure 41: MALR vs number of in-context examples on the matched cohort of 11 benchmarks. Cross-domain few-shot (blue) is essentially flat across N, plateauing at an 8% improvement over the zero-shot baseline (grey dashed). In-domain few-shot (orange) cuts MALR by ∼60% at N=10 and continues to improve up to N=100. Error bars are SEM across benchmarks. [Plot residue: per-benchmark axis labels omitted.] view at source ↗
Figure 42
Figure 42. Figure 42: Per-benchmark MALR by condition. Each cluster shows the ad-hoc heuristic estimate (purple, where defined), zero-shot (grey), best cross-domain (blue), and best in-domain few-shot (orange). Benchmarks are sorted by zero-shot MALR. and Puzzle Baron (0.72 → 0.06). On the 14 benchmarks where a hand-engineered ad-hoc estimate also exists (e.g. “1 minute per A-Level mark”, scaled medalist times for LingOly, an… view at source ↗
Figure 43
Figure 43. Figure 43: Rank correlation between reasoning-token usage and no-CoT accuracy. Left: Matrix of Spearman ρ between reasoning-model token counts (columns) and no-CoT model accuracy (rows), pooled across all (benchmark, question) pairs. Negative values indicate that problems requiring more reasoning tokens tend to be those with lower no-CoT accuracy. Right: Pooled ρ per reasoning model, with 95% CIs from a two-level cl… view at source ↗
Figure 44
Figure 44. Figure 44: Rank correlation between reasoning-token usage and no-CoT accuracy, extended to 16 tasks for the 5 reasoning models that ran on all of them. o3-mini’s pooled ρ ≈ −0.7 is the most negative, though CIs overlap with the other models. [Plot residue: token-count axis from 16 to 16k, n = 18310, with per-benchmark legend labels.] view at source ↗
Figure 45
Figure 45. Figure 45: o3-mini reasoning token count versus human-solve time. B.4.3 Uncertainty in reasoning token anchor The minimum reasoning-token count over a question’s correct attempts is itself a noisy estimator, because we may not have sampled the true minimum reasoning-token count with only k attempts. To quantify this noise, for each question we resample its correct attempts with replacement and recompute the minimum … view at source ↗
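
The resampling step described in the B.4.3 excerpt above can be sketched as below. This is an assumption-laden illustration: the function name, the number of bootstrap draws, and the seed are hypothetical, and the excerpt is truncated before it says how the resampled minima are summarised.

```python
import random

def bootstrap_min_tokens(correct_attempt_tokens, n_boot=1000, seed=0):
    """Resample a question's correct attempts with replacement and
    recompute the minimum reasoning-token count for each resample."""
    rng = random.Random(seed)
    k = len(correct_attempt_tokens)
    return [min(rng.choices(correct_attempt_tokens, k=k)) for _ in range(n_boot)]
```

Each resampled minimum is at least the observed minimum, so the spread of the returned list only shows how often the observed minimum would have been missed with the same number of attempts; it cannot reveal a lower true minimum that was never sampled.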
Figure 46
Figure 46. Figure 46: Human baselining web app. Participants signed up, selected benchmark problems, and submitted final answers through the web app. The interface recorded task start and submission times, active solve time, answers, attempt numbers, and grading status. 58 [PITH_FULL_IMAGE:figures/full_fig_p058_46.png] view at source ↗
Figure 47
Figure 47. Figure 47: Real and estimated human solve time distributions for all benchmarks. Median and interquartile range of time estimates in each benchmark (jitter added for clarity). Real times are used where available or collected, and model estimates are used otherwise, following Section 3.1. “ICL” denotes estimates generated using in-context learning. 59 [PITH_FULL_IMAGE:figures/full_fig_p059_47.png] view at source ↗
Figure 48
Figure 48. Figure 48: Exponential fits of human completion time versus puzzle difficulty rating at the 10th, 25th, 50th, 75th, and 90th percentiles, computed from the filtered dataset of 33,345 Lichess puzzles. Domain: Reasoning. Estimated time range: 0.14 to 1.07 minutes. Scoring: The model is asked to provide the move in Universal Chess Interface notation (UCI) or Standard Algebraic Notation (SAN). The scorer then validates … view at source ↗
read the original abstract

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript measures frontier AI models' no-CoT reasoning performance across 43 benchmarks spanning math, coding, puzzles, causality, theory-of-mind, and strategic reasoning (over 30,000 questions total). It defines the 50% task-completion time horizon (TH) as the estimated human time required for tasks a model solves at 50% success rate, and a complementary 50% reasoning token horizon. The central claims are that no-CoT 50% TH has doubled roughly every year over the past six years, reaching >3 minutes for GPT-5.5 with a reasoning token horizon >1,500 tokens, and that median extrapolations predict >7 minutes by 2028 and >25 minutes by 2030.

Significance. If the benchmark suite and human-time mapping constitute a valid proxy for general no-CoT reasoning ability, the work supplies a concrete, time-based metric for tracking the erosion of CoT-based oversight and supplies falsifiable near-term predictions. The empirical trend over six years and the dual TH/token-horizon framing are strengths that could inform safety monitoring practices.

major comments (3)
  1. [Abstract] Abstract: The abstract states the doubling trend, GPT-5.5 values, and 2028/2030 projections but supplies no information on benchmark selection criteria, the method used to estimate human task-completion times, the statistical procedure for fitting the yearly doubling rate, or any error analysis. These omissions are load-bearing because the central claim that the 50% TH validly proxies general reasoning rests on the representativeness and calibration of the chosen tasks.
  2. [Methods] Methods (benchmark construction): The 50% success threshold is used to define task difficulty for both the time and token horizons, yet the manuscript provides no validation that tasks at this threshold correspond to real-world human task distributions or that the 43 benchmarks are not skewed toward short, multiple-choice, or AI-evaluation artifacts. This directly affects whether the observed trend can be extrapolated.
  3. [Results] Results (projections): The 2028 and 2030 forecasts are described as median estimates based on the observed yearly doubling trend. No sensitivity analysis to benchmark subset, alternative functional forms, or uncertainty quantification around the fitted rate is reported, rendering the projections an extrapolation of the same historical data used to establish the trend.
minor comments (2)
  1. [Abstract] The abstract states 'over 30,000 questions' but does not clarify whether all items are retained after filtering or how per-benchmark sample sizes affect the 50% threshold estimates.
  2. [Methods] Notation for the time horizon (TH) and reasoning token horizon should be introduced with an explicit equation in the methods section rather than only in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate where revisions will strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract states the doubling trend, GPT-5.5 values, and 2028/2030 projections but supplies no information on benchmark selection criteria, the method used to estimate human task-completion times, the statistical procedure for fitting the yearly doubling rate, or any error analysis. These omissions are load-bearing because the central claim that the 50% TH validly proxies general reasoning rests on the representativeness and calibration of the chosen tasks.

    Authors: We agree the abstract would benefit from additional context. While full methodological details appear in the Methods and Results sections, we will revise the abstract to briefly note the benchmark selection (43 established datasets spanning math, coding, puzzles, causality, theory-of-mind, and strategic reasoning), the human-time estimation approach, the exponential fitting for the doubling rate, and the reported uncertainty in projections.
    revision: yes

  2. Referee: [Methods] Methods (benchmark construction): The 50% success threshold is used to define task difficulty for both the time and token horizons, yet the manuscript provides no validation that tasks at this threshold correspond to real-world human task distributions or that the 43 benchmarks are not skewed toward short, multiple-choice, or AI-evaluation artifacts. This directly affects whether the observed trend can be extrapolated.

    Authors: The 43 benchmarks were drawn from established public datasets chosen to cover diverse no-CoT reasoning domains with over 30,000 questions total. The 50% threshold is used consistently to define the horizon. We acknowledge that direct empirical validation against real-world task distributions is not performed and will add an expanded limitations section discussing potential selection effects and the difficulty of such calibration.
    revision: partial

  3. Referee: [Results] Results (projections): The 2028 and 2030 forecasts are described as median estimates based on the observed yearly doubling trend. No sensitivity analysis to benchmark subset, alternative functional forms, or uncertainty quantification around the fitted rate is reported, rendering the projections an extrapolation of the same historical data used to establish the trend.

    Authors: The projections are explicitly labeled as median extrapolations accompanied by a statement of substantial uncertainty. We will add sensitivity checks on the fitted doubling rate, alternative functional forms, and bootstrap-based uncertainty quantification around the trend in the revised Results section.
    revision: yes

Circularity Check

1 steps flagged

Future TH projections reduce to extrapolation of fitted historical doubling rate

specific steps
  1. fitted input called prediction [Abstract]
    "We find that the no-CoT 50% TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty."

    The doubling rate is fitted directly to the observed historical performance data across models; the 'median estimates' for future years are obtained by applying that same fitted rate forward, making the numerical predictions statistically forced consequences of the input trend rather than an independent derivation.

full rationale

The paper measures current no-CoT 50% TH values on the benchmark suite and observes a doubling trend over six years of historical model data. The 2028/2030 projections are then generated by extending that fitted rate, which matches the fitted_input_called_prediction pattern. No other circular steps (self-definitional, self-citation load-bearing, etc.) are present in the provided text; the benchmark-to-TH mapping and trend observation are independent of the forward extrapolation.
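
The dependence flagged here can be seen in a minimal fit-then-extrapolate sketch: once a doubling time is fitted to historical points, any future value is a deterministic function of that fit. The code assumes an ordinary least-squares line through log2(TH) against release day, which is a simplification and not necessarily the paper's fitting procedure; the function names are hypothetical.

```python
import math

def fit_doubling_time(days, horizons_min):
    """Least-squares line through log2(horizon) vs. day.

    Returns (doubling_days, intercept), where intercept is log2(horizon) at day 0.
    """
    ys = [math.log2(h) for h in horizons_min]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    slope = sxy / sxx                      # doublings per day
    intercept = mean_y - slope * mean_x
    return 1.0 / slope, intercept

def extrapolate(day, doubling_days, intercept):
    """Horizon implied by the fitted line at an arbitrary day."""
    return 2.0 ** (intercept + day / doubling_days)
```

Nothing in `extrapolate` uses information beyond the two fitted numbers, which is the sense in which the projections are consequences of the trend rather than independent predictions.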

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Projections rest on an observed doubling rate fitted to past model performance; the 50% threshold and human-time mapping are domain assumptions without further justification in the abstract.

free parameters (1)
  • yearly growth factor = roughly 2× per year (a doubling time of about one year)
    The rate used for historical trend and future projections is derived from observed performance changes over six years.
axioms (1)
  • domain assumption: Benchmarks at 50% success rate measure meaningful no-CoT reasoning ability comparable to human task time.
    This defines the TH metric and enables the human comparison.
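
One common way to operationalize a 50% threshold of this kind is to fit a logistic curve of success against log human time and read off where it crosses 0.5. The sketch below illustrates that idea only; the section does not confirm that the paper estimates TH this way, the trial data are hypothetical, and `fit_logistic` and `fifty_percent_horizon` are names invented for this example.

```python
import math

def fit_logistic(log2_minutes, successes, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the mean log-likelihood. Returns (a, b)."""
    a, b = 0.0, 0.0
    n = len(successes)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(log2_minutes, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def fifty_percent_horizon(a, b):
    """Human time in minutes at which the fitted success probability is 0.5."""
    return 2.0 ** (-a / b)

# HYPOTHETICAL trials: human solve time in minutes -> (successes, attempts).
HYPOTHETICAL_TRIALS = {
    0.25: (4, 4), 0.5: (4, 4), 1: (3, 4), 2: (2, 4),
    4: (1, 4), 8: (0, 4), 16: (0, 4),
}

xs, ys = [], []
for minutes, (wins, attempts) in HYPOTHETICAL_TRIALS.items():
    for i in range(attempts):
        xs.append(math.log2(minutes))
        ys.append(1.0 if i < wins else 0.0)

a, b = fit_logistic(xs, ys)
horizon = fifty_percent_horizon(a, b)
print(f"estimated 50% time horizon: {horizon:.2f} minutes")
```

The toy success rates are symmetric around the 2-minute tasks, so the fitted horizon lands at about 2 minutes. The estimate depends on the human-time labels as much as on model accuracy, which is why the human-time mapping is listed as an assumption.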



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Transparent is DiffusionGemma?

    cs.LG · 2026-06 · unverdicted · novelty 6.0

    DiffusionGemma matches Gemma 4 in variable transparency and monitorability after applying an interpretable token bottleneck, despite higher naive serial depth, and shows novel phenomena such as non-chronological reasoning.
