arxiv: 2604.14140 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Pith reviewed 2026-05-10 12:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: long-horizon reasoning, chain-of-thought, benchmark, language models, planning, reasoning evaluation, autonomous agents

The pith

Frontier language models achieve under 10 percent accuracy on problems that demand sustained reasoning across tens to hundreds of thousands of tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongCoT as a benchmark of 2500 problems in chemistry, mathematics, computer science, chess, and logic. Each problem has a short input and a verifiable final answer, yet its solution requires traversing a graph of interdependent steps that spans tens to hundreds of thousands of reasoning tokens. Because every local step is meant to be individually solvable by current models, low overall scores isolate the difficulty of maintaining accuracy and coherence over extended horizons. This measurement matters for autonomous tasks that depend on reliable multi-step planning without drift or error accumulation. The reported results show top models solving fewer than one in ten problems, establishing a concrete performance baseline for this capability.

Core claim

LongCoT is a benchmark of 2500 expert-designed problems whose solutions form graphs of interdependent steps spanning tens to hundreds of thousands of reasoning tokens; each local step is tractable for frontier models, so the observed accuracies below 10 percent demonstrate that current systems lack reliable long-horizon chain-of-thought reasoning.

What carries the argument

The LongCoT benchmark: a collection of problems with short inputs, verifiable answers, and solution paths that form graphs of interdependent steps spanning tens to hundreds of thousands of reasoning tokens.
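
To make the "graph of interdependent steps" concrete, here is a minimal sketch of a toy problem evaluated in dependency order. The node names, the arithmetic, and the `solve` helper are hypothetical illustrations, not taken from the benchmark; the point is only that a single local slip in an upstream node changes the final, verifiable answer.

```python
from graphlib import TopologicalSorter

# Hypothetical toy problem: each node maps to (dependencies, function of their values).
# Real LongCoT graphs are described as far larger; this is only an illustration.
steps = {
    "a": ((), lambda: 3),
    "b": ((), lambda: 4),
    "c": (("a", "b"), lambda a, b: a * b),
    "d": (("c", "a"), lambda c, a: c + a),
    "answer": (("d", "b"), lambda d, b: d % b),
}


def solve(steps, corrupt=None):
    """Evaluate every node in dependency order; optionally inject one wrong value."""
    graph = {name: deps for name, (deps, _) in steps.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = steps[name]
        values[name] = fn(*(values[d] for d in deps))
        if name == corrupt:
            values[name] += 1  # simulate a single local slip at this node
    return values["answer"]


if __name__ == "__main__":
    print(solve(steps))               # clean run
    print(solve(steps, corrupt="c"))  # one upstream error propagates to the answer
```

Because only the final answer is checked, the grader in this sketch cannot tell where the slip happened, which is why the per-step tractability premise matters for interpreting the scores.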

Load-bearing premise

That each local step is individually tractable for frontier models, allowing failures to be attributed specifically to long-horizon reasoning limitations rather than knowledge gaps or other factors.

What would settle it

A model that scores above 50 percent on the full LongCoT problems while also scoring near 100 percent when tested only on the isolated local steps used inside those problems would show that the benchmark does not isolate long-horizon limitations.

Figures

Figures reproduced from arXiv: 2604.14140 by Acer Blake, Akshat Naik, Alesia Ivanova, Ameya Prabhu, Bhavya Kailkhura, Brian Bartoldson, Charles London, Christian Schroeder de Witt, Daniel Nichols, Fabio Pizzati, Hasan Hammoud, Ivan Laptev, Natasha Jaques, Peggy Li, Philip Torr, Ruben Glatt, Sumeet Ramesh Motwani, Tal Ben-Nun, Tavish McDonald, Vignesh Baskaran.

Figure 1: Accuracy versus token usage on LongCoT. GPT 5.2 achieves 9.83% with an average of 62K output tokens per problem.
Figure 2: LongCoT problems demand long-horizon reasoning. Each of our five domains requires constructing and traversing a computational dependency graph in a long chain-of-thought. These graphs can be DAGs, search trees, cyclic graphs, constraint graphs, or execution traces. Frontier models struggle with LongCoT problems: even the best model (GPT 5.2) achieves only 9.83% accuracy.
Figure 3: LongCoT problem domains and reasoning structures. (Top) Distribution of subtopics across five domains. (Bottom) Example dependency graphs. Explicit templates present the graph directly in the prompt; Implicit templates require models to discover and navigate latent structure (game trees, constraint satisfaction, simulation). Actual problem graphs are significantly larger and more diverse than these schemat…
Figure 4: Main results on LongCoT-mini (left) and LongCoT (right). LongCoT is extremely challenging, with the best model (GPT 5.2) achieving only 9.83% and open-source models near zero. LongCoT-mini differentiates performance across a wider range of models.
Figure 5: LongCoT domain-specific results are mostly stable across all five domains for a given model. These findings comport with the design goals of LongCoT: rather than deep domain knowledge, LongCoT success demands the ability to reason over long horizons.
Figure 7: RLM evals. GPT-5.2 RLM tested in reasoning-only (solid) and code simulation (dashed) settings. Without sub-agents calling tools, RLM fails to outperform GPT 5.2 alone. With code simulation, gains appear primarily on implicit domains where substantial parts of the dependency structure can be offloaded to programmatic search routines. See Appendix C for analysis.
Figure 6: Accuracy falls as problem DAG sizes grow, inducing planning and execution difficulties before context windows saturate. Our problems increasingly differentiate model capabilities as horizon lengths grow. Under an independent error assumption (GPT 5.2 on Omni-Math), accuracy would be much higher than what we observe, highlighting issues with long-horizon reasoning.
Figure 8: Reasoning trace analysis. The distribution of reasoning spent across behaviors varies substantially by domain and model. Each cell represents 1% of the reasoning trace, read left-to-right, top-to-bottom (see Appendix C for more analysis).
Figure 9: LongCoT-mini domain-specific results provide more signal on model performance. These findings comport with the design goals of LongCoT: rather than deep domain knowledge, LongCoT success demands the ability to reason over long horizons.
Figure 10: GPT-5.2 accuracy and token usage on independent versus composed LongCoT-Math questions. Independent questions are presented in a single prompt with no inter-node dependencies and scored on the same leaf nodes. Accuracy drops sharply when questions are composed while token usage remains comparable, confirming that compositional dependency, not just output length, drives difficulty.
Figure 11: Reasoning trace structure for DeepSeek V3.2 on LongCoT-mini Chemistry Problems. Each trace is segmented into a 10×10 grid (read left-to-right, top-to-bottom), classified into Setup, Planning, Solving, Verification, Stuck, or Backtracking. Correct traces (top) allocate more budget to setup, while incorrect traces (bottom) show visibly more backtracking (orange) and stuck (red) segments.
Original abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces LongCoT, a benchmark of 2,500 expert-designed problems across chemistry, mathematics, computer science, chess, and logic. Each problem has a short input and verifiable answer but requires navigating a graph of interdependent steps spanning tens to hundreds of thousands of reasoning tokens. The authors assert that local steps are individually tractable for frontier models, so end-to-end failures isolate long-horizon CoT limitations. They report that the best models achieve under 10% accuracy (GPT 5.2 at 9.8%, Gemini 3 Pro at 6.1%), positioning LongCoT as a rigorous measure of extended reasoning capabilities.

Significance. If the benchmark successfully isolates long-horizon reasoning from per-step tractability issues, it would provide a valuable, scalable tool for tracking progress on complex autonomous tasks where extended CoT is required. The reported performance gap could usefully direct research toward better planning and state management over long horizons.

major comments (2)
  1. [Abstract] The claim that 'each local step is individually tractable for frontier models' is presented without any reported experiments, such as accuracy on single-step subproblems extracted from the 2,500 problems or ablations that supply intermediate states. This assumption is load-bearing for the central claim that the observed <10% end-to-end accuracy specifically measures long-horizon limitations rather than cumulative per-step error rates or knowledge gaps.
  2. [Abstract] No details are provided on problem verification procedures, construction of inter-step dependency graphs, or controls for other failure modes (e.g., knowledge gaps or local reasoning errors). Without these, the soundness of attributing all failures to horizon length cannot be assessed.
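
The concern about cumulative per-step error can be made quantitative with a back-of-the-envelope baseline. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; the per-step accuracy of 0.99 and the step counts are hypothetical, and only the 9.83% end-to-end figure comes from the reported results.

```python
import math


def independent_baseline(p_step: float, n_steps: int) -> float:
    """End-to-end accuracy if all n_steps succeed independently with probability p_step."""
    return p_step ** n_steps


def implied_step_accuracy(end_to_end: float, n_steps: int) -> float:
    """Per-step accuracy that would explain end_to_end under the same independence assumption."""
    return math.exp(math.log(end_to_end) / n_steps)


if __name__ == "__main__":
    reported = 0.0983  # best reported LongCoT accuracy (GPT 5.2)
    for n in (10, 100, 1000):  # hypothetical dependency-graph sizes
        print(
            f"n={n:5d}  baseline at p=0.99: {independent_baseline(0.99, n):.4f}  "
            f"per-step accuracy implied by 9.83%: {implied_step_accuracy(reported, n):.4f}"
        )
```

If measured single-step accuracy sits well above the implied value for realistic graph sizes, compounding alone does not explain the end-to-end score; if it sits at or below it, the referee's alternative explanation stays open.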

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments. We appreciate the opportunity to clarify and strengthen our presentation of the LongCoT benchmark. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'each local step is individually tractable for frontier models' is presented without any reported experiments, such as accuracy on single-step subproblems extracted from the 2,500 problems or ablations that supply intermediate states. This assumption is load-bearing for the central claim that the observed <10% end-to-end accuracy specifically measures long-horizon limitations rather than cumulative per-step error rates or knowledge gaps.

    Authors: We agree that explicit validation of local step tractability would strengthen the central claim. While the problems were designed by experts with the intent that individual steps are solvable by frontier models (as detailed in the problem construction section of the full manuscript), we did not include quantitative ablations on subproblems. In the revised manuscript, we will add experiments measuring model accuracy on randomly extracted single-step subproblems from the benchmark, as well as an ablation where models are provided with intermediate states to isolate horizon effects.
    Revision: yes

  2. Referee: [Abstract] No details are provided on problem verification procedures, construction of inter-step dependency graphs, or controls for other failure modes (e.g., knowledge gaps or local reasoning errors). Without these, the soundness of attributing all failures to horizon length cannot be assessed.

    Authors: We acknowledge the need for more transparency on these aspects. The full manuscript includes a dedicated section on benchmark construction, where we describe expert design, verification by domain specialists, and how dependency graphs are built to ensure long-horizon requirements. However, to address the concern directly, we will expand this section with additional details on verification procedures, explicit controls for knowledge gaps (e.g., by ensuring all required knowledge is standard for the domains), and analysis of local error rates. This will allow readers to better assess the attribution to horizon length.
    Revision: yes
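
One way the intermediate-state ablation mentioned in the first response could be scored is sketched below. Everything here is a hypothetical harness: `model_step`, the integer states, and the toy chain are stand-ins for illustration, not the authors' protocol. The idea is to compare per-step accuracy when the model carries its own outputs forward against per-step accuracy when each step receives the gold input.

```python
from typing import Callable, List, Tuple

StepFn = Callable[[int], int]          # gold transition: previous state -> next state
ModelFn = Callable[[int, int], int]    # hypothetical model: (step index, input state) -> output


def gold_states(steps: List[StepFn], start: int) -> List[int]:
    """Reference value before the first step and after every step."""
    states = [start]
    for fn in steps:
        states.append(fn(states[-1]))
    return states


def evaluate(model_step: ModelFn, steps: List[StepFn], start: int,
             supply_gold: bool) -> Tuple[float, bool]:
    """Return (per-step accuracy, whether the last output matches the gold final state).

    With supply_gold=True each step sees the gold input, so the second value
    reflects only the final local step, not the whole chain.
    """
    gold = gold_states(steps, start)
    state, hits = start, 0
    for i in range(len(steps)):
        out = model_step(i, gold[i] if supply_gold else state)
        hits += out == gold[i + 1]
        state = out
    return hits / len(steps), state == gold[-1]


# Toy chain of five increments and a model that slips only on step 2.
toy_steps: List[StepFn] = [lambda x: x + 1] * 5
flaky_model: ModelFn = lambda i, s: s + (2 if i == 2 else 1)

if __name__ == "__main__":
    print("free-running:", evaluate(flaky_model, toy_steps, 0, supply_gold=False))
    print("gold inputs: ", evaluate(flaky_model, toy_steps, 0, supply_gold=True))
```

A large gap between the two per-step accuracies would point to error propagation along the chain, while similar values would point to local step difficulty.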

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with direct measurements and no derivation chain

Full rationale

The paper introduces LongCoT as an empirical benchmark consisting of 2,500 problems and reports direct accuracy measurements on frontier models (e.g., GPT 5.2 at 9.8%). No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The statement that 'each local step is individually tractable' is presented as a design premise rather than a derived result, with no self-citations, ansatzes, or uniqueness theorems invoked in any load-bearing way. The work is self-contained as a measurement instrument without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on expert problem design and the assumption that local steps are tractable; no free parameters or new entities are introduced.

axioms (2)
  • Domain assumption: Each local step is individually tractable for frontier models.
    Invoked in the abstract to attribute failures to long-horizon issues.
  • Domain assumption: Problems require navigating a graph of interdependent steps spanning tens to hundreds of thousands of reasoning tokens.
    Core definition of the benchmark problems.

pith-pipeline@v0.9.0 · 5567 in / 1234 out tokens · 42134 ms · 2026-05-10T12:51:59.303093+00:00 · methodology
