LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Pith reviewed 2026-05-10 12:51 UTC · model grok-4.3
The pith
Frontier language models achieve under 10 percent accuracy on problems that demand sustained reasoning across tens to hundreds of thousands of tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongCoT is a benchmark of 2,500 expert-designed problems whose solutions form graphs of interdependent steps spanning tens to hundreds of thousands of reasoning tokens. Because each local step is stated to be tractable for frontier models, the paper takes the observed accuracies below 10 percent as evidence that current systems lack reliable long-horizon chain-of-thought reasoning.
What carries the argument
The LongCoT benchmark, a collection of problems with short inputs, verifiable answers, and solution paths that form graphs of interdependent steps whose cumulative length reaches very large token counts.
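To make "a graph of interdependent steps" concrete, here is a minimal Python sketch. The step names and edges are hypothetical and are not taken from the benchmark. The point is only that the final answer sits downstream of every step, so a single wrong intermediate value invalidates everything that depends on it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key depends on the steps in its set.
deps = {
    "a": set(),
    "b": {"a"},
    "c": {"a"},
    "d": {"b", "c"},
    "answer": {"d"},
}

def solve_order(deps):
    """Return one valid order in which the steps can be worked through."""
    return list(TopologicalSorter(deps).static_order())

def downstream(deps, failed):
    """Return every step that transitively depends on `failed`."""
    affected = set()
    frontier = [failed]
    while frontier:
        cur = frontier.pop()
        for step, parents in deps.items():
            if cur in parents and step not in affected:
                affected.add(step)
                frontier.append(step)
    return affected

order = solve_order(deps)        # "a" comes first, "answer" comes last
tainted = downstream(deps, "b")  # steps invalidated if "b" is wrong
print(order, tainted)
```

In this toy graph an error at step "b" taints "d" and "answer" but leaves "c" intact, which is the kind of state a solver has to track over a long chain.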
Load-bearing premise
That each local step is individually tractable for frontier models, allowing failures to be attributed specifically to long-horizon reasoning limitations rather than knowledge gaps or other factors.
What would settle it
A model that scores above 50 percent on the full LongCoT problems while also scoring near 100 percent when tested only on the isolated local steps used inside those problems would show that the benchmark does not isolate long-horizon limitations.
Original abstract
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongCoT, a benchmark of 2,500 expert-designed problems across chemistry, mathematics, computer science, chess, and logic. Each problem has a short input and verifiable answer but requires navigating a graph of interdependent steps spanning tens to hundreds of thousands of reasoning tokens. The authors assert that local steps are individually tractable for frontier models, so end-to-end failures isolate long-horizon CoT limitations. They report that the best models achieve under 10% accuracy (GPT 5.2 at 9.8%, Gemini 3 Pro at 6.1%), positioning LongCoT as a rigorous measure of extended reasoning capabilities.
Significance. If the benchmark successfully isolates long-horizon reasoning from per-step tractability issues, it would provide a valuable, scalable tool for tracking progress on complex autonomous tasks where extended CoT is required. The reported performance gap could usefully direct research toward better planning and state management over long horizons.
major comments (2)
- [Abstract] The claim that 'each local step is individually tractable for frontier models' is presented without any reported experiments, such as accuracy on single-step subproblems extracted from the 2,500 problems or ablations that supply intermediate states. This assumption is load-bearing for the central claim that the observed <10% end-to-end accuracy specifically measures long-horizon limitations rather than cumulative per-step error rates or knowledge gaps.
- [Abstract] No details are provided on problem verification procedures, construction of inter-step dependency graphs, or controls for other failure modes (e.g., knowledge gaps or local reasoning errors). Without these, the soundness of attributing all failures to horizon length cannot be assessed.
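The concern about cumulative per-step error can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that steps fail independently with a fixed per-step accuracy p, so a chain of n steps succeeds with probability p**n. Neither the independence assumption nor the numbers come from the paper.

```python
import math

def end_to_end_accuracy(p, n):
    """Probability that n independent steps, each correct with probability p, all succeed."""
    return p ** n

def steps_until(p, target):
    """Smallest n for which p**n drops to or below `target` (requires 0 < p < 1)."""
    return math.ceil(math.log(target) / math.log(p))

# Illustrative per-step accuracies, not measured values.
for p in (0.99, 0.999):
    n = steps_until(p, 0.10)
    print(f"p={p}: end-to-end accuracy falls to 10% after about {n} steps")
```

Under these assumptions, 99 percent per-step accuracy is enough to push end-to-end accuracy below 10 percent after 230 steps. This is why measured per-step accuracy matters for interpreting the headline numbers: "tractable" steps can still compound into low end-to-end scores.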
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We appreciate the opportunity to clarify and strengthen our presentation of the LongCoT benchmark. Below we respond point-by-point to the major comments.
- Referee: [Abstract] The claim that 'each local step is individually tractable for frontier models' is presented without any reported experiments, such as accuracy on single-step subproblems extracted from the 2,500 problems or ablations that supply intermediate states. This assumption is load-bearing for the central claim that the observed <10% end-to-end accuracy specifically measures long-horizon limitations rather than cumulative per-step error rates or knowledge gaps.
  Authors: We agree that explicit validation of local step tractability would strengthen the central claim. While the problems were designed by experts with the intent that individual steps are solvable by frontier models (as detailed in the problem construction section of the full manuscript), we did not include quantitative ablations on subproblems. In the revised manuscript, we will add experiments measuring model accuracy on randomly extracted single-step subproblems from the benchmark, as well as an ablation where models are provided with intermediate states to isolate horizon effects.
  revision: yes
- Referee: [Abstract] No details are provided on problem verification procedures, construction of inter-step dependency graphs, or controls for other failure modes (e.g., knowledge gaps or local reasoning errors). Without these, the soundness of attributing all failures to horizon length cannot be assessed.
  Authors: We acknowledge the need for more transparency on these aspects. The full manuscript includes a dedicated section on benchmark construction, where we describe expert design, verification by domain specialists, and how dependency graphs are built to ensure long-horizon requirements. However, to address the concern directly, we will expand this section with additional details on verification procedures, explicit controls for knowledge gaps (e.g., by ensuring all required knowledge is standard for the domains), and analysis of local error rates. This will allow readers to better assess the attribution to horizon length.
  revision: yes
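The intermediate-state ablation mentioned in the rebuttal could be tallied along the following lines. This is a hedged sketch: the record format, the condition names "full" and "hinted", and every value in `records` are made up for illustration and are not results from the paper.

```python
from collections import defaultdict

# Hypothetical evaluation records: (problem_id, condition, correct).
# "full"   = solve from the short input alone.
# "hinted" = intermediate states are supplied to the model.
records = [
    ("p1", "full", False), ("p1", "hinted", True),
    ("p2", "full", False), ("p2", "hinted", True),
    ("p3", "full", True),  ("p3", "hinted", True),
    ("p4", "full", False), ("p4", "hinted", False),
]

def accuracy_by_condition(records):
    """Return the fraction of correct answers for each condition."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _problem_id, condition, correct in records:
        totals[condition] += 1
        hits[condition] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}

acc = accuracy_by_condition(records)
gap = acc["hinted"] - acc["full"]
print(acc, gap)
```

A large gap between the two conditions would support attributing failures to horizon length, while a small gap would point toward local errors or knowledge gaps.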
Circularity Check
No circularity: empirical benchmark paper with direct measurements and no derivation chain
The paper introduces LongCoT as an empirical benchmark consisting of 2,500 problems and reports direct accuracy measurements on frontier models (e.g., GPT 5.2 at 9.8%). No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The statement that 'each local step is individually tractable' is presented as a design premise rather than a derived result, with no self-citations, ansatzes, or uniqueness theorems invoked in any load-bearing way. The work is self-contained as a measurement instrument without circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Each local step is individually tractable for frontier models
- domain assumption: Problems require navigating a graph of interdependent steps spanning tens to hundreds of thousands of reasoning tokens