LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

2026 · cs.LG · arXiv 2604.14140

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.

representative citing papers

SentinelBench: A Benchmark for Long-Running Monitoring Agents

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

SentinelBench is a new benchmark for time-evolving monitoring tasks in web environments, measuring task completion, reaction time, and resource use with baselines from three models and two harnesses.

citing papers explorer

Showing 1 of 1 citing paper.

SentinelBench: A Benchmark for Long-Running Monitoring Agents

cs.AI · 2026-06-03 · unverdicted · none · ref 7 · internal anchor
