ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

· 2026 · cs.AI · arXiv 2605.05737

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a harness system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across six reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and that the investigated LLMs accept a wrong answer in at least 76% of cases. The ReFlect harness achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six models spanning small and frontier scale, with per-model gains over Direct CoT ranging from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5. It additionally raises SWE-bench patch-structural quality from 0% (Direct CoT) to between 82% (Qwen2.5-72B) and 87% (GPT-4o). Notably, the harness gain is negatively correlated with the model's Direct CoT task success rate (fitted slope -1.69, r = -0.76): each pp lost in baseline success rate is mechanically recovered by 1.69 pp of harness gain. We find that adding structured reasoning state and operators yields only 15.0-18.7% pair-mean on Llama-3.3-70B and Qwen2.5-72B, because models at this scale cannot reliably populate the state the operators require. ReFlect is model-agnostic, training-free, and operates entirely at inference time.
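To make the idea of a deterministic wrapper concrete, the following is a minimal Python sketch of error detection and recovery living outside the model rather than in the prompt. It is not the paper's implementation: `run_with_harness`, `stub_model`, `is_integer`, and the retry message format are all hypothetical names and choices introduced here for illustration, and the stub stands in for a real LLM call.

```python
from typing import Callable, Optional, Tuple


def run_with_harness(
    model: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call `model`, validate the answer with a deterministic `check`,
    and retry with the detected error appended to the prompt.

    `check` returns None when the answer passes, else an error message.
    Returns the last answer and the number of attempts used.
    """
    current = prompt
    answer = ""
    for attempt in range(1, max_attempts + 1):
        answer = model(current)
        error = check(answer)  # detection happens in code, not in the LLM
        if error is None:
            return answer, attempt
        # Recovery: feed the concrete failure back to the model.
        current = (
            f"{prompt}\n\nPrevious answer: {answer}\n"
            f"Detected problem: {error}\nPlease fix it."
        )
    return answer, max_attempts


# Hypothetical stand-in for an LLM: fails first, succeeds after feedback.
def stub_model(prompt: str) -> str:
    return "4" if "Detected problem" in prompt else "five"


# Hypothetical deterministic check: the answer must be an integer literal.
def is_integer(answer: str) -> Optional[str]:
    text = answer.strip().lstrip("-")
    return None if text.isdigit() else "answer is not an integer"


result, attempts = run_with_harness(stub_model, is_integer, "What is 2 + 2?")
print(result, attempts)
```

The design point the sketch tries to capture is that the check is ordinary code, so it cannot be talked into accepting a wrong answer the way a self-critique prompt can. What the actual ReFlect harness checks, and how it recovers, is not specified in the abstract.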

representative citing papers

DAR: Deontic Reasoning with Agentic Harnesses

cs.CL · 2026-06-03 · unverdicted · novelty 4.0

DAR lets LLMs interact dynamically with statutes for deontic reasoning, improving results on hard DeonticBench subsets but with uneven gains and higher token use for weaker models.

citing papers explorer

Showing 1 of 1 citing paper.

DAR: Deontic Reasoning with Agentic Harnesses

cs.CL · 2026-06-03 · unverdicted · none · ref 17 · internal anchor
