
arxiv: 2605.20075 · v1 · pith:HHK6FSKDnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Pith reviewed 2026-05-20 05:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: chain-of-thought · contrastive verifier · on-policy thinking · continuous embeddings · reverse KL estimator · LLM reasoning · agentic tasks · token efficiency

The pith

CopT reverses chain-of-thought: it drafts an answer first, then uses on-policy reflection and continuous-embedding contrastive verifiers to improve accuracy and cut token usage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CopT reformulates reasoning so models produce a draft answer before performing further thinking conditioned on that draft for correction and reflection. This order avoids performative reasoning where models think even after they could answer correctly. The method treats continuous embeddings as inference-time contrastive verifiers that compare token support under discrete versus continuous inputs to estimate answer reliability via a reverse KL quantity. Under stated assumptions this estimate equals mutual information between unresolved latent state and answer token. Experiments across mathematics, coding, and agentic tasks report up to 23 percent higher peak accuracy and up to 57 percent lower token use at equal or better accuracy without any training.
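The control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `draft_fn`, `score_fn`, `think_fn`, the `CopTResult` type, and the convention that a higher score means a less reliable draft are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CopTResult:
    answer: str
    used_thinking: bool
    score: float


def draft_then_think(
    question: str,
    draft_fn: Callable[[str], str],          # hypothetical: produces a draft answer
    score_fn: Callable[[str, str], float],   # hypothetical: unreliability score
    think_fn: Callable[[str, str], str],     # hypothetical: reflection on the draft
    threshold: float,
) -> CopTResult:
    """Draft first; spend tokens on further thinking only if the draft looks unreliable."""
    draft = draft_fn(question)
    score = score_fn(question, draft)  # assumed convention: higher = less reliable
    if score <= threshold:
        # Draft is trusted, so no extended thinking is generated.
        return CopTResult(answer=draft, used_thinking=False, score=score)
    revised = think_fn(question, draft)
    return CopTResult(answer=revised, used_thinking=True, score=score)


# Toy stand-ins for the three model calls.
result = draft_then_think(
    "2+2?",
    draft_fn=lambda q: "4",
    score_fn=lambda q, d: 0.05,
    think_fn=lambda q, d: d,
    threshold=0.1,
)
```

With a trusted draft the function returns immediately, which is where the token savings would come from under this reading.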

Core claim

CopT elicits a draft answer first and then invokes on-policy thinking conditioned on the draft for reflection and correction. It recasts continuous embeddings as contrastive verifiers that contrast model support for the same tokens under discrete-token and continuous-embedding inputs, producing a sequence-level reverse KL estimator. Analysis shows that under certain assumptions the expected estimate equals the mutual information between unresolved latent state and emitted answer token, so the verifier captures answer-relevant uncertainty. A second KL estimator controls draft visibility during further thinking. This pipeline raises peak accuracy by up to 23 percent and reduces tokens by up to

What carries the argument

The contrastive verifier that produces a sequence-level reverse KL estimator by contrasting the model's support for generated tokens under discrete-token inputs versus continuous-embedding inputs.
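As a rough sketch of what such an estimator could look like: if the answer tokens are sampled from the model under discrete-token inputs, the average per-token log-probability gap between the two input modes is a single-sample Monte Carlo estimate of a KL divergence from the discrete-input distribution to the continuous-input one. The direction of the KL, the length normalization, and the function name are assumptions of this example; the source does not give the paper's exact formula.

```python
import math
from typing import Sequence


def sequence_reverse_kl(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
) -> float:
    """Length-normalized log-probability gap over the same generated tokens.

    logp_discrete[t]   : log-prob of token t given discrete-token inputs
    logp_continuous[t] : log-prob of the same token given continuous-embedding inputs
    """
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both sequences must score the same tokens")
    if not logp_discrete:
        raise ValueError("need at least one token")
    gaps = [d - c for d, c in zip(logp_discrete, logp_continuous)]
    return sum(gaps) / len(gaps)


# Toy example: the two input modes agree on token 1 and disagree on token 2.
kl_hat = sequence_reverse_kl(
    [math.log(0.9), math.log(0.8)],
    [math.log(0.9), math.log(0.4)],
)
```

Because it is a single-sample estimate, the value can be negative for an individual sequence even though the underlying KL divergence is non-negative.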

If this is right

  • Peak accuracy rises by up to 23 percent on mathematics, coding, and agentic reasoning tasks.
  • Token usage falls by up to 57 percent while maintaining or exceeding baseline accuracy.
  • No additional training is required for the gains.
  • A second KL estimator dynamically limits visibility of unreliable draft content during reflection.
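The last point, draft visibility gated by a second estimator, might be pictured as a simple per-chunk decision. The sketch below is hypothetical: the threshold name `tau_r`, the rule "hide the draft when the chunk score exceeds the threshold", and the way the context is assembled are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def visibility_schedule(chunk_scores: Sequence[float], tau_r: float) -> List[bool]:
    """One decision per thinking chunk: True means the draft stays visible."""
    # Assumed rule: a score above tau_r marks the draft as too unreliable to show.
    return [score <= tau_r for score in chunk_scores]


def build_context(
    question: str,
    draft: str,
    thinking_chunks: Sequence[str],
    draft_visible: bool,
) -> str:
    """Assemble the text the model conditions on for the next thinking chunk."""
    parts = [question]
    if draft_visible:
        parts.append(f"Draft answer: {draft}")
    parts.extend(thinking_chunks)
    return "\n".join(parts)


schedule = visibility_schedule([0.1, 0.5, 0.2], tau_r=0.3)
context = build_context("Q: 2+2?", "5", ["Let me recheck."], draft_visible=schedule[1])
```

Under this rule, a lower `tau_r` hides the draft more often, while a higher one keeps it in view.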

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The draft-first order may reduce wasteful computation in any sequential generation setting where partial answers become available early.
  • Continuous-space contrastive checks could be tested on tasks that already use embedding-based retrieval or planning.

Load-bearing premise

The expected reverse KL estimate equals the mutual information between the unresolved latent state and the emitted answer token under the paper's stated assumptions.
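For orientation, the standard identity that a claim of this shape would rest on is written out below. Here $Z$ stands for the unresolved latent state and $Y$ for the answer token. Reading the discrete-input distribution as $p(y \mid z)$ and the continuous-input distribution as the marginal $p(y)$ is an assumption of this sketch; the paper's precise assumptions are not given in the material above.

```latex
\begin{align}
I(Z;Y) &= \sum_{z,y} p(z,y)\,\log\frac{p(z,y)}{p(z)\,p(y)} \\
       &= \sum_{z} p(z) \sum_{y} p(y\mid z)\,\log\frac{p(y\mid z)}{p(y)} \\
       &= \mathbb{E}_{z\sim p(z)}\!\left[\mathrm{KL}\!\left(p(y\mid z)\,\|\,p(y)\right)\right],
\qquad p(y)=\sum_{z} p(z)\,p(y\mid z).
\end{align}
```

Under this reading, the expected contrast is zero whenever $Y$ is independent of $Z$, which is consistent with the claim that the verifier responds to answer-relevant uncertainty rather than to arbitrary uncertainty in the latent state.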

What would settle it

Measure whether the contrastive verifier's score correlates more strongly with final answer correctness than with arbitrary latent factors unrelated to the answer; if the correlation is higher for correctness then the claim holds, otherwise it does not.
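One way to run such a check, sketched under assumptions: score each example with the verifier, then compare how well the score separates wrong drafts from right ones against how well it separates examples by some answer-irrelevant attribute. AUROC is used here as the separation measure (a swapped-in choice, not one named in the source), and the labels and scores are made-up toy values.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a positive example outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): verifier scores, draft-is-wrong labels,
# and an answer-irrelevant attribute such as "prompt is long".
scores = [0.9, 0.8, 0.2, 0.1]
is_wrong = [1, 1, 0, 0]
is_long_prompt = [1, 0, 1, 0]

auc_correctness = auroc(scores, is_wrong)
auc_irrelevant = auroc(scores, is_long_prompt)
```

The claim would be supported if `auc_correctness` is consistently well above `auc_irrelevant` across models and tasks; a single toy comparison like this one proves nothing by itself.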

Figures

Figures reproduced from arXiv: 2605.20075 by Dachuan Shi, Hanlin Zhu, Kejing Xia, Wanjia Zhao, Wenke Lee, Wen Xiao, Xiangchi Yuan.

Figure 1: (a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by ∗, across mathematics, coding, and agentic reasoning tasks and nearly halves token usage at matched accuracy.

Figure 2: CopT starts with a draft answer and performs on-policy thinking conditioned on it. It contrasts the model's support for the same chosen tokens under discrete and continuous inputs to estimate draft answer reliability, and during thinking, chunk by chunk, to determine the visibility of the draft answer across time steps.

Figure 3: Left and center: Controllable reasoning effort by …

Figure 4: Left: The estimator κ_a identifies erroneous drafts more precisely than uniform selection. Right: The threshold τ_r trades off correction rate and token usage by controlling draft visibility.
Original abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CopT, a reformulated LLM reasoning pipeline that reverses standard chain-of-thought by first eliciting a draft answer and then performing subsequent on-policy thinking conditioned on that draft. It recasts continuous embeddings as inference-time contrastive verifiers that contrast discrete-token and continuous-embedding support to yield a sequence-level reverse KL estimator for answer reliability. The abstract states that under certain assumptions this estimator equals the mutual information between the unresolved latent state and the answer token. A second KL estimator controls draft visibility during further thinking. Reported results across mathematics, coding, and agentic tasks show peak accuracy gains up to 23% and token reductions up to 57% at comparable or higher accuracy, with no additional training and code released.

Significance. If the theoretical equivalence holds under the operating regimes and the empirical gains prove robust to controls, the work could meaningfully advance training-free methods for efficient reasoning by reducing performative thinking and token overhead. The public code release is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract (contrastive verifier analysis paragraph)] The claim that 'under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token' is load-bearing for interpreting the verifier as capturing answer-relevant uncertainty rather than arbitrary latent uncertainty. The abstract does not state the assumptions explicitly and supplies no derivation or proof sketch, leaving open whether the discrete-to-continuous embedding contrast satisfies them on the evaluated LLMs and tasks.
  2. [§4 (Experiments)] The claims of up to 23% accuracy improvement and 57% token reduction (abstract and §4) are presented without error bars, statistical tests, or explicit controls for prompting effects and baseline differences. Without these, it is difficult to attribute the gains specifically to the reverse KL verifier mechanism rather than to incidental factors.
minor comments (2)
  1. [§3] The notation and definitions for the two KL estimators would benefit from explicit equations in the main text rather than prose description only.
  2. [Abstract] Minor typos and inconsistent capitalization appear in the abstract (e.g., 'on-policy thinking' vs. 'On-Policy Thinking').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract (contrastive verifier analysis paragraph)] The claim that 'under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token' is load-bearing for interpreting the verifier as capturing answer-relevant uncertainty rather than arbitrary latent uncertainty. The abstract does not state the assumptions explicitly and supplies no derivation or proof sketch, leaving open whether the discrete-to-continuous embedding contrast satisfies them on the evaluated LLMs and tasks.

    Authors: We agree that the assumptions and derivation should be stated more explicitly. The key assumptions are that the continuous embedding serves as a sufficient statistic for the unresolved latent state and that the contrast is performed over identical token sequences. In the revision we will add an explicit statement of these assumptions to the abstract and include a concise proof sketch in Section 3 (or an appendix) showing how the expected reverse KL equals the mutual information under those conditions. This will also clarify applicability to the evaluated models and tasks. revision: yes

  2. Referee: [§4 (Experiments)] The claims of up to 23% accuracy improvement and 57% token reduction (abstract and §4) are presented without error bars, statistical tests, or explicit controls for prompting effects and baseline differences. Without these, it is difficult to attribute the gains specifically to the reverse KL verifier mechanism rather than to incidental factors.

    Authors: We acknowledge that the current results would be more convincing with statistical rigor and controls. In the revised manuscript we will report means and standard deviations across multiple random seeds, include statistical significance tests for the accuracy gains, and add explicit controls that vary only the prompting structure while holding other factors fixed. These additions will help isolate the contribution of the reverse KL verifier. revision: yes
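The kind of reporting promised in this response could look like the sketch below: a mean and sample standard deviation over repeated runs, plus an exact paired sign-flip (permutation) test on per-seed accuracy differences. The choice of test, the function names, and the numbers are illustrative assumptions; the rebuttal does not say which test the authors would use.

```python
import statistics
from itertools import product
from typing import Sequence, Tuple


def mean_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def sign_flip_p_value(diffs: Sequence[float]) -> float:
    """Exact two-sided paired permutation test on per-seed differences.

    Under the null of no difference, each paired difference is equally
    likely to have either sign, so all 2^n sign patterns are enumerated.
    """
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Hypothetical per-seed accuracy gains (method minus baseline), five seeds.
gains = [0.02, 0.03, 0.01, 0.04, 0.02]
avg_gain, std_gain = mean_std(gains)
p_value = sign_flip_p_value(gains)
```

With only five seeds the smallest achievable two-sided p-value is 2/32, so a handful of runs cannot establish significance at the usual 0.05 level no matter how consistent the gains are.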

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper introduces a reversed reasoning pipeline (draft answer first, then on-policy thinking) and contrastive verifiers based on sequence-level reverse KL between discrete-token and continuous-embedding inputs. The key claim is that an analysis shows the expected reverse KL equals mutual information between unresolved latent state and answer token under certain assumptions, which is offered as justification for why the verifier targets answer-relevant uncertainty. This is presented as a derived explanatory result rather than a definitional identity or a parameter fitted to the target quantity. No equations are quoted that reduce the estimator to its inputs by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are evident. The reported accuracy and token-efficiency gains are empirical outcomes of the new pipeline without training, making the overall derivation self-contained against external benchmarks such as standard CoT.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on an unspecified set of assumptions that equate the reverse KL estimator to mutual information between latent state and answer token; no free parameters or new physical entities are mentioned.

axioms (1)
  • [domain assumption] Under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token.
    This equivalence is invoked to argue that the contrastive verifier captures answer-relevant uncertainty.
invented entities (1)
  • Continuous-embedding contrastive verifier (no independent evidence)
    Purpose: to produce a sequence-level reverse KL estimator for draft-answer reliability by contrasting discrete-token and continuous-embedding inputs.
    Introduced as the core inference-time mechanism for deciding whether further on-policy thinking is required.

pith-pipeline@v0.9.0 · 5852 in / 1411 out tokens · 45560 ms · 2026-05-20T05:21:00.985241+00:00 · methodology


