CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
Pith reviewed 2026-05-20 05:21 UTC · model grok-4.3
The pith
CopT reverses chain-of-thought by drafting an answer first, then using on-policy reflection and continuous-embedding contrastive verifiers to improve accuracy and cut token usage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CopT elicits a draft answer first and then invokes on-policy thinking conditioned on the draft for reflection and correction. It recasts continuous embeddings as contrastive verifiers that contrast model support for the same tokens under discrete-token and continuous-embedding inputs, producing a sequence-level reverse KL estimator. Analysis shows that under certain assumptions the expected estimate equals the mutual information between unresolved latent state and emitted answer token, so the verifier captures answer-relevant uncertainty. A second KL estimator controls draft visibility during further thinking. This pipeline raises peak accuracy by up to 23 percent and reduces token usage by up to 57 percent.
What carries the argument
The contrastive verifier that produces a sequence-level reverse KL estimator by contrasting the model's support for generated tokens under discrete-token inputs versus continuous-embedding inputs.
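As a rough illustration of the contrast described here, the sketch below sums per-token log-probability differences over one generated sequence. It assumes the per-token log-probabilities under the two input modes have already been obtained from the model. The function name `sequence_log_ratio`, the argument names, and the sign convention (discrete minus continuous) are assumptions made for illustration, not details taken from the paper or its code.

```python
import math
from typing import Sequence


def sequence_log_ratio(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
) -> float:
    """Sum log p_discrete(token) - log p_continuous(token) over one sequence.

    Both inputs hold log-probabilities of the SAME generated tokens,
    scored once with discrete-token inputs and once with
    continuous-embedding inputs (hypothetical upstream step).
    """
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both scorings must cover the same tokens")
    return sum(d - c for d, c in zip(logp_discrete, logp_continuous))


# Toy numbers, not model outputs: two tokens scored under both input modes.
toy_discrete = [math.log(0.9), math.log(0.8)]
toy_continuous = [math.log(0.6), math.log(0.4)]
toy_score = sequence_log_ratio(toy_discrete, toy_continuous)
```

A score near zero means the two input modes support the generated tokens about equally; a larger score means they disagree more. How the paper maps this to a reliability decision is not specified in this summary.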
If this is right
- Peak accuracy rises by up to 23 percent on mathematics, coding, and agentic reasoning tasks.
- Token usage falls by up to 57 percent while maintaining or exceeding baseline accuracy.
- No additional training is required for the gains.
- A second KL estimator dynamically limits visibility of unreliable draft content during reflection.
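The draft-first control flow described in the core claim can be sketched as below. This is a minimal sketch under stated assumptions: `draft_fn`, `score_fn`, `think_fn`, and `threshold` are hypothetical stand-ins, the rule "a higher score means a less reliable draft" is an assumed reading of the verifier, and the second estimator that gates draft visibility is omitted entirely.

```python
from typing import Callable, Tuple


def draft_first_answer(
    question: str,
    draft_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    think_fn: Callable[[str, str], str],
    threshold: float,
) -> Tuple[str, bool]:
    """Return (answer, used_further_thinking).

    1. Elicit a draft answer without prior thinking.
    2. Score the draft with a verifier (assumed: higher = less reliable).
    3. Keep the draft if the score is at or below the threshold;
       otherwise run further thinking conditioned on the draft.
    """
    draft = draft_fn(question)
    score = score_fn(question, draft)
    if score <= threshold:
        return draft, False
    return think_fn(question, draft), True
```

The token saving claimed above would come from step 3: whenever the draft is accepted, the extended thinking pass is skipped.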
Where Pith is reading between the lines
- The draft-first order may reduce wasteful computation in any sequential generation setting where partial answers become available early.
- Continuous-space contrastive checks could be tested on tasks that already use embedding-based retrieval or planning.
Load-bearing premise
The expected reverse KL estimate equals the mutual information between the unresolved latent state and the emitted answer token under the paper's stated assumptions.
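To see why an identity of this shape is plausible, here is a worked sketch under assumptions that are not confirmed by this summary. Assume a latent state $S$ with weights $w(s)$, a per-state answer distribution $P_s(a)$, a continuous-embedding input that induces the mixture $\bar{P}(a) = \sum_s w(s) P_s(a)$, and a per-sample score $\kappa(s, a) = \log \frac{P_s(a)}{\bar{P}(a)}$. The mixture form and the definition of $\kappa$ are illustrative guesses; the paper's own assumptions may differ.

```latex
\begin{aligned}
\mathbb{E}_{S \sim w,\, A \sim P_S}\big[\kappa(S, A)\big]
  &= \sum_{s} w(s) \sum_{a} P_s(a) \log \frac{P_s(a)}{\bar{P}(a)} \\
  &= \sum_{s} w(s)\, \mathrm{KL}\big(P_s \,\|\, \bar{P}\big) \\
  &= \sum_{s, a} p(s, a) \log \frac{p(s, a)}{p(s)\, p(a)}
   = I(S; A),
\end{aligned}
\qquad \text{where } p(s, a) = w(s) P_s(a),\; p(s) = w(s),\; p(a) = \bar{P}(a).
```

Under these assumptions the expectation is zero exactly when every state induces the same answer distribution, which is the sense in which the score would ignore latent uncertainty that does not affect the answer.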
What would settle it
Measure whether the contrastive verifier's score correlates more strongly with final-answer correctness than with latent factors unrelated to the answer. A clearly stronger correlation with correctness would support the claim; a comparable or weaker one would undermine it.
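One simple way to run such a check is to compare correlation strengths directly. The sketch below computes a Pearson correlation between verifier scores and a binary correctness label, and again between the scores and an unrelated factor. All data are made-up toy values, and the names `pearson`, `toy_scores`, `toy_correct`, and `toy_nuisance` are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy data only: one verifier score per problem.
toy_scores = [0.1, 0.2, 1.5, 2.0]
toy_correct = [1, 1, 0, 0]    # 1 = final answer correct
toy_nuisance = [0, 1, 0, 1]   # a factor unrelated to the answer

r_correct = pearson(toy_scores, toy_correct)
r_nuisance = pearson(toy_scores, toy_nuisance)
score_tracks_correctness = abs(r_correct) > abs(r_nuisance)
```

A real test would need many problems, several unrelated factors, and an uncertainty estimate on the gap between the two correlations.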
Original abstract
Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CopT, a reformulated LLM reasoning pipeline that reverses standard chain-of-thought by first eliciting a draft answer and then performing subsequent on-policy thinking conditioned on that draft. It recasts continuous embeddings as inference-time contrastive verifiers that contrast discrete-token and continuous-embedding support to yield a sequence-level reverse KL estimator for answer reliability. The abstract states that under certain assumptions this estimator equals the mutual information between the unresolved latent state and the answer token. A second KL estimator controls draft visibility during further thinking. Reported results across mathematics, coding, and agentic tasks show peak accuracy gains up to 23% and token reductions up to 57% at comparable or higher accuracy, with no additional training and code released.
Significance. If the theoretical equivalence holds under the operating regimes and the empirical gains prove robust to controls, the work could meaningfully advance training-free methods for efficient reasoning by reducing performative thinking and token overhead. The public code release is a clear strength that supports reproducibility.
major comments (2)
- [Abstract (contrastive verifier analysis paragraph)] Abstract, paragraph describing the contrastive verifier analysis: the claim that 'under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token' is load-bearing for interpreting the verifier as capturing answer-relevant uncertainty rather than arbitrary latent uncertainty. The assumptions are not stated explicitly and no derivation or proof sketch is supplied, leaving open whether the discrete-to-continuous embedding contrast satisfies them on the evaluated LLMs and tasks.
- [§4 (Experiments)] Experimental results (abstract and §4): the claims of up to 23% accuracy improvement and 57% token reduction are presented without error bars, statistical tests, or explicit controls for prompting effects and baseline differences. This makes it difficult to attribute gains specifically to the reverse KL verifier mechanism rather than incidental factors.
minor comments (2)
- [§3] The notation and definitions for the two KL estimators would benefit from explicit equations in the main text rather than prose description only.
- [Abstract] Minor typos and inconsistent capitalization appear in the abstract (e.g., 'on-policy thinking' vs. 'On-Policy Thinking').
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract (contrastive verifier analysis paragraph)] Abstract, paragraph describing the contrastive verifier analysis: the claim that 'under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token' is load-bearing for interpreting the verifier as capturing answer-relevant uncertainty rather than arbitrary latent uncertainty. The assumptions are not stated explicitly and no derivation or proof sketch is supplied, leaving open whether the discrete-to-continuous embedding contrast satisfies them on the evaluated LLMs and tasks.
  Authors: We agree that the assumptions and derivation should be stated more explicitly. The key assumptions are that the continuous embedding serves as a sufficient statistic for the unresolved latent state and that the contrast is performed over identical token sequences. In the revision we will add an explicit statement of these assumptions to the abstract and include a concise proof sketch in Section 3 (or an appendix) showing how the expected reverse KL equals the mutual information under those conditions. This will also clarify applicability to the evaluated models and tasks.
  Revision: yes
- Referee: [§4 (Experiments)] Experimental results (abstract and §4): the claims of up to 23% accuracy improvement and 57% token reduction are presented without error bars, statistical tests, or explicit controls for prompting effects and baseline differences. This makes it difficult to attribute gains specifically to the reverse KL verifier mechanism rather than incidental factors.
  Authors: We acknowledge that the current results would be more convincing with statistical rigor and controls. In the revised manuscript we will report means and standard deviations across multiple random seeds, include statistical significance tests for the accuracy gains, and add explicit controls that vary only the prompting structure while holding other factors fixed. These additions will help isolate the contribution of the reverse KL verifier.
  Revision: yes
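The reporting promised in this response (means, standard deviations across seeds, and a significance test) could look roughly like the sketch below. It uses only the Python standard library and an exact paired sign-flip permutation test, which is one reasonable choice and not necessarily the test the authors would use. The accuracy values are invented toy numbers, not results from the paper, and the function names are hypothetical.

```python
import itertools
import statistics
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign-flip test on per-seed paired differences.

    Enumerates all 2^n sign patterns, so it is only practical for a
    small number of seeds.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty samples of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / total


# Toy per-seed accuracies (same seeds for both systems), not paper results.
toy_method = [0.70, 0.72, 0.68, 0.71]
toy_baseline = [0.60, 0.63, 0.61, 0.62]

method_mean, method_std = summarize(toy_method)
p_value = paired_sign_flip_pvalue(toy_method, toy_baseline)
```

With only four seeds the smallest achievable two-sided p-value is 2/16, which illustrates why the number of seeds matters as much as the choice of test.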
Circularity Check
No significant circularity detected in derivation chain
The paper introduces a reversed reasoning pipeline (draft answer first, then on-policy thinking) and contrastive verifiers based on sequence-level reverse KL between discrete-token and continuous-embedding inputs. The key claim is that an analysis shows the expected reverse KL equals mutual information between unresolved latent state and answer token under certain assumptions, which is offered as justification for why the verifier targets answer-relevant uncertainty. This is presented as a derived explanatory result rather than a definitional identity or a parameter fitted to the target quantity. No equations are quoted that reduce the estimator to its inputs by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are evident. The reported accuracy and token-efficiency gains are empirical outcomes of the new pipeline without training, making the overall derivation self-contained against external benchmarks such as standard CoT.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token.
invented entities (1)
- Continuous-embedding contrastive verifier (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "Under Assumption 1 (Mixture-linear continuous prefix) ... $\mathbb{E}_{S \sim w,\, A \sim P_S}[\kappa(S, A)] = I(S; A)$"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "continuous embeddings ... sequence-level reverse KL estimator"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.