arxiv: 2605.07021 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

Christopher Z. Cui , Taylor W. Killian , Prithviraj Ammanabrolu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords behavior cuesLLM reasoningscalable oversightmonitorabilityAI safetycontrollabilityreasoning efficiencyreinforcement learning

0 comments

The pith

Training LLMs to emit Behavior Cues before behaviors makes reasoning monitorable, allowing recovery of safe actions from 80% of unsafe traces and doubling success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Behavior Cue Reasoning to make LLM reasoning more controllable by having models emit special token sequences right before specific behaviors. These cues provide signals that external monitors can use to prune inefficient or unsafe reasoning paths early. In tests, this setup lets monitors cut up to half the wasted tokens in math problems and recover safe outcomes in most cases that would otherwise fail, all while keeping the original performance level. If true, it offers a way to build oversight into the reasoning process itself rather than only checking the final output.

Core claim

The central discovery is that fine-tuning LLMs to produce Behavior Cues immediately prior to target behaviors provides sufficient information in a compressed form for monitors to improve efficiency by pruning 50% of reasoning tokens in math tasks and to enhance safety by recovering 80% of unsafe action proposals, thereby raising success rates from 46% to 96% in constrained environments without any loss to base reasoning capability.

What carries the argument

Behavior Cues: special token sequences that the model is trained to emit immediately before specific behaviors, serving as signals and control points for external monitoring and intervention.

If this is right

Monitors can prune up to 50% of wasted reasoning tokens in complex math problem solving based on cue information alone.
An almost optimal rule-based monitor recovers safe actions from 80% of traces ending in unsafe proposals, more than doubling success from 46% to 96%.
The approach works across two model families and three domains with no degradation to core performance.
Training the monitored model to reason in a more monitorable way advances scalable oversight techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If Behavior Cues prove stable under fine-tuning, they could be applied to guide reasoning in open-ended tasks where current monitors struggle.
The compressed cue view suggests that full reasoning traces may contain redundant information for oversight purposes, potentially reducing monitoring compute needs.
Combining this with other oversight methods might allow handling of more complex misalignments that emerge only at the end of long reasoning chains.

Load-bearing premise

LLMs can be fine-tuned to emit Behavior Cues right before the intended behaviors without any drop in their main reasoning quality, and that monitors can make effective decisions using only the information from those cues.

What would settle it

A result where the fine-tuned model emits cues inconsistently or at incorrect times, or where cue-based monitoring shows no significant improvement in pruning efficiency or safety recovery rates compared to monitoring full traces.

Figures

Figures reproduced from arXiv: 2605.07021 by Christopher Z. Cui, Prithviraj Ammanabrolu, Taylor W. Killian.

**Figure 2.** Figure 2: Validation success rate and percent of maximum possible token savings over the course [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations results in failure, \ours allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that \bcreasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code to be released at https://github.com/christopherzc/text-games

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Training LLMs to emit Behavior Cues before key steps makes oversight cheaper and more effective, with 50% token pruning and doubled safety success in the reported tests.

read the letter

The core finding is that fine-tuning models to insert explicit Behavior Cues right before target behaviors gives monitors a compact, usable signal. In their math domain this cuts wasted tokens by half with a learned monitor; in constrained safety settings an almost-optimal rule-based monitor recovers safe actions from 80% of traces that would have failed, lifting success from 46% to 96%. Both gains come with no drop in base task performance across two model families and three domains, and ablations confirm the cue training itself does not degrade reasoning accuracy.

Referee Report

1 major / 2 minor

Summary. The paper introduces Behavior Cue Reasoning (BCR), in which LLMs are fine-tuned to emit special token sequences ('Behavior Cues') immediately before target behaviors. These cues serve as signals for external monitors, enabling token pruning (up to 50% in math domains with no accuracy loss) and recovery of safe actions (80% of otherwise unsafe traces, raising success from 46% to 96% in constrained environments). Results are reported across two model families and three domains, with ablations indicating preserved core reasoning performance and code to be released.

Significance. If the results hold, the work advances scalable oversight by showing that the model itself can be trained to produce more monitorable reasoning traces, improving both efficiency and safety without performance trade-offs. The inclusion of ablations verifying no degradation and the planned code release strengthen reproducibility.

major comments (1)

[Experiments and ablations] The fine-tuning procedure for cue emission and the exact experimental controls (hyperparameters, dataset construction, and baseline comparisons) are insufficiently detailed to fully substantiate the claim of no degradation to reasoning performance; this detail is load-bearing for the central 'no cost to performance' assertion across domains.

minor comments (2)

[Safety experiments] Clarify the precise definition and construction of the 'almost optimal rule-based monitor' used for the safety recovery experiments, including how the 80% recovery rate was measured.
[Results] The abstract and main text would benefit from explicit statements on the number of runs, random seeds, and statistical significance for the reported gains (50% pruning, 46% to 96% success).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment, recognition of the work's significance for scalable oversight, and recommendation for minor revision. We address the major comment below and will incorporate clarifications to strengthen the manuscript.

read point-by-point responses

Referee: [Experiments and ablations] The fine-tuning procedure for cue emission and the exact experimental controls (hyperparameters, dataset construction, and baseline comparisons) are insufficiently detailed to fully substantiate the claim of no degradation to reasoning performance; this detail is load-bearing for the central 'no cost to performance' assertion across domains.

Authors: We agree that greater specificity on these elements would strengthen the paper and better support the central claim. In the revised manuscript, we will expand the Methods and Experimental Setup sections to include: the precise fine-tuning objective and procedure for cue emission (including loss formulation and training dynamics); complete hyperparameter tables for all models, domains, and ablations; detailed dataset construction protocols (including prompt templates, behavior labeling, and split statistics); and more granular baseline comparisons with quantitative metrics. These additions will directly address the load-bearing nature of the 'no cost' assertion. The planned code release will provide the full implementation for exact reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's claims rest on empirical results from fine-tuning LLMs to emit Behavior Cues before target behaviors and then measuring monitor performance on compressed cue views in math and constrained environments. Reported quantities such as 80% recovery of safe actions, doubling success from 46% to 96%, and 50% token pruning are direct experimental outcomes with ablations for preserved accuracy; they are not quantities defined in terms of fitted parameters from the same data, nor do they reduce via self-citation or ansatz to the inputs. The central premise of improved monitorability is validated externally through the described training and evaluation procedures rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the assumption that Behavior Cues can be trained into the model and that cue information suffices for monitoring decisions; these are domain assumptions without independent verification in the provided abstract.

axioms (2)

domain assumption LLMs can be fine-tuned via RL to emit specific token sequences before target behaviors without harming base performance.
Invoked to support the no-cost-to-performance claim.
domain assumption A compressed view consisting only of Behavior Cues supplies enough signal for effective monitor decisions.
Central to the reported pruning and recovery improvements.

invented entities (1)

Behavior Cues no independent evidence
purpose: Special token sequences emitted before behaviors to serve as signals and control levers.
Newly introduced construct in this work.

pith-pipeline@v0.9.0 · 5543 in / 1501 out tokens · 67435 ms · 2026-05-11T00:50:04.133299+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

When leveraged by an almost optimal rule-based monitor... recovery of safe actions from 80% of reasoning traces... doubling the success rate from 46% to 96%.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 17 canonical work pages · 3 internal anchors

[1]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review arXiv 2016
[2]

Building with extended thinking

Anthropic. Building with extended thinking. https://docs.claude.com/en/docs/ build-with-claude/extended-thinking, 2025

2025
[3]

The internal state of an LLM knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

2023
[4]

Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025

2025
[5]

Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

work page arXiv 2022
[6]

Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschen- brenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In International Conference on Machine Learning, 2024

2024
[7]

Textworld: A learning environment for text-based games

Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. InWorkshop on Computer Games, pages 41–75. Springer, 2018

2018
[8]

Tales: Text adventure learning environment suite, 2025

Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, and Marc- Alexandre Côté. Tales: Text adventure learning environment suite, 2025

2025
[9]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

2026
[10]

Group-in-Group Policy Optimization for LLM Agent Training

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

work page internal anchor Pith review arXiv 2025
[11]

Think before you speak: Training language models with pause tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations
[12]

Bowman, and Evan Hubinger

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 2024

2024
[13]

Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y

Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025

2025
[14]

Inter- active fiction games: A colossal adventure

Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Inter- active fiction games: A colossal adventure. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910, 2020

2020
[15]

Collapse of self-trained language models, 2024

David Herel and Tomas Mikolov. Collapse of self-trained language models, 2024

2024
[16]

The ends justify the thoughts: Rl-induced motivated reasoning in llm cots, 2026

Nikolaus Howe and Micah Carroll. The ends justify the thoughts: Rl-induced motivated reasoning in llm cots, 2026

2026
[17]

Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025

Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025. 10

work page arXiv 2025
[18]

Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

2024
[19]

well, keep thinking

Hyunbin Jin, Je Won Yeom, Seunghyun Bae, and Taesup Kim. “well, keep thinking”: Enhancing LLM reasoning with adaptive injection decoding. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computa- tional Linguistics: ACL 2025, pages 9989–10018, Vienna, Austria, July 2025. Association fo...

2025
[20]

Siegel, J’anos Kram’ar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D

Zachary Kenton, Noah Y . Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah. On scalable oversight with weak llms judging strong llms.arXiv preprint arXiv:2407.04622, 2024

work page arXiv 2024
[21]

Learning to insert [pause] tokens for better reasoning.arXiv preprint arXiv:2506.03616, 2025

Eunki Kim, Sangryul Kim, and James Thorne. Learning to insert [pause] tokens for better reasoning.arXiv preprint arXiv:2506.03616, 2025

work page arXiv 2025
[22]

Chain of thought monitorability: A new and fragile opportunity for ai safety, 2025

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

2025
[23]

Bowman, and Ethan Perez

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

2023
[24]

Early stopping chain-of-thoughts in large language models, 2025

Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models, 2025

2025
[25]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

2025
[26]

Mlgym: A new framework and benchmark for advancing ai research agents,

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vin- cent Moens, Amar Budhiraja, Despoina Magka, Vladislav V orotilov, Gaurav Chaurasia, et al. Mlgym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025

work page arXiv 2025
[27]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms/, 2024

2024
[28]

arXiv preprint arXiv:2411.13543 , year=

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuci ´nski, Lerrel Pinto, Rob Fergus, et al. Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

work page arXiv 2024
[29]

Logit-entropy adaptive stopping heuristic for efficient chain-of-thought reasoning, 2025

Mohammad Atif Quamar and Mohammad Areeb. Logit-entropy adaptive stopping heuristic for efficient chain-of-thought reasoning, 2025. 11

2025
[30]

Learning a continue-thinking token for enhanced test-time scaling

Liran Ringel, Elad Tolochinsky, and Yaniv Romano. Learning a continue-thinking token for enhanced test-time scaling. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Push- pak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors,Proceedings of the 14th International Joint Conference on Natural Lan...

2025
[31]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

2017
[32]

Backtracking for safety.arXiv preprint arXiv:2503.08919, 2025

Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, and Siddhartha Reddy Jonnalagadda. Backtracking for safety.arXiv preprint arXiv:2503.08919, 2025

work page arXiv 2025
[33]

Think just enough: Sequence-level entropy as a confidence signal for llm reasoning, 2025

Aman Sharma and Paras Chopra. Think just enough: Sequence-level entropy as a confidence signal for llm reasoning, 2025

2025
[34]

Alfworld: Aligning text and embodied environments for interactive learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations
[35]

Stop when enough: Adaptive early-stopping for chain-of-thought reasoning, 2025

Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early-stopping for chain-of-thought reasoning, 2025

2025
[36]

Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

Siao Tang, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

work page arXiv 2025
[37]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

2023
[38]

Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

2022
[39]

BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens.ArXiv, abs/2508.17196, 2025

Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. Budgetthinker: Empowering budget-aware llm reasoning with control tokens.arXiv preprint arXiv:2508.17196, 2025

work page arXiv 2025
[40]

Effec- tively Controlling Reasoning Models Through Thinking Intervention

Tong Wu, Chong Xiang, Jiachen T Wang, G Edward Suh, and Prateek Mittal. Effectively controlling reasoning models through thinking intervention.arXiv preprint arXiv:2503.24370, 2025

work page arXiv 2025
[41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Test-time prompt intervention, 2025

Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, and Weiping Wang. Test-time prompt intervention, 2025

2025
[43]

arXiv preprint arXiv:2504.15895 , year=

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models.arXiv preprint arXiv:2504.15895, 2025

work page arXiv 2025
[44]

Swe-agent: agent-computer interfaces enable automated software engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: agent-computer interfaces enable automated software engineering. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 50528–50652, 2024

2024
[45]

Safe reinforcement learning with natural language constraints.Advances in Neural Information Processing Systems, 34:13794–13808, 2021

Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, and Karthik Narasimhan. Safe reinforcement learning with natural language constraints.Advances in Neural Information Processing Systems, 34:13794–13808, 2021. 12

2021
[46]

Step back to leap forward: Self-backtracking for boosting reasoning of language models.arXiv:2502.04404, 2025a

Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan- Zhe Guo, and Yu-Feng Li. Step back to leap forward: Self-backtracking for boosting reasoning of language models.arXiv preprint arXiv:2502.04404, 2025

work page arXiv 2025
[47]

debug-gym: A text-based environment for interactive debugging, 2025

Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, and Marc-Alexandre Côté. debug-gym: A text-based environment for interactive debugging, 2025

2025
[48]

Reasoning models know when they’re right: Probing hidden states for self-verification.arXiv preprint arXiv:2504.05419, 2025

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification.arXiv preprint arXiv:2504.05419, 2025

work page arXiv 2025
[49]

Xingsheng Zhang, Luxi Xing, Chen Zhang, Yanbing Liu, Yifan Deng, Yunpeng Li, Yue Hu, and Chenxu Niu. Can we steer reasoning direction by thinking intervention? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3888–3913, Suzhou, China, Nov...

2025
[50]

Backtracking improves generation safety

Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M Bikel, Jason E Weston, and Eric Michael Smith. Backtracking improves generation safety. InThe Thirteenth International Conference on Learning Representations
[51]

monitorability tax

Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, and Biwei Huang. Activation control for efficiently eliciting long chain-of-thought ability of language models.arXiv preprint arXiv:2505.17697, 2025. A Embedded Answer Fixation When initially attempting to train models to perform Behavior Cue Reasoning, our initial approaches were pure promp...

work page arXiv 2025