pith. machine review for the scientific record. sign in

arxiv: 2605.07021 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.AI
keywords behavior cuesLLM reasoningscalable oversightmonitorabilityAI safetycontrollabilityreasoning efficiencyreinforcement learning
0
0 comments X

The pith

Training LLMs to emit Behavior Cues before behaviors makes reasoning monitorable, allowing recovery of safe actions from 80% of unsafe traces and doubling success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Behavior Cue Reasoning to make LLM reasoning more controllable by having models emit special token sequences right before specific behaviors. These cues provide signals that external monitors can use to prune inefficient or unsafe reasoning paths early. In tests, this setup lets monitors cut up to half the wasted tokens in math problems and recover safe outcomes in most cases that would otherwise fail, all while keeping the original performance level. If true, it offers a way to build oversight into the reasoning process itself rather than only checking the final output.

Core claim

The central discovery is that fine-tuning LLMs to produce Behavior Cues immediately prior to target behaviors provides sufficient information in a compressed form for monitors to improve efficiency by pruning 50% of reasoning tokens in math tasks and to enhance safety by recovering 80% of unsafe action proposals, thereby raising success rates from 46% to 96% in constrained environments without any loss to base reasoning capability.

What carries the argument

Behavior Cues: special token sequences that the model is trained to emit immediately before specific behaviors, serving as signals and control points for external monitoring and intervention.

If this is right

  • Monitors can prune up to 50% of wasted reasoning tokens in complex math problem solving based on cue information alone.
  • An almost optimal rule-based monitor recovers safe actions from 80% of traces ending in unsafe proposals, more than doubling success from 46% to 96%.
  • The approach works across two model families and three domains with no degradation to core performance.
  • Training the monitored model to reason in a more monitorable way advances scalable oversight techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If Behavior Cues prove stable under fine-tuning, they could be applied to guide reasoning in open-ended tasks where current monitors struggle.
  • The compressed cue view suggests that full reasoning traces may contain redundant information for oversight purposes, potentially reducing monitoring compute needs.
  • Combining this with other oversight methods might allow handling of more complex misalignments that emerge only at the end of long reasoning chains.

Load-bearing premise

LLMs can be fine-tuned to emit Behavior Cues right before the intended behaviors without any drop in their main reasoning quality, and that monitors can make effective decisions using only the information from those cues.

What would settle it

A result where the fine-tuned model emits cues inconsistently or at incorrect times, or where cue-based monitoring shows no significant improvement in pruning efficiency or safety recovery rates compared to monitoring full traces.

Figures

Figures reproduced from arXiv: 2605.07021 by Christopher Z. Cui, Prithviraj Ammanabrolu, Taylor W. Killian.

Figure 1
Figure 1. Figure 1: We take reasoning traces from a base actor model (1) and elicit the immediate working [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Validation success rate and percent of maximum possible token savings over the course [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
read the original abstract

Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations results in failure, \ours allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that \bcreasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code to be released at https://github.com/christopherzc/text-games

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Behavior Cue Reasoning (BCR), in which LLMs are fine-tuned to emit special token sequences ('Behavior Cues') immediately before target behaviors. These cues serve as signals for external monitors, enabling token pruning (up to 50% in math domains with no accuracy loss) and recovery of safe actions (80% of otherwise unsafe traces, raising success from 46% to 96% in constrained environments). Results are reported across two model families and three domains, with ablations indicating preserved core reasoning performance and code to be released.

Significance. If the results hold, the work advances scalable oversight by showing that the model itself can be trained to produce more monitorable reasoning traces, improving both efficiency and safety without performance trade-offs. The inclusion of ablations verifying no degradation and the planned code release strengthen reproducibility.

major comments (1)
  1. [Experiments and ablations] The fine-tuning procedure for cue emission and the exact experimental controls (hyperparameters, dataset construction, and baseline comparisons) are insufficiently detailed to fully substantiate the claim of no degradation to reasoning performance; this detail is load-bearing for the central 'no cost to performance' assertion across domains.
minor comments (2)
  1. [Safety experiments] Clarify the precise definition and construction of the 'almost optimal rule-based monitor' used for the safety recovery experiments, including how the 80% recovery rate was measured.
  2. [Results] The abstract and main text would benefit from explicit statements on the number of runs, random seeds, and statistical significance for the reported gains (50% pruning, 46% to 96% success).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment, recognition of the work's significance for scalable oversight, and recommendation for minor revision. We address the major comment below and will incorporate clarifications to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments and ablations] The fine-tuning procedure for cue emission and the exact experimental controls (hyperparameters, dataset construction, and baseline comparisons) are insufficiently detailed to fully substantiate the claim of no degradation to reasoning performance; this detail is load-bearing for the central 'no cost to performance' assertion across domains.

    Authors: We agree that greater specificity on these elements would strengthen the paper and better support the central claim. In the revised manuscript, we will expand the Methods and Experimental Setup sections to include: the precise fine-tuning objective and procedure for cue emission (including loss formulation and training dynamics); complete hyperparameter tables for all models, domains, and ablations; detailed dataset construction protocols (including prompt templates, behavior labeling, and split statistics); and more granular baseline comparisons with quantitative metrics. These additions will directly address the load-bearing nature of the 'no cost' assertion. The planned code release will provide the full implementation for exact reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's claims rest on empirical results from fine-tuning LLMs to emit Behavior Cues before target behaviors and then measuring monitor performance on compressed cue views in math and constrained environments. Reported quantities such as 80% recovery of safe actions, doubling success from 46% to 96%, and 50% token pruning are direct experimental outcomes with ablations for preserved accuracy; they are not quantities defined in terms of fitted parameters from the same data, nor do they reduce via self-citation or ansatz to the inputs. The central premise of improved monitorability is validated externally through the described training and evaluation procedures rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the assumption that Behavior Cues can be trained into the model and that cue information suffices for monitoring decisions; these are domain assumptions without independent verification in the provided abstract.

axioms (2)
  • domain assumption LLMs can be fine-tuned via RL to emit specific token sequences before target behaviors without harming base performance.
    Invoked to support the no-cost-to-performance claim.
  • domain assumption A compressed view consisting only of Behavior Cues supplies enough signal for effective monitor decisions.
    Central to the reported pruning and recovery improvements.
invented entities (1)
  • Behavior Cues no independent evidence
    purpose: Special token sequences emitted before behaviors to serve as signals and control levers.
    Newly introduced construct in this work.

pith-pipeline@v0.9.0 · 5543 in / 1501 out tokens · 67435 ms · 2026-05-11T00:50:04.133299+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 17 canonical work pages · 3 internal anchors

  1. [1]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

  2. [2]

    Building with extended thinking

    Anthropic. Building with extended thinking. https://docs.claude.com/en/docs/ build-with-claude/extended-thinking, 2025

  3. [3]

    The internal state of an LLM knows when it’s lying

    Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

  4. [4]

    Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025

  5. [5]

    Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

    Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

  6. [6]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschen- brenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In International Conference on Machine Learning, 2024

  7. [7]

    Textworld: A learning environment for text-based games

    Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. InWorkshop on Computer Games, pages 41–75. Springer, 2018

  8. [8]

    Tales: Text adventure learning environment suite, 2025

    Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, and Marc- Alexandre Côté. Tales: Text adventure learning environment suite, 2025

  9. [9]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  10. [10]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  11. [11]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations

  12. [12]

    Bowman, and Evan Hubinger

    Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 2024

  13. [13]

    Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y

    Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025

  14. [14]

    Inter- active fiction games: A colossal adventure

    Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Inter- active fiction games: A colossal adventure. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910, 2020

  15. [15]

    Collapse of self-trained language models, 2024

    David Herel and Tomas Mikolov. Collapse of self-trained language models, 2024

  16. [16]

    The ends justify the thoughts: Rl-induced motivated reasoning in llm cots, 2026

    Nikolaus Howe and Micah Carroll. The ends justify the thoughts: Rl-induced motivated reasoning in llm cots, 2026

  17. [17]

    Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025

    Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025. 10

  18. [18]

    Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

    Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

  19. [19]

    well, keep thinking

    Hyunbin Jin, Je Won Yeom, Seunghyun Bae, and Taesup Kim. “well, keep thinking”: Enhancing LLM reasoning with adaptive injection decoding. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computa- tional Linguistics: ACL 2025, pages 9989–10018, Vienna, Austria, July 2025. Association fo...

  20. [20]

    Siegel, J’anos Kram’ar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D

    Zachary Kenton, Noah Y . Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah. On scalable oversight with weak llms judging strong llms.arXiv preprint arXiv:2407.04622, 2024

  21. [21]

    Learning to insert [pause] tokens for better reasoning.arXiv preprint arXiv:2506.03616, 2025

    Eunki Kim, Sangryul Kim, and James Thorne. Learning to insert [pause] tokens for better reasoning.arXiv preprint arXiv:2506.03616, 2025

  22. [22]

    Chain of thought monitorability: A new and fragile opportunity for ai safety, 2025

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

  23. [23]

    Bowman, and Ethan Perez

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

  24. [24]

    Early stopping chain-of-thoughts in large language models, 2025

    Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models, 2025

  25. [25]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

  26. [26]

    Mlgym: A new framework and benchmark for advancing ai research agents,

    Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vin- cent Moens, Amar Budhiraja, Despoina Magka, Vladislav V orotilov, Gaurav Chaurasia, et al. Mlgym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499, 2025

  27. [27]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms/, 2024

  28. [28]

    arXiv preprint arXiv:2411.13543 , year=

    Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuci ´nski, Lerrel Pinto, Rob Fergus, et al. Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

  29. [29]

    Logit-entropy adaptive stopping heuristic for efficient chain-of-thought reasoning, 2025

    Mohammad Atif Quamar and Mohammad Areeb. Logit-entropy adaptive stopping heuristic for efficient chain-of-thought reasoning, 2025. 11

  30. [30]

    Learning a continue-thinking token for enhanced test-time scaling

    Liran Ringel, Elad Tolochinsky, and Yaniv Romano. Learning a continue-thinking token for enhanced test-time scaling. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Push- pak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors,Proceedings of the 14th International Joint Conference on Natural Lan...

  31. [31]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  32. [32]

    Backtracking for safety.arXiv preprint arXiv:2503.08919, 2025

    Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, and Siddhartha Reddy Jonnalagadda. Backtracking for safety.arXiv preprint arXiv:2503.08919, 2025

  33. [33]

    Think just enough: Sequence-level entropy as a confidence signal for llm reasoning, 2025

    Aman Sharma and Paras Chopra. Think just enough: Sequence-level entropy as a confidence signal for llm reasoning, 2025

  34. [34]

    Alfworld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations

  35. [35]

    Stop when enough: Adaptive early-stopping for chain-of-thought reasoning, 2025

    Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early-stopping for chain-of-thought reasoning, 2025

  36. [36]

    Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

    Siao Tang, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

  37. [37]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

  38. [38]

    Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

    Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

  39. [39]

    BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens.ArXiv, abs/2508.17196, 2025

    Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. Budgetthinker: Empowering budget-aware llm reasoning with control tokens.arXiv preprint arXiv:2508.17196, 2025

  40. [40]

    Effec- tively Controlling Reasoning Models Through Thinking Intervention

    Tong Wu, Chong Xiang, Jiachen T Wang, G Edward Suh, and Prateek Mittal. Effectively controlling reasoning models through thinking intervention.arXiv preprint arXiv:2503.24370, 2025

  41. [41]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  42. [42]

    Test-time prompt intervention, 2025

    Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, and Weiping Wang. Test-time prompt intervention, 2025

  43. [43]

    arXiv preprint arXiv:2504.15895 , year=

    Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models.arXiv preprint arXiv:2504.15895, 2025

  44. [44]

    Swe-agent: agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: agent-computer interfaces enable automated software engineering. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 50528–50652, 2024

  45. [45]

    Safe reinforcement learning with natural language constraints.Advances in Neural Information Processing Systems, 34:13794–13808, 2021

    Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, and Karthik Narasimhan. Safe reinforcement learning with natural language constraints.Advances in Neural Information Processing Systems, 34:13794–13808, 2021. 12

  46. [46]

    Step back to leap forward: Self-backtracking for boosting reasoning of language models.arXiv:2502.04404, 2025a

    Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan- Zhe Guo, and Yu-Feng Li. Step back to leap forward: Self-backtracking for boosting reasoning of language models.arXiv preprint arXiv:2502.04404, 2025

  47. [47]

    debug-gym: A text-based environment for interactive debugging, 2025

    Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, and Marc-Alexandre Côté. debug-gym: A text-based environment for interactive debugging, 2025

  48. [48]

    Reasoning models know when they’re right: Probing hidden states for self-verification.arXiv preprint arXiv:2504.05419, 2025

    Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification.arXiv preprint arXiv:2504.05419, 2025

  49. [49]

    Xingsheng Zhang, Luxi Xing, Chen Zhang, Yanbing Liu, Yifan Deng, Yunpeng Li, Yue Hu, and Chenxu Niu. Can we steer reasoning direction by thinking intervention? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3888–3913, Suzhou, China, Nov...

  50. [50]

    Backtracking improves generation safety

    Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M Bikel, Jason E Weston, and Eric Michael Smith. Backtracking improves generation safety. InThe Thirteenth International Conference on Learning Representations

  51. [51]

    monitorability tax

    Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, and Biwei Huang. Activation control for efficiently eliciting long chain-of-thought ability of language models.arXiv preprint arXiv:2505.17697, 2025. A Embedded Answer Fixation When initially attempting to train models to perform Behavior Cue Reasoning, our initial approaches were pure promp...