arxiv: 2606.18327 · v1 · pith:XNCMYB5Tnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Pith reviewed 2026-06-27 01:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords self-consistency training · reinforcement learning · language model alignment · model transparency · constitutional AI · bias reporting · refusal prediction

The pith

Language models trained for self-consistency produce explanations that better match their behavior on new inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-CTRL uses reinforcement learning to enforce consistency between what language models say about their behavior and what they actually do. In a probabilistic reasoning setup, this raises the correlation between self-reported biases and measured behavior from R²=0.24 to R²=0.64 on held-out distributions. In a constitutional AI setup, it generates rules that raise a third-party auditor's refusal-prediction accuracy from 36% to 92%, and behavior updates cut the HarmBench failure rate from 15.0% to 0.5%. The method works in either direction: it updates explanations to better predict behavior, or updates behavior to better match explanations.

Core claim

Self-Consistency Training with Reinforcement Learning optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, in a formal probabilistic reasoning task, consistency training improves the correlation between self-reported and behaviorally-measured latent biases from R²=0.24 to R²=0.64 on held-out distributions. Second, in a constitutional AI domain, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests and improves alignment by reducing HarmBench failure rate from 15.0% to 0.5%.
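
To make the bias-correlation metric concrete, here is a minimal sketch of how self-reported and behaviorally measured biases could be compared. It assumes that R² means the squared Pearson correlation and that rollouts are encoded as "H"/"T" strings; the data, the encoding, and the function names are hypothetical and are not taken from the paper.

```python
def empirical_bias(rollouts):
    """Fraction of heads ("H") among a model's sampled outcomes for one coin."""
    return sum(1 for r in rollouts if r == "H") / len(rollouts)


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences.

    Assumption: this is the R^2 being reported; the paper may instead use
    a coefficient of determination against the line y = x.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical data: one self-reported bias and one batch of rollouts per coin.
reported = [0.2, 0.5, 0.8]
rollouts = [list("HTTTT"), list("HHTTT"), list("HHHHT")]

measured = [empirical_bias(r) for r in rollouts]
score = r_squared(reported, measured)
print(f"R^2 between reported and measured bias: {score:.3f}")
```

A score near 1 means the stated biases track the sampled behavior closely; the reported lift from 0.24 to 0.64 is a change in this kind of agreement on held-out distributions.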

What carries the argument

Self-CTRL, a reinforcement learning procedure that updates either self-explanations or model behavior to increase their mutual consistency.
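
The two update directions can be sketched as a single routing step. This is an illustration under stated assumptions, not the authors' implementation: it assumes the consistency score is the fraction of jurors answering "Yes", and that λ (written `lam`) is the probability that a given step trains behavior rather than explanations. The page only confirms that λ=0 denotes explanation updates, λ=1 behavior updates, and λ=0.5 mixed updates; λ could equally be a loss weight.

```python
import random


def jury_consistency(votes):
    """Consistency score: fraction of juror votes equal to "Yes".

    Assumption: averaging binary votes is one plausible way to score
    agreement between a stated rule and a response.
    """
    return sum(1 for v in votes if v == "Yes") / len(votes)


def pick_update_target(lam, rng):
    """Choose which side receives the consistency reward on this step.

    Assumption: lam is treated as the probability of a behavior update,
    so lam=0 always trains explanations and lam=1 always trains behavior.
    """
    return "behavior" if rng.random() < lam else "explanation"


rng = random.Random(0)
reward = jury_consistency(["Yes", "No", "Yes", "Yes"])
target = pick_update_target(0.5, rng)
print(f"reward={reward:.2f}, update target={target}")
```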

If this is right

  • Consistency training achieves generalization on bias reporting comparable to direct ground-truth supervision.
  • Self-generated rules enable high-accuracy prediction of refusal behavior by external auditors.
  • Behavior updates via consistency reduce harmful responses on benchmarks while preserving appropriate compliance on safe inputs.
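
The auditor-prediction claim above can be made concrete with a short scoring sketch. It assumes the auditor emits one refuse/comply prediction per held-out request and that accuracy is split by the model's actual behavior; the function name and the example labels are hypothetical.

```python
def auditor_accuracy(predicted_refusal, actual_refusal):
    """Score an auditor's refusal predictions against observed behavior.

    Both arguments are lists of booleans (True = refuse). Returns overall
    accuracy plus accuracy restricted to requests the model actually
    refused and to requests it actually complied with.
    """
    pairs = list(zip(predicted_refusal, actual_refusal))
    overall = sum(p == a for p, a in pairs) / len(pairs)
    on_refusals = [p == a for p, a in pairs if a]
    on_compliance = [p == a for p, a in pairs if not a]
    return {
        "overall": overall,
        "refusal": sum(on_refusals) / len(on_refusals) if on_refusals else None,
        "compliance": sum(on_compliance) / len(on_compliance) if on_compliance else None,
    }


# Hypothetical labels for five held-out requests.
predicted = [True, True, False, False, True]
actual = [True, False, False, False, True]
result = auditor_accuracy(predicted, actual)
print(result)
```

Reporting the two conditional accuracies separately guards against an auditor that scores well overall simply by predicting the majority behavior.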

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying this consistency objective during pretraining or fine-tuning could scale transparency benefits to larger models.
  • Combining explanation updates and behavior updates in a single training loop might produce even stronger alignment.
  • The approach offers a path to auditability that relies less on external human labels for what the model should do.

Load-bearing premise

The observed gains come specifically from the consistency optimization rather than from other aspects of the reinforcement learning setup or data used.

What would settle it

Train a control model using the same reinforcement learning procedure but with a different reward signal unrelated to consistency, then measure whether bias correlation and refusal prediction accuracy still improve on held-out data.
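
One possible way to build such a control reward, offered as a sketch rather than as the design the paper or Pith has in mind, is to shuffle the consistency rewards across samples. The control run then sees the same reward values overall, but each reward is decoupled from the explanation and behavior pair that earned it. The function name and the example rewards are hypothetical.

```python
import random


def shuffled_control_rewards(consistency_rewards, seed=0):
    """Return the same reward values in a random order.

    Assumption: permuting rewards across samples is an acceptable
    "unrelated to consistency" control because it preserves the reward
    distribution while breaking the link to each individual sample.
    """
    rng = random.Random(seed)
    control = list(consistency_rewards)  # copy so the input is not mutated
    rng.shuffle(control)
    return control


rewards = [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
control = shuffled_control_rewards(rewards)
print(control)
```

If bias correlation and refusal-prediction accuracy improve under the shuffled rewards as much as under the real ones, the gains would not be attributable to the consistency signal.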

Figures

Figures reproduced from arXiv: 2606.18327 by Belinda Z. Li, Itamar Pres, Jacob Andreas, Laura Ruis, Melat Ghebreselassie.

Figure 1
Self-CTRL aligns what models say with what they do. Real examples from our constitutional setting. Explanation training to maximize the consistency function ϕ causes LM-generated rules to be predictive of their responses; behavior training to maximize ϕ causes responses to match the LM-generated rule.
Figure 2
Self-CTRL improves self-reporting of latent coin biases. Each point represents one coin. FS denotes fully supervised coins, EC denotes experimental coins used for Self-CTRL, and H denotes held-out coins. The top row compares articulated bias to the model’s empirical rollout bias, while the bottom row compares articulated bias to ground-truth bias. The closer the points are to y = x, the better. Columns acr…
Figure 3
Self-CTRL improves agreement between stated principles and behavior. Average consistency reward increases for explanation training, behavior training, and mixed updates across validation, held-out-category, and held-out-principle splits. Held-out improvements suggest that consistency training generalizes beyond the exact categories and principles seen during training.
Figure 4
Self-CTRL improves the safety–simulatability Pareto frontier. We evaluate Self-CTRL in the constitutional setting, plotting HarmBench safety (1 − ASR; higher is safer) against normalized simulatability gain (NSG; higher means explanations better predict behavior). Compared to the untrained model and supervised explanation-only or behavior-only baselines, Self-CTRL shifts the Pareto frontier upward: explana…
Figure 5
Self-CTRL improves counterfactual simulatability. We give generated explanations to a third-party LM, which generates counterfactual requests that should be refused or complied with if the explanation is faithful to behavior. We then test whether model responses match these labels. Across explanation and mixed training, Self-CTRL improves model refusal accuracy while preserving high compliance accuracy. Beh…
Figure 6
Self-CTRL does not lead to major over-refusal or MMLU decrease. Left: compliance rate on non-toxic WildChat prompts. Explanation (λ=0) and mixed (λ=0.5) updates preserve non-refusal, while behavior (λ=1.0) updates cause only a small decrease. Right: MMLU accuracy (n=200) stays within ∼2 points of the base model across all settings.
Figure 7
Consistency training improves explanation–behavior agreement for Qwen3-8B. Average jury consistency increases during both explanation and behavior updates. Each panel shows one update direction, with validation, held-out-category, and held-out-principle splits. Improvements on the held-out splits suggest that consistency training transfers to new request categories and new principles within familiar catego…
Figure 8
Qwen provides little boundary signal during explanation training. Jury disagreement and refusal behavior (as judged by Gemini 2.5 Flash) over training with λ=0. Llama begins with high jury disagreement and a substantial refusal rate, giving Self-CTRL a signal for refining rules around the comply/refuse boundary. In contrast, Qwen’s jury is near-unanimous from the first step and its refusal rate remains low…
Figure 9
Self-CTRL for Qwen3-8B modestly improves simulatability, while behavior updates improve safety. We plot HarmBench safety (1 − ASR; higher is safer) against normalized simulatability gain (NSG; higher means explanations better predict behavior). Because Qwen3-8B is highly permissive before training, explanation-only updates yield limited gains, while behavior and mixed training move the model toward safer r…
Figure 10
Consistency training for Qwen3-8B improves counterfactual consistency relative to the baseline, with limited simulatability across the board. We compare the base model against variants that update either explanations or behaviors. Unlike with Llama, Qwen3’s lack of refusal on the dataset leads to poor simulatability on the refusal side.
Figure 11
Qwen maintains capabilities while becoming less compliant on benign prompts. MMLU accuracy is stable, while non-toxic compliance decreases under mixed and behavior training.
Figure 12
Behavior training makes behavior more predictable even without the explanation. Unconditional predictor accuracy, with no access to the stated rule, rises under behavior training (λ=1, 78.5) and the behavior baseline (81.5), but not explanation training (λ=0, 73.2 versus 73.0 base). Thus, the gains accrue to the no-explanation baseline and depress NSG. Since Self-CTRL aligns behaviors to explanations rega…
Original abstract

Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally-measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Self-CTRL, a reinforcement learning approach that optimizes consistency between a language model's self-explanations and its behavior on related inputs, either by updating explanations to match behavior or vice versa. In a probabilistic reasoning task, consistency training raises the correlation between self-reported and behaviorally measured latent biases from R²=0.24 to R²=0.64 on held-out distributions, matching direct ground-truth supervision. In a constitutional AI setting, it improves a third-party auditor's refusal prediction accuracy from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5% without substantially raising refusals on harmless prompts.

Significance. If the empirical results hold under closer scrutiny, the work demonstrates a bidirectional consistency objective that can produce faithful self-descriptions and aligned behavior at a level comparable to direct supervision. This supplies a concrete recipe for improving transparency and controllability in LMs without requiring ground-truth labels for every case, with potential applicability to auditing and alignment tasks.

major comments (2)
  1. [Section 4] Section 4 (Probabilistic Reasoning Experiments): the reported R² lift from 0.24 to 0.64 on held-out distributions is central to the claim of generalization matching direct supervision, yet the manuscript provides no statistical significance tests, standard errors, or number of held-out distributions; without these, it is impossible to determine whether the improvement is robust or could arise from sampling variance.
  2. [Section 5] Section 5 (Constitutional AI Experiments): the reduction in HarmBench failure rate from 15.0% to 0.5% is presented as resulting from behavior updates, but the text does not report an ablation isolating the consistency reward from other RL fine-tuning effects; this leaves open whether the alignment gain is attributable to the proposed bidirectional objective.
minor comments (2)
  1. The abstract and method sections would benefit from an explicit statement of the RL algorithm (e.g., PPO hyperparameters) and the precise form of the consistency reward function.
  2. Figure captions for the auditor accuracy and HarmBench plots should include the number of evaluation prompts and the identity of the third-party auditor model.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [Section 4] Section 4 (Probabilistic Reasoning Experiments): the reported R² lift from 0.24 to 0.64 on held-out distributions is central to the claim of generalization matching direct supervision, yet the manuscript provides no statistical significance tests, standard errors, or number of held-out distributions; without these, it is impossible to determine whether the improvement is robust or could arise from sampling variance.

    Authors: We agree that reporting the number of held-out distributions, standard errors, and statistical significance tests is important for assessing robustness. In the revised manuscript we will specify the exact number of held-out distributions, include standard errors (computed across multiple random seeds or bootstrap resampling), and add appropriate significance tests (e.g., paired t-tests or permutation tests) comparing the baseline and Self-CTRL R² values. revision: yes

  2. Referee: [Section 5] Section 5 (Constitutional AI Experiments): the reduction in HarmBench failure rate from 15.0% to 0.5% is presented as resulting from behavior updates, but the text does not report an ablation isolating the consistency reward from other RL fine-tuning effects; this leaves open whether the alignment gain is attributable to the proposed bidirectional objective.

    Authors: We acknowledge that an ablation isolating the consistency reward from generic RL fine-tuning effects would strengthen the causal attribution. In the revised manuscript we will add an ablation comparing (i) standard RL fine-tuning without the consistency term against (ii) the full Self-CTRL objective, reporting HarmBench failure rates and refusal rates on harmless prompts for both. revision: yes
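
The uncertainty estimate promised in response 1 could look like the following bootstrap over held-out coins. This is a sketch under stated assumptions: it treats R² as the squared Pearson correlation, assumes explanation training leaves the measured behavior unchanged so one measured bias per coin serves both models, and uses made-up data. None of the names or numbers come from the manuscript.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation; None if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy * sxy / (sxx * syy)


def bootstrap_r2_gain(coins, n_boot=1000, seed=0):
    """95% percentile interval for R2_after - R2_before.

    coins: list of (reported_before, reported_after, measured) tuples,
    one per held-out coin. Coins are resampled with replacement.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(n_boot):
        sample = [rng.choice(coins) for _ in coins]
        measured = [c[2] for c in sample]
        before = r_squared([c[0] for c in sample], measured)
        after = r_squared([c[1] for c in sample], measured)
        if before is None or after is None:
            continue  # skip degenerate resamples with no variance
        gains.append(after - before)
    gains.sort()
    return gains[int(0.025 * len(gains))], gains[int(0.975 * len(gains)) - 1]


# Hypothetical held-out coins: the trained model reports the measured bias
# exactly, while the base model's reports are only loosely related to it.
measured = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
before = [0.5, 0.4, 0.6, 0.5, 0.45, 0.55, 0.5, 0.6]
coins = [(b, m, m) for b, m in zip(before, measured)]

low, high = bootstrap_r2_gain(coins)
print(f"95% bootstrap interval for the R^2 gain: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero would indicate that the R² improvement is unlikely to be explained by which coins happened to land in the held-out set; variation across training seeds would need a separate analysis.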

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's method applies RL-based consistency optimization between explanations and behavior on training inputs, then reports performance lifts on explicitly held-out distributions (R² correlation), held-out requests (auditor accuracy), and standard benchmarks (HarmBench). These metrics are external to the training objective and not defined in terms of the consistency loss itself; the reported gains are measured against ground-truth supervision and third-party models rather than reducing to the inputs by construction. No self-citation load-bearing steps, self-definitional relations, or fitted-input predictions appear in the abstract or summary. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters or additional axioms. The approach builds on standard RL and LM training assumptions.

axioms (1)
  • domain assumption: Reinforcement learning updates can effectively optimize for consistency between explanations and behavior without unintended side effects.
    The method relies on this to achieve the reported improvements in both directions.

pith-pipeline@v0.9.1-grok · 5810 in / 1307 out tokens · 58331 ms · 2026-06-27T01:35:57.332833+00:00 · methodology

