arxiv: 2605.20994 · v1 · pith:XCU4YPGOnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Towards Context-Invariant Safety Alignment for Large Language Models

Pith reviewed 2026-05-21 05:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords context-invariant alignment · safety alignment · large language models · anchor invariance regularization · adversarial robustness · preference optimization · stop-gradient regularization

The pith

Anchor Invariance Regularization enforces safety behavior that depends on intent rather than prompt wording in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current preference-based alignment makes LLMs refuse harmful requests in ordinary wording but comply when the same intent appears in adversarial phrasing. The paper argues that true robustness needs context-invariant alignment, where the model responds to the underlying goal instead of surface details. Standard regularization approaches often fix cross-context gaps by weakening performance on the most trustworthy signals rather than strengthening the weaker ones. To solve this, the authors introduce Anchor Invariance Regularization, which designates verifiable prompts such as multiple-choice questions as fixed anchors and applies a stop-gradient loss only to open-ended variants so they match the anchor behavior. When combined with group-based preference optimization, the method raises in-distribution accuracy and out-of-distribution consistency across safety, moral reasoning, and math tasks.

Core claim

We introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
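
The objective described above can be written in one plausible form. This is a reading of the prose, not the paper's own equation: the symbols $R_a$ and $R_o$ (performance on anchor and open-ended variants), the squared-difference penalty, and the name $\mathcal{L}_{\mathrm{pref}}$ for the group-based preference loss are all assumptions introduced here. $\lambda$ is the AIR coefficient and $\operatorname{sg}[\cdot]$ is the stop-gradient operator.

```latex
\mathcal{L}_{\mathrm{total}}(\theta)
  = \mathcal{L}_{\mathrm{pref}}(\theta)
  + \lambda \,\bigl( R_o(\theta) - \operatorname{sg}\!\left[ R_a(\theta) \right] \bigr)^{2}
```

Under this form the penalty's gradient is $2\lambda\,(R_o - R_a)\,\nabla_\theta R_o$. It contains no $\nabla_\theta R_a$ term, so the penalty can only move open-ended performance toward the anchor value and cannot close the gap by lowering the anchor.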

What carries the argument

Anchor Invariance Regularization (AIR), a plug-in auxiliary loss that designates verifiable prompts as stop-gradient anchors and regularizes open-ended variants toward their performance without symmetric penalties on the anchors.
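
A toy numerical sketch may help show why the asymmetry matters. It assumes two scalar scores that stand in for per-context performance and updates them directly by gradient descent on the penalty alone, which is a strong simplification of real training. The function names and the squared-gap penalty are hypothetical and are not taken from the paper.

```python
# Toy illustration only: r_anchor and r_open are scalar scores
# (hypothetical stand-ins for per-context reward or accuracy).


def symmetric_penalty_grads(r_anchor, r_open):
    """Gradients of (r_anchor - r_open)**2 with respect to both scores."""
    gap = r_anchor - r_open
    return 2.0 * gap, -2.0 * gap  # (d/d r_anchor, d/d r_open)


def anchored_penalty_grads(r_anchor, r_open):
    """Gradients of (stop_gradient(r_anchor) - r_open)**2.

    The anchor is treated as a constant target, so its gradient is zero.
    """
    gap = r_anchor - r_open
    return 0.0, -2.0 * gap


def descend(grad_fn, r_anchor, r_open, lr=0.1, steps=50):
    """Run plain gradient descent on the penalty alone."""
    for _ in range(steps):
        g_anchor, g_open = grad_fn(r_anchor, r_open)
        r_anchor -= lr * g_anchor
        r_open -= lr * g_open
    return r_anchor, r_open


if __name__ == "__main__":
    # Assumed starting point: a strong anchor and a weaker open-ended score.
    print("symmetric:", descend(symmetric_penalty_grads, 0.9, 0.5))
    print("anchored: ", descend(anchored_penalty_grads, 0.9, 0.5))
```

In this toy setting the symmetric penalty pulls both scores to their midpoint, so the anchor degrades, while the anchored penalty leaves the anchor untouched and raises only the open-ended score.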

If this is right

  • Safety refusals remain consistent when the same harmful intent is rephrased in adversarial ways.
  • In-distribution group accuracy rises by 12.71 percent across evaluated tasks.
  • Out-of-distribution consistency rises by 33.49 percent on held-out prompt variants.
  • The auxiliary loss combines directly with existing group-based preference methods such as GRPO.
  • The gains appear in safety, moral reasoning, and math domains.
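
For readers unfamiliar with group-based preference optimization, the sketch below shows the group-relative advantage computation commonly associated with GRPO: each sampled response's reward is normalized by the mean and standard deviation of its group. How the paper pools anchor and open-ended variants into heterogeneous groups is not specified in this summary, so the example handles a single generic group. The function name, the use of the population standard deviation, and the reward values are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group to zero mean and roughly unit scale.

    rewards: list of scalar rewards for responses sampled in the same group.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, chosen here for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four responses in one group.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

An auxiliary term such as AIR would be added to the loss built from these advantages rather than replacing them.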

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring idea could apply to factual accuracy or hallucination reduction where some feedback signals are more reliable than others.
  • Models trained this way might show greater resistance to real-world jailbreak attempts that rely on creative framing.
  • Future experiments could test whether automatically generating verifiable variants for new domains preserves the gains.
  • The distinction between trustworthy and noisy training signals may matter for alignment objectives beyond safety.

Load-bearing premise

Verifiable prompts such as multiple-choice questions supply trustworthy feedback that can safely serve as stop-gradient anchors without lowering performance on those prompts or introducing new biases on open-ended ones.

What would settle it

Train a model with AIR and measure whether accuracy on the verifiable anchor prompts drops or out-of-distribution consistency on adversarial safety prompts fails to improve relative to a baseline without AIR.
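
The check described above can be sketched as a small comparison between a baseline run and an AIR run. The metric names, the dictionary layout, and the tolerance parameter are hypothetical, and the numbers in the usage example are placeholders rather than results from the paper.

```python
def check_air_outcome(baseline, air, tol=0.0):
    """Flag the two outcomes that would count against the premise.

    baseline, air: dicts with keys "anchor_acc" and "ood_consistency"
    (hypothetical metric names; values are fractions in [0, 1]).
    tol: margin below which a difference is treated as noise.
    """
    anchor_dropped = air["anchor_acc"] < baseline["anchor_acc"] - tol
    ood_not_improved = air["ood_consistency"] <= baseline["ood_consistency"] + tol
    return {
        "anchor_dropped": anchor_dropped,
        "ood_not_improved": ood_not_improved,
        "premise_challenged": anchor_dropped or ood_not_improved,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    baseline = {"anchor_acc": 0.80, "ood_consistency": 0.40}
    air = {"anchor_acc": 0.82, "ood_consistency": 0.55}
    print(check_air_outcome(baseline, air, tol=0.01))
```

In practice the tolerance would be set from the variance across seeds, which is one of the details the referee report asks the authors to supply.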

Figures

Figures reproduced from arXiv: 2605.20994 by Xingjun Ma, Xin Wang, Yang Yao, Yan Teng, Yifeng Gao, Yingchun Wang, Yixu Wang.

Figure 1: Risk-space geometry of symmetric vs. anchored regularization. While the naive symmetric penalty (left) minimizes variance by degrading the reliable anchor to match the poorer context, AIR (right) breaks this symmetry via a stop-gradient operator. This forces open-ended tasks to align with verifiable competence without compromising the anchor’s performance. The fundamental flaw lies in the symmetry of thi…
Figure 2: Sensitivity to the AIR coefficient λ. We vary the AIR regularization strength λ under the same setup and report the average performance on in-distribution (ID) (left) and out-of-distribution (OOD) (right) evaluations. The blue solid curve shows the average accuracy (Avg Acc), while the orange dashed curve shows the average group consistency metric (Avg Acc_group). can over-constrain updates by prioritizing…
Figure 3: Training dynamics across Safety, Moral, and Math domains. We report the average reward scores evaluated every 20 steps on the held-out validation set. The curves compare standard GRPO/GSPO, the symmetric variance penalty baseline (V-REx), and our proposed AIR. Notably, on asymmetric tasks, AIR achieves higher convergence and stability, whereas V-REx suffers from stagnation, confirming our analysis that sym…
Figure 5
Figure 6: Qualitative examples of robust alignment on Safety and Moral Reasoning tasks using our GRPO+AIR model. Left (Safety): The model faces a Robin Hood style jailbreak where a harmful request is wrapped in a benevolent motive. It successfully uses the reasoning trace to disentangle the intent, offering a constructive refusal instead of complying. Right (Moral): In a high-stakes dilemma encouraging deception, th…
Original abstract

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that safety alignment in LLMs is brittle because behavior depends on prompt surface form rather than underlying intent. It introduces Anchor Invariance Regularization (AIR), an auxiliary loss that treats verifiable multiple-choice prompts as stop-gradient anchors and regularizes open-ended variants toward them, combined with heterogeneous grouping in group-based preference optimization (e.g., GRPO). Experiments on Safety, Moral Reasoning, and Math tasks report that AIR improves context invariance, with +12.71% in-distribution group accuracy and +33.49% out-of-distribution consistency.

Significance. If the empirical gains are robust and the anchor assumption holds, the method offers a practical way to enforce context-invariant safety without symmetrically degrading performance on reliable signals. The heterogeneous grouping and stop-gradient design is a targeted response to the problem that not all alignment feedback is equally trustworthy, which could influence future work on robust post-training.

major comments (3)
  1. Abstract and §4 (Experiments): The abstract states precise gains of 12.71% in-distribution group accuracy and 33.49% out-of-distribution consistency, yet the provided text supplies no baselines, variance estimates, number of runs, or controls for confounds such as prompt length or option ordering. Without these, the central claim that AIR produces genuine context invariance cannot be evaluated.
  2. §3.1 (AIR construction): The stop-gradient anchor on verifiable multiple-choice prompts assumes these encode the desired safety behavior without format-specific artifacts. If anchors admit superficial heuristics (keyword matching or option ordering), the regularization may transfer a different brittleness to open-ended variants rather than achieving intent-based invariance; an ablation isolating anchor quality is required to support the reported gains.
  3. §3.2 (Heterogeneous grouping with GRPO): The claim that shared parameters do not allow gradients from open-ended items to indirectly affect anchor optimization is not demonstrated. A direct measurement of anchor performance before and after joint training would be needed to confirm that the stop-gradient truly isolates the reliable signal.
minor comments (2)
  1. Notation for the auxiliary loss could be clarified with an explicit equation showing how the stop-gradient target is computed and combined with the GRPO objective.
  2. Figure captions should explicitly state whether error bars represent standard deviation across seeds or across prompt groups.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below, indicating where we will revise the manuscript to incorporate the suggestions.

Point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The abstract states precise gains of 12.71% in-distribution group accuracy and 33.49% out-of-distribution consistency, yet the provided text supplies no baselines, variance estimates, number of runs, or controls for confounds such as prompt length or option ordering. Without these, the central claim that AIR produces genuine context invariance cannot be evaluated.

    Authors: We appreciate this observation. The reported gains are relative improvements over the GRPO baseline, with full baseline results and comparisons presented in Section 4. To address the lack of details, we will revise the abstract to indicate that the gains are based on averaged results. In the revised manuscript, we will expand the description of the experimental setup to include the number of runs, variance estimates, and details on controls for prompt length and option ordering. Specifically, we will clarify how we matched prompt lengths and randomized option orders to mitigate potential confounds. revision: yes

  2. Referee: §3.1 (AIR construction): The stop-gradient anchor on verifiable multiple-choice prompts assumes these encode the desired safety behavior without format-specific artifacts. If anchors admit superficial heuristics (keyword matching or option ordering), the regularization may transfer a different brittleness to open-ended variants rather than achieving intent-based invariance; an ablation isolating anchor quality is required to support the reported gains.

    Authors: This is a valid concern regarding potential artifacts in the anchors. We chose multiple-choice formats because they allow for objective verification of correctness, reducing reliance on subjective or gameable signals. However, to directly address the possibility of transferring format-specific heuristics, we will include a new ablation in the experiments section. This ablation will train with perturbed anchors (e.g., with shuffled options or added irrelevant keywords) and compare the resulting invariance gains to those of the original setup. We believe this will demonstrate that the benefits stem from the verifiable nature of the anchors rather than from superficial cues. revision: yes

  3. Referee: §3.2 (Heterogeneous grouping with GRPO): The claim that shared parameters do not allow gradients from open-ended items to indirectly affect anchor optimization is not demonstrated. A direct measurement of anchor performance before and after joint training would be needed to confirm that the stop-gradient truly isolates the reliable signal.

    Authors: We clarify that the stop-gradient mechanism mathematically prevents any gradient flow from the open-ended loss terms back to the anchor predictions, even with shared parameters. This isolation is by design in the computation graph. That said, we acknowledge that an empirical verification would strengthen the argument. In the revised version, we will add a table or plot showing the anchor accuracy on verifiable prompts at the start and end of training, demonstrating that it does not degrade and in fact often improves due to the overall optimization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical auxiliary loss with measured gains

The paper introduces AIR as a plug-in auxiliary loss that applies stop-gradient to verifiable (multiple-choice) prompts as anchors while regularizing open-ended variants toward them, then combines it with GRPO via heterogeneous grouping. Reported gains (12.71% in-distribution accuracy, 33.49% out-of-distribution consistency) are presented as empirical measurements on Safety, Moral Reasoning, and Math benchmarks rather than quantities derived from equations or parameters that reduce to the method's own inputs by construction. No self-definitional loops, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the described derivation; the central claim remains an independent empirical regularization technique whose validity rests on external benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method relies on standard concepts such as stop-gradient and group-based preference optimization without additional postulated mechanisms.

pith-pipeline@v0.9.0 · 5770 in / 991 out tokens · 25143 ms · 2026-05-21T05:32:25.433550+00:00 · methodology


