Towards Context-Invariant Safety Alignment for Large Language Models

Xingjun Ma; Xin Wang; Yang Yao; Yan Teng; Yifeng Gao; Yingchun Wang; Yixu Wang

arxiv: 2605.20994 · v1 · pith:XCU4YPGOnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Towards Context-Invariant Safety Alignment for Large Language Models

Yixu Wang , Yang Yao , Xin Wang , Yifeng Gao , Yan Teng , Xingjun Ma , Yingchun Wang This is my paper

Pith reviewed 2026-05-21 05:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords context-invariant alignmentsafety alignmentlarge language modelsanchor invariance regularizationadversarial robustnesspreference optimizationstop-gradient regularization

0 comments

The pith

Anchor Invariance Regularization enforces safety behavior that depends on intent rather than prompt wording in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current preference-based alignment makes LLMs refuse harmful requests in ordinary wording but comply when the same intent appears in adversarial phrasing. The paper argues that true robustness needs context-invariant alignment, where the model responds to the underlying goal instead of surface details. Standard regularization approaches often fix cross-context gaps by weakening performance on the most trustworthy signals rather than strengthening the weaker ones. To solve this, the authors introduce Anchor Invariance Regularization, which designates verifiable prompts such as multiple-choice questions as fixed anchors and applies a stop-gradient loss only to open-ended variants so they match the anchor behavior. When combined with group-based preference optimization, the method raises in-distribution accuracy and out-of-distribution consistency across safety, moral reasoning, and math tasks.

Core claim

We introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

What carries the argument

Anchor Invariance Regularization (AIR), a plug-in auxiliary loss that designates verifiable prompts as stop-gradient anchors and regularizes open-ended variants toward their performance without symmetric penalties on the anchors.

If this is right

Safety refusals remain consistent when the same harmful intent is rephrased in adversarial ways.
In-distribution group accuracy rises by 12.71 percent across evaluated tasks.
Out-of-distribution consistency rises by 33.49 percent on held-out prompt variants.
The auxiliary loss combines directly with existing group-based preference methods such as GRPO.
The gains appear in safety, moral reasoning, and math domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring idea could apply to factual accuracy or hallucination reduction where some feedback signals are more reliable than others.
Models trained this way might show greater resistance to real-world jailbreak attempts that rely on creative framing.
Future experiments could test whether automatically generating verifiable variants for new domains preserves the gains.
The distinction between trustworthy and noisy training signals may matter for alignment objectives beyond safety.

Load-bearing premise

Verifiable prompts such as multiple-choice questions supply trustworthy feedback that can safely serve as stop-gradient anchors without lowering performance on those prompts or introducing new biases on open-ended ones.

What would settle it

Train a model with AIR and measure whether accuracy on the verifiable anchor prompts drops or out-of-distribution consistency on adversarial safety prompts fails to improve relative to a baseline without AIR.

Figures

Figures reproduced from arXiv: 2605.20994 by Xingjun Ma, Xin Wang, Yang Yao, Yan Teng, Yifeng Gao, Yingchun Wang, Yixu Wang.

**Figure 1.** Figure 1: Risk-space geometry of symmetric vs. anchored regularization. While the naive symmetric penalty (left) minimizes variance by degrading the reliable anchor to match the poorer context, AIR (right) breaks this symmetry via a stop-gradient operator. This forces open-ended tasks to align with verifiable competence without compromising the anchor’s performance. The fundamental flaw lies in the symmetry of thi… view at source ↗

**Figure 2.** Figure 2: Sensitivity to the AIR coefficient λ. We vary the AIR regularization strength λ under the same setup and report the average performance on in-distribution (ID) (left) and out-of-distribution (OOD) (right) evaluations. The blue solid curve shows the average accuracy (Avg Acc), while the orange dashed curve shows the average group consistency metric (Avg Accgroup). can over-constrain updates by prioritizing… view at source ↗

**Figure 3.** Figure 3: Training dynamics across Safety, Moral, and Math domains. We report the average reward scores evaluated every 20 steps on the held-out validation set. The curves compare standard GRPO/GSPO, the symmetric variance penalty baseline (V-REx), and our proposed AIR. Notably, on asymmetric tasks, AIR achieves higher convergence and stability, whereas V-REx suffers from stagnation, confirming our analysis that sym… view at source ↗

**Figure 5.** Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative examples of robust alignment on Safety and Moral Reasoning tasks using our GRPO+AIR model. Left (Safety): The model faces a Robin Hood style jailbreak where a harmful request is wrapped in a benevolent motive. It successfully uses the reasoning trace to disentangle the intent, offering a constructive refusal instead of complying. Right (Moral): In a high-stakes dilemma encouraging deception, th… view at source ↗

read the original abstract

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AIR uses stop-gradient anchors on verifiable prompts to regularize open-ended variants for better context invariance, but the reported gains rest on unshown experimental details.

read the letter

The main point is that this paper introduces Anchor Invariance Regularization to make safety behavior less dependent on prompt surface form. It treats multiple-choice style prompts as fixed anchors via stop-gradient and pulls open-ended variants toward them, then combines the auxiliary loss with heterogeneous grouping inside group preference optimization like GRPO. This avoids the symmetric regularization problem where reliable signals get dragged down along with the noisy ones. The abstract reports lifts of 12.71% in-distribution group accuracy and 33.49% out-of-distribution consistency across safety, moral reasoning, and math tasks. That framing is practical and directly targets a known failure mode in deployed models. The asymmetric design and the decision to protect verifiable feedback are the clearest contributions here. The approach feels like a targeted engineering response rather than a broad theoretical claim. The soft spots sit mostly in the missing experimental layer. The abstract supplies percentage improvements without baselines, variance numbers, or controls for format-specific artifacts in the anchors. The stress-test concern lands: if multiple-choice prompts reward keyword matching or ordering biases, the regularization could simply move that brittleness to the open-ended cases instead of removing surface dependence. Shared parameters also raise the question of whether gradients from open-ended items stay fully isolated from the anchors. This work is aimed at alignment researchers who already use preference optimization and want to harden safety against rephrasing attacks. A reader running their own GRPO-style training would see immediate implementation ideas worth testing. I would send it to peer review. The motivation and the asymmetric mechanism are clear enough that referees can focus on whether the numbers hold up under scrutiny.

Referee Report

3 major / 2 minor

Summary. The paper argues that safety alignment in LLMs is brittle because behavior depends on prompt surface form rather than underlying intent. It introduces Anchor Invariance Regularization (AIR), an auxiliary loss that treats verifiable multiple-choice prompts as stop-gradient anchors and regularizes open-ended variants toward them, combined with heterogeneous grouping in group-based preference optimization (e.g., GRPO). Experiments on Safety, Moral Reasoning, and Math tasks report that AIR improves context invariance, with +12.71% in-distribution group accuracy and +33.49% out-of-distribution consistency.

Significance. If the empirical gains are robust and the anchor assumption holds, the method offers a practical way to enforce context-invariant safety without symmetrically degrading performance on reliable signals. The heterogeneous grouping and stop-gradient design is a targeted response to the problem that not all alignment feedback is equally trustworthy, which could influence future work on robust post-training.

major comments (3)

Abstract and §4 (Experiments): The abstract states precise gains of 12.71% in-distribution group accuracy and 33.49% out-of-distribution consistency, yet the provided text supplies no baselines, variance estimates, number of runs, or controls for confounds such as prompt length or option ordering. Without these, the central claim that AIR produces genuine context invariance cannot be evaluated.
§3.1 (AIR construction): The stop-gradient anchor on verifiable multiple-choice prompts assumes these encode the desired safety behavior without format-specific artifacts. If anchors admit superficial heuristics (keyword matching or option ordering), the regularization may transfer a different brittleness to open-ended variants rather than achieving intent-based invariance; an ablation isolating anchor quality is required to support the reported gains.
§3.2 (Heterogeneous grouping with GRPO): The claim that shared parameters do not allow gradients from open-ended items to indirectly affect anchor optimization is not demonstrated. A direct measurement of anchor performance before and after joint training would be needed to confirm that the stop-gradient truly isolates the reliable signal.

minor comments (2)

Notation for the auxiliary loss could be clarified with an explicit equation showing how the stop-gradient target is computed and combined with the GRPO objective.
Figure captions should explicitly state whether error bars represent standard deviation across seeds or across prompt groups.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below, indicating where we will revise the manuscript to incorporate the suggestions.

read point-by-point responses

Referee: Abstract and §4 (Experiments): The abstract states precise gains of 12.71% in-distribution group accuracy and 33.49% out-of-distribution consistency, yet the provided text supplies no baselines, variance estimates, number of runs, or controls for confounds such as prompt length or option ordering. Without these, the central claim that AIR produces genuine context invariance cannot be evaluated.

Authors: We appreciate this observation. The reported gains are relative improvements over the GRPO baseline, with full baseline results and comparisons presented in Section 4. To address the lack of details, we will revise the abstract to indicate that the gains are based on averaged results. In the revised manuscript, we will expand the description of the experimental setup to include the number of runs, variance estimates, and details on controls for prompt length and option ordering. Specifically, we will clarify how we matched prompt lengths and randomized option orders to mitigate potential confounds. revision: yes
Referee: §3.1 (AIR construction): The stop-gradient anchor on verifiable multiple-choice prompts assumes these encode the desired safety behavior without format-specific artifacts. If anchors admit superficial heuristics (keyword matching or option ordering), the regularization may transfer a different brittleness to open-ended variants rather than achieving intent-based invariance; an ablation isolating anchor quality is required to support the reported gains.

Authors: This is a valid concern regarding potential artifacts in the anchors. We chose multiple-choice formats because they allow for objective verification of correctness, reducing reliance on subjective or gameable signals. However, to directly address the possibility of transferring format-specific heuristics, we will include a new ablation in the experiments section. This ablation will involve training with anchors that have been perturbed (e.g., by shuffling options or adding irrelevant keywords) and compare the resulting invariance gains to the original setup. We believe this will demonstrate that the benefits stem from the verifiable nature rather than superficial cues. revision: yes
Referee: §3.2 (Heterogeneous grouping with GRPO): The claim that shared parameters do not allow gradients from open-ended items to indirectly affect anchor optimization is not demonstrated. A direct measurement of anchor performance before and after joint training would be needed to confirm that the stop-gradient truly isolates the reliable signal.

Authors: We clarify that the stop-gradient mechanism mathematically prevents any gradient flow from the open-ended loss terms back to the anchor predictions, even with shared parameters. This isolation is by design in the computation graph. That said, we acknowledge that an empirical verification would strengthen the argument. In the revised version, we will add a table or plot showing the anchor accuracy on verifiable prompts at the start and end of training, demonstrating that it does not degrade and in fact often improves due to the overall optimization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical auxiliary loss with measured gains

full rationale

The paper introduces AIR as a plug-in auxiliary loss that applies stop-gradient to verifiable (multiple-choice) prompts as anchors while regularizing open-ended variants toward them, then combines it with GRPO via heterogeneous grouping. Reported gains (12.71% in-distribution accuracy, 33.49% out-of-distribution consistency) are presented as empirical measurements on Safety, Moral Reasoning, and Math benchmarks rather than quantities derived from equations or parameters that reduce to the method's own inputs by construction. No self-definitional loops, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the described derivation; the central claim remains an independent empirical regularization technique whose validity rests on external benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method relies on standard concepts such as stop-gradient and group-based preference optimization without additional postulated mechanisms.

pith-pipeline@v0.9.0 · 5770 in / 991 out tokens · 25143 ms · 2026-05-21T05:32:25.433550+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

AIR replaces the symmetric variance penalty with an anchor-referenced penalty applied exclusively to non-anchor contexts: Ω_AIR(θ) = Σ (Rc(θ) − τ_acr(θ))² where τ_acr(θ) = sg[Rc_acr(θ)]
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify that symmetric invariance regularization fails in safety alignment... propose Anchor Invariance Regularization (AIR)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 19 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980
[3]

M. J. Kearns , title =

work page
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000
[6]

Suppressed for Anonymity , author=

work page
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959
[9]

NeurIPS , year=

Deep reinforcement learning from human preferences , author=. NeurIPS , year=

work page
[10]

NeurIPS , year=

Training language models to follow instructions with human feedback , author=. NeurIPS , year=

work page
[12]

NeurIPS , year=

Direct preference optimization: Your language model is secretly a reward model , author=. NeurIPS , year=

work page
[15]

NeurIPS , year=

Simpo: Simple preference optimization with a reference-free reward , author=. NeurIPS , year=

work page
[19]

IEEE TKDE , year=

Generalizing to unseen domains: A survey on domain generalization , author=. IEEE TKDE , year=

work page
[21]

ACL , year=

An invariant learning characterization of controlled text generation , author=. ACL , year=

work page
[22]

ICML , year=

Scaling laws for reward model overoptimization , author=. ICML , year=

work page
[23]

NeurIPS , year=

Scaling laws for reward model overoptimization in direct alignment algorithms , author=. NeurIPS , year=

work page
[26]

NeurIPS , year=

Rule based rewards for language model safety , author=. NeurIPS , year=

work page
[27]

Findings of NAACL , year=

Rewardbench: Evaluating reward models for language modeling , author=. Findings of NAACL , year=

work page
[28]

ACL , year=

Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization , author=. ACL , year=

work page
[29]

NeurIPS , year=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. NeurIPS , year=

work page
[30]

ICML , year=

Chatbot arena: An open platform for evaluating llms by human preference , author=. ICML , year=

work page
[33]

2024 , journal =

A Survey on LLM-as-a-Judge , author =. 2024 , journal =

work page 2024
[34]

ICML , year=

Out-of-distribution generalization via risk extrapolation (rex) , author=. ICML , year=

work page
[36]

NeurIPS , year=

Jailbroken: How does llm safety training fail? , author=. NeurIPS , year=

work page
[38]

NeurIPS , year=

Many-shot jailbreaking , author=. NeurIPS , year=

work page
[39]

Does safety training of llms generalize to semantically related natural prompts?, 2025

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? , author=. arXiv preprint arXiv:2412.03235 , year=

work page arXiv
[41]

NAACL , year=

Fake alignment: Are llms really aligned well? , author=. NAACL , year=

work page
[42]

NeurIPS , year=

Defining and characterizing reward gaming , author=. NeurIPS , year=

work page
[43]

NeurIPS , year=

Learning to summarize with human feedback , author=. NeurIPS , year=

work page
[45]

Concrete Problems in AI Safety

Concrete problems in AI safety , author=. arXiv preprint arXiv:1606.06565 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting , author=. arXiv preprint arXiv:2310.11324 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

NeurIPS , year=

Keeping llms aligned after fine-tuning: The crucial role of prompt templates , author=. NeurIPS , year=

work page
[49]

NeurIPS , year=

Evaluating the Moral Beliefs Encoded in LLMs , author=. NeurIPS , year=

work page
[54]

2025 , url =

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation , author =. 2025 , url =

work page 2025
[56]

2025 , url =

GPT-5 System Card , author =. 2025 , url =

work page 2025
[57]

Foundations and Trends in Privacy and Security , year=

Safety at scale: A comprehensive survey of large model and agent safety , author=. Foundations and Trends in Privacy and Security , year=

work page
[58]

NeurIPS , year=

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models , author=. NeurIPS , year=

work page
[59]

arXiv preprint arXiv:2601.01592 , year=

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs , author=. arXiv preprint arXiv:2601.01592 , year=

work page arXiv
[60]

ICLR , year=

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning , author=. ICLR , year=

work page
[61]

NAACL , year=

Flames: Benchmarking value alignment of llms in chinese , author=. NAACL , year=

work page
[62]

NeurIPS , year=

Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models , author=. NeurIPS , year=

work page
[63]

Findings of ACL , year=

A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos , author=. Findings of ACL , year=

work page
[64]

Findings of ACL , year=

Safeevalagent: Toward agentic and self-evolving safety evaluation of llms , author=. Findings of ACL , year=

work page
[65]

NeurIPS , year=

Safevid: Toward safety aligned video large multimodal models , author=. NeurIPS , year=

work page
[66]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[67]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025

AI, M. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/

work page 2025
[68]

Many-shot jailbreaking

Anil, C., Durmus, E., Panickssery, N., Sharma, M., Benton, J., Kundu, S., Batson, J., Tong, M., Mu, J., Ford, D., et al. Many-shot jailbreaking. NeurIPS, 2024

work page 2024
[69]

Invariant Risk Minimization

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[70]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[71]

Burns, P

Burns, C., Izmailov, P., Kirchner, J., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., et al. Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023

work page arXiv 2023
[72]

N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., et al. Chatbot arena: An open platform for evaluating llms by human preference. In ICML, 2024

work page 2024
[73]

J., Ham, J., and Park, K

Choe, Y. J., Ham, J., and Park, K. An empirical study of invariant risk minimization. arXiv preprint arXiv:2004.05007, 2020

work page arXiv 2004
[74]

F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. NeurIPS, 2017

work page 2017
[75]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[76]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[77]

KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

Scaling laws for reward model overoptimization

Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In ICML, 2023

work page 2023
[79]

Alignment faking in large language models

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[80]

A Survey on LLM-as-a-Judge

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, Y., and Guo, J. A survey on llm-as-a-judge. arXiv preprint arXiv: 2411.15594, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[81]

Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models

Gu, T., Zhou, Z., Huang, K., Liang, D., Wang, Y., Zhao, H., Yao, Y., Qiao, X., Wang, K., Yang, Y., et al. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. NeurIPS, 2024 b

work page 2024
[82]

ORPO: Monolithic Preference Optimization without Reference Model

Hong, J., Lee, N., and Thorne, J. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[83]

Flames: Benchmarking value alignment of llms in chinese

Huang, K., Liu, X., Guo, Q., Sun, T., Sun, J., Wang, Y., Zhou, Z., Wang, Y., Teng, Y., Qiu, X., et al. Flames: Benchmarking value alignment of llms in chinese. In NAACL, 2024

work page 2024
[84]

Out-of-distribution generalization via risk extrapolation (rex)

Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., and Courville, A. Out-of-distribution generalization via risk extrapolation (rex). In ICML, 2021

work page 2021
[85]

Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. Rewardbench: Evaluating reward models for language modeling. In Findings of NAACL, 2025

work page 2025
[86]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[87]

Safety at scale: A comprehensive survey of large model and agent safety

Ma, X., Gao, Y., Wang, Y., Wang, R., Wang, X., Sun, Y., Ding, Y., Xu, H., Chen, Y., Zhao, Y., et al. Safety at scale: A comprehensive survey of large model and agent safety. Foundations and Trends in Privacy and Security, 2026

work page 2026
[88]

Rethinking reward model evaluation through the lens of reward overoptimization

Mac Kim, S., Kang, D., Kwon, T., Chae, H., Lee, D., and Yeo, J. Rethinking reward model evaluation through the lens of reward overoptimization. In ACL, 2025

work page 2025
[89]

Simpo: Simple preference optimization with a reference-free reward

Meng, Y., Xia, M., and Chen, D. Simpo: Simple preference optimization with a reference-free reward. NeurIPS, 2024

work page 2024
[90]

K., Strouse, D., Sandholm, T., Salakhutdinov, R., Dragan, A

Moskovitz, T., Singh, A. K., Strouse, D., Sandholm, T., Salakhutdinov, R., Dragan, A. D., and McAleer, S. Confronting reward model overoptimization with constrained rlhf. arXiv preprint arXiv:2310.04373, 2023

work page arXiv 2023
[91]

Rule based rewards for language model safety

Mu, T., Helyar, A., Heidecke, J., Achiam, J., Vallone, A., Kivlichan, I., Lin, M., Beutel, A., Schulman, J., and Weng, L. Rule based rewards for language model safety. NeurIPS, 2024

work page 2024
[92]

WebGPT: Browser-assisted question-answering with human feedback

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[93]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, 2025. URL https://cdn.openai.com/gpt-5-system-card.pdf

work page 2025
[94]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. NeurIPS, 2022

work page 2022
[95]

D., Ermon, S., and Finn, C

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023

work page 2023
[96]

S., Hejna, J., Knox, B., Finn, C., and Niekum, S

Rafailov, R., Chittepu, Y., Park, R., Sikchi, H. S., Hejna, J., Knox, B., Finn, C., and Niekum, S. Scaling laws for reward model overoptimization in direct alignment algorithms. NeurIPS, 2024

work page 2024
[97]

Evaluating the moral beliefs encoded in llms

Scherrer, N., Shi, C., Feder, A., and Blei, D. Evaluating the moral beliefs encoded in llms. In NeurIPS, 2023

work page 2023
[98]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[99]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[100]

Defining and characterizing reward gaming

Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward gaming. NeurIPS, 2022

work page 2022
[101]

Jailbound: Jailbreaking internal safety boundaries of vision-language models

Song, J., Wang, Y., Li, J., Yu, R., Teng, Y., Ma, X., and Wang, Y. Jailbound: Jailbreaking internal safety boundaries of vision-language models. In NeurIPS, 2025

work page 2025

Showing first 80 references.

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000

[2] [2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980

[3] [3]

M. J. Kearns , title =

work page

[4] [4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983

[5] [5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000

[6] [6]

Suppressed for Anonymity , author=

work page

[7] [7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981

[8] [8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959

[9] [9]

NeurIPS , year=

Deep reinforcement learning from human preferences , author=. NeurIPS , year=

work page

[10] [10]

NeurIPS , year=

Training language models to follow instructions with human feedback , author=. NeurIPS , year=

work page

[11] [12]

NeurIPS , year=

Direct preference optimization: Your language model is secretly a reward model , author=. NeurIPS , year=

work page

[12] [15]

NeurIPS , year=

Simpo: Simple preference optimization with a reference-free reward , author=. NeurIPS , year=

work page

[13] [19]

IEEE TKDE , year=

Generalizing to unseen domains: A survey on domain generalization , author=. IEEE TKDE , year=

work page

[14] [21]

ACL , year=

An invariant learning characterization of controlled text generation , author=. ACL , year=

work page

[15] [22]

ICML , year=

Scaling laws for reward model overoptimization , author=. ICML , year=

work page

[16] [23]

NeurIPS , year=

Scaling laws for reward model overoptimization in direct alignment algorithms , author=. NeurIPS , year=

work page

[17] [26]

NeurIPS , year=

Rule based rewards for language model safety , author=. NeurIPS , year=

work page

[18] [27]

Findings of NAACL , year=

Rewardbench: Evaluating reward models for language modeling , author=. Findings of NAACL , year=

work page

[19] [28]

ACL , year=

Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization , author=. ACL , year=

work page

[20] [29]

NeurIPS , year=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. NeurIPS , year=

work page

[21] [30]

ICML , year=

Chatbot arena: An open platform for evaluating llms by human preference , author=. ICML , year=

work page

[22] [33]

2024 , journal =

A Survey on LLM-as-a-Judge , author =. 2024 , journal =

work page 2024

[23] [34]

ICML , year=

Out-of-distribution generalization via risk extrapolation (rex) , author=. ICML , year=

work page

[24] [36]

NeurIPS , year=

Jailbroken: How does llm safety training fail? , author=. NeurIPS , year=

work page

[25] [38]

NeurIPS , year=

Many-shot jailbreaking , author=. NeurIPS , year=

work page

[26] [39]

Does safety training of llms generalize to semantically related natural prompts?, 2025

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? , author=. arXiv preprint arXiv:2412.03235 , year=

work page arXiv

[27] [41]

NAACL , year=

Fake alignment: Are llms really aligned well? , author=. NAACL , year=

work page

[28] [42]

NeurIPS , year=

Defining and characterizing reward gaming , author=. NeurIPS , year=

work page

[29] [43]

NeurIPS , year=

Learning to summarize with human feedback , author=. NeurIPS , year=

work page

[30] [45]

Concrete Problems in AI Safety

Concrete problems in AI safety , author=. arXiv preprint arXiv:1606.06565 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [46]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting , author=. arXiv preprint arXiv:2310.11324 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [47]

NeurIPS , year=

Keeping llms aligned after fine-tuning: The crucial role of prompt templates , author=. NeurIPS , year=

work page

[33] [49]

NeurIPS , year=

Evaluating the Moral Beliefs Encoded in LLMs , author=. NeurIPS , year=

work page

[34] [54]

2025 , url =

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation , author =. 2025 , url =

work page 2025

[35] [56]

2025 , url =

GPT-5 System Card , author =. 2025 , url =

work page 2025

[36] [57]

Foundations and Trends in Privacy and Security , year=

Safety at scale: A comprehensive survey of large model and agent safety , author=. Foundations and Trends in Privacy and Security , year=

work page

[37] [58]

NeurIPS , year=

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models , author=. NeurIPS , year=

work page

[38] [59]

arXiv preprint arXiv:2601.01592 , year=

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs , author=. arXiv preprint arXiv:2601.01592 , year=

work page arXiv

[39] [60]

ICLR , year=

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning , author=. ICLR , year=

work page

[40] [61]

NAACL , year=

Flames: Benchmarking value alignment of llms in chinese , author=. NAACL , year=

work page

[41] [62]

NeurIPS , year=

Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models , author=. NeurIPS , year=

work page

[42] [63]

Findings of ACL , year=

A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos , author=. Findings of ACL , year=

work page

[43] [64]

Findings of ACL , year=

Safeevalagent: Toward agentic and self-evolving safety evaluation of llms , author=. Findings of ACL , year=

work page

[44] [65]

NeurIPS , year=

Safevid: Toward safety aligned video large multimodal models , author=. NeurIPS , year=

work page

[45] [66]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [67]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025

AI, M. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/

work page 2025

[47] [68]

Many-shot jailbreaking

Anil, C., Durmus, E., Panickssery, N., Sharma, M., Benton, J., Kundu, S., Batson, J., Tong, M., Mu, J., Ford, D., et al. Many-shot jailbreaking. NeurIPS, 2024

work page 2024

[48] [69]

Invariant Risk Minimization

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[49] [70]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[50] [71]

Burns, P

Burns, C., Izmailov, P., Kirchner, J., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., et al. Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023

work page arXiv 2023

[51] [72]

N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., et al. Chatbot arena: An open platform for evaluating llms by human preference. In ICML, 2024

work page 2024

[52] [73]

J., Ham, J., and Park, K

Choe, Y. J., Ham, J., and Park, K. An empirical study of invariant risk minimization. arXiv preprint arXiv:2004.05007, 2020

work page arXiv 2004

[53] [74]

F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. NeurIPS, 2017

work page 2017

[54] [75]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[55] [76]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [77]

KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [78]

Scaling laws for reward model overoptimization

Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In ICML, 2023

work page 2023

[58] [79]

Alignment faking in large language models

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [80]

A Survey on LLM-as-a-Judge

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, Y., and Guo, J. A survey on llm-as-a-judge. arXiv preprint arXiv: 2411.15594, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [81]

Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models

Gu, T., Zhou, Z., Huang, K., Liang, D., Wang, Y., Zhao, H., Yao, Y., Qiao, X., Wang, K., Yang, Y., et al. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. NeurIPS, 2024 b

work page 2024

[61] [82]

ORPO: Monolithic Preference Optimization without Reference Model

Hong, J., Lee, N., and Thorne, J. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [83]

Flames: Benchmarking value alignment of llms in chinese

Huang, K., Liu, X., Guo, Q., Sun, T., Sun, J., Wang, Y., Zhou, Z., Wang, Y., Teng, Y., Qiu, X., et al. Flames: Benchmarking value alignment of llms in chinese. In NAACL, 2024

work page 2024

[63] [84]

Out-of-distribution generalization via risk extrapolation (rex)

Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., and Courville, A. Out-of-distribution generalization via risk extrapolation (rex). In ICML, 2021

work page 2021

[64] [85]

Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. Rewardbench: Evaluating reward models for language modeling. In Findings of NAACL, 2025

work page 2025

[65] [86]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[66] [87]

Safety at scale: A comprehensive survey of large model and agent safety

Ma, X., Gao, Y., Wang, Y., Wang, R., Wang, X., Sun, Y., Ding, Y., Xu, H., Chen, Y., Zhao, Y., et al. Safety at scale: A comprehensive survey of large model and agent safety. Foundations and Trends in Privacy and Security, 2026

work page 2026

[67] [88]

Rethinking reward model evaluation through the lens of reward overoptimization

Mac Kim, S., Kang, D., Kwon, T., Chae, H., Lee, D., and Yeo, J. Rethinking reward model evaluation through the lens of reward overoptimization. In ACL, 2025

work page 2025

[68] [89]

Simpo: Simple preference optimization with a reference-free reward

Meng, Y., Xia, M., and Chen, D. Simpo: Simple preference optimization with a reference-free reward. NeurIPS, 2024

work page 2024

[69] [90]

K., Strouse, D., Sandholm, T., Salakhutdinov, R., Dragan, A

Moskovitz, T., Singh, A. K., Strouse, D., Sandholm, T., Salakhutdinov, R., Dragan, A. D., and McAleer, S. Confronting reward model overoptimization with constrained rlhf. arXiv preprint arXiv:2310.04373, 2023

work page arXiv 2023

[70] [91]

Rule based rewards for language model safety

Mu, T., Helyar, A., Heidecke, J., Achiam, J., Vallone, A., Kivlichan, I., Lin, M., Beutel, A., Schulman, J., and Weng, L. Rule based rewards for language model safety. NeurIPS, 2024

work page 2024

[71] [92]

WebGPT: Browser-assisted question-answering with human feedback

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[72] [93]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, 2025. URL https://cdn.openai.com/gpt-5-system-card.pdf

work page 2025

[73] [94]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. NeurIPS, 2022

work page 2022

[74] [95]

D., Ermon, S., and Finn, C

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023

work page 2023

[75] [96]

S., Hejna, J., Knox, B., Finn, C., and Niekum, S

Rafailov, R., Chittepu, Y., Park, R., Sikchi, H. S., Hejna, J., Knox, B., Finn, C., and Niekum, S. Scaling laws for reward model overoptimization in direct alignment algorithms. NeurIPS, 2024

work page 2024

[76] [97]

Evaluating the moral beliefs encoded in llms

Scherrer, N., Shi, C., Feder, A., and Blei, D. Evaluating the moral beliefs encoded in llms. In NeurIPS, 2023

work page 2023

[77] [98]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[78] [99]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[79] [100]

Defining and characterizing reward gaming

Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward gaming. NeurIPS, 2022

work page 2022

[80] [101]

Jailbound: Jailbreaking internal safety boundaries of vision-language models

Song, J., Wang, Y., Li, J., Yu, R., Teng, Y., Ma, X., and Wang, Y. Jailbound: Jailbreaking internal safety boundaries of vision-language models. In NeurIPS, 2025

work page 2025