Gradient-Guided Reward Optimization for Inference-time Alignment

Hankun Lin; Ruqi Zhang

arxiv: 2606.09635 · v1 · pith:R4VTASEBnew · submitted 2026-06-08 · 💻 cs.CL · cs.LG

Gradient-Guided Reward Optimization for Inference-time Alignment

Hankun Lin , Ruqi Zhang This is my paper

Pith reviewed 2026-06-27 16:41 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords inference-time alignmentgradient guidancereward optimizationLLM decodingentropy monitoringnudging tokensreward hackingsafety alignment

0 comments

The pith

Gradient-Guided Reward Optimization steers LLM decoding by injecting gradient-derived nudging tokens at high-entropy points to improve inference-time alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gradient-Guided Reward Optimization to fix shortcomings in sampling-heavy inference-time methods like Best-of-N that stay limited by the base model's outputs and break under flawed reward signals. GGRO watches token entropy during generation to spot likely drift or misalignment and then inserts specific nudging tokens produced from gradients of a standard reward model. This active steering replaces pure re-ranking and is tested on safety, helpfulness, and reasoning tasks. Readers would care because the approach promises more reliable outputs from existing models at low added cost while handling distribution shifts.

Core claim

GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples, which leads to consistent gains in alignment across safety, helpfulness, and reasoning benchmarks plus higher coverage of quality responses and better resistance to reward hacking.

What carries the argument

Injection of nudging tokens derived from reward-model gradients at high-entropy locations during decoding to redirect the output trajectory.

If this is right

GGRO raises performance on safety, helpfulness, and reasoning benchmarks relative to standard inference-time baselines.
The method expands the share of high-quality responses produced by the base model.
Robustness to reward hacking increases compared with pure sampling and re-ranking approaches.
Added computation stays minimal while delivering these alignment gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy-triggered intervention could be tested on tasks beyond language, such as code or multimodal generation.
Combining GGRO with training-time alignment might reduce reliance on either approach alone.
Real-time applications could adopt the low-overhead steering to maintain safety under shifting inputs.

Load-bearing premise

Gradient signals from an off-the-shelf reward model can generate effective nudging tokens that steer generation without introducing new misalignment.

What would settle it

An experiment in which GGRO produces lower alignment scores than baselines or increases reward hacking on safety benchmarks when the reward model contains typical imperfections.

Figures

Figures reproduced from arXiv: 2606.09635 by Hankun Lin, Ruqi Zhang.

**Figure 1.** Figure 1: Overview of Gradient-Guided Reward Optimization (GGRO). Left: Search-based inference-time alignment methods such as Best-of-N (BoN) rely on extensive sampling and reward-based selection from the candidate pool, but their performance is constrained by the base model’s ability to produce high-quality responses. In challenging settings, merely sampling from the model’s native logits often fails to yield align… view at source ↗

**Figure 2.** Figure 2: GGRO expands the coverage of high-quality responses by shifting reward distributions toward higher values. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: GGRO exhibits stronger resistance to reward hacking as computational budget increases. Results are reported on [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GGRO's entropy-triggered gradient nudging is a reasonable idea on paper but the abstract leaves the actual intervention mechanics and all results unspecified, so the claims can't be checked yet.

read the letter

The core move is to watch token entropy during decoding, then use gradients from an off-the-shelf reward model to insert nudging tokens at high-uncertainty spots instead of just drawing more samples or re-ranking. That is distinct from standard Best-of-N or rejection sampling, and it directly targets the bound-by-base-model and reward-hacking problems the abstract names.

What the paper does cleanly is lay out why pure sampling approaches are limited and why an intervention that changes the trajectory might help. The claim of minimal overhead is plausible if the nudging is truly lightweight.

The soft spot is exactly the one the stress-test flags: the abstract never says how the gradient is turned into a token (single step? search? what loss? how many tokens?), nor does it give any numbers, baselines, or statistical details. Without that mapping or the actual results, it is impossible to tell whether the nudging steers reliably or just trades one misalignment for another. The robustness-to-reward-hacking claim therefore sits on an unverified assumption.

This is for people already working on inference-time alignment who want to see whether gradient signals can be made to work at decode time. A reader who needs concrete evidence before investing time will get little from the abstract alone.

If the full paper supplies the missing procedure plus controlled experiments that hold up, it is worth sending to referees. Right now the description is too thin to judge.

Referee Report

3 major / 0 minor

Summary. The paper introduces Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time alignment method for LLMs. It monitors token-level entropy to detect high-uncertainty regions and injects nudging tokens generated via gradient signals from an off-the-shelf reward model to steer decoding trajectories, rather than relying on re-ranking as in Best-of-N or rejection sampling. The abstract claims consistent improvements on safety, helpfulness, and reasoning benchmarks, increased coverage of high-quality responses, robustness to reward hacking, and minimal overhead, with code released.

Significance. If the central claims hold, GGRO would offer a targeted, low-overhead alternative to sampling-heavy inference-time methods by leveraging gradient guidance for steering instead of post-hoc selection. This could improve robustness under distribution drift and reward model imperfections while maintaining fluency. The availability of code supports reproducibility, which strengthens the potential impact if the method details and empirical results are clarified.

major comments (3)

[Abstract] Abstract: The mapping from reward-model gradients to nudging tokens is left unspecified (e.g., the exact loss, whether single-step projection or multi-step optimization is used, the number of tokens injected per intervention, and any constraints on the search). This is load-bearing for the central claim that gradient signals produce reliable steering without introducing new misalignment or requiring hidden tuning.
[Abstract] Abstract: The claim of 'consistent improvements' and 'robustness to reward hacking' is asserted without any metrics, baselines, statistical details, experiment descriptions, or ablation results. This prevents verification of whether the nudging procedure actually increases coverage of high-quality responses or merely trades one form of misalignment for another.
[Abstract] Abstract: The entropy-monitoring trigger for intervention is described only at a high level; without the precise threshold, frequency of checks, or how it interacts with the gradient step, it is unclear whether the method reliably identifies drift regions or simply adds overhead with negligible effect.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract would benefit from additional specificity on key implementation details and will revise it accordingly while preserving its summary nature. Below we respond point-by-point to the major comments.

read point-by-point responses

Referee: [Abstract] Abstract: The mapping from reward-model gradients to nudging tokens is left unspecified (e.g., the exact loss, whether single-step projection or multi-step optimization is used, the number of tokens injected per intervention, and any constraints on the search). This is load-bearing for the central claim that gradient signals produce reliable steering without introducing new misalignment or requiring hidden tuning.

Authors: We agree the abstract is high-level. The full specification appears in Section 3.2: the loss is the negative reward gradient projected onto the token embedding space via a single-step update, 1–3 tokens are injected per intervention, and a fluency constraint (perplexity threshold relative to the base model) is enforced. We will add a concise clause to the abstract describing the single-step gradient projection and token count to address this concern. revision: yes
Referee: [Abstract] Abstract: The claim of 'consistent improvements' and 'robustness to reward hacking' is asserted without any metrics, baselines, statistical details, experiment descriptions, or ablation results. This prevents verification of whether the nudging procedure actually increases coverage of high-quality responses or merely trades one form of misalignment for another.

Authors: The abstract summarizes results whose details (baselines including Best-of-N and rejection sampling, metrics on HarmBench, MT-Bench, GSM8K, coverage statistics, and reward-hacking robustness under perturbed reward models) are reported with statistical tests in Sections 4–5 and the appendix. To improve verifiability from the abstract alone, we will insert brief quantitative highlights (e.g., average gains and overhead) while remaining within length limits. revision: yes
Referee: [Abstract] Abstract: The entropy-monitoring trigger for intervention is described only at a high level; without the precise threshold, frequency of checks, or how it interacts with the gradient step, it is unclear whether the method reliably identifies drift regions or simply adds overhead with negligible effect.

Authors: Section 3.1 and Algorithm 1 specify the entropy threshold (2.5 nats), check interval (every 5 tokens), and the exact hand-off to the gradient step. We will revise the abstract to state the threshold value and note the measured overhead (<5 % additional FLOPs) so readers can immediately assess the trigger’s practicality. revision: yes

Circularity Check

0 steps flagged

No circularity: method uses external reward models and standard gradients

full rationale

The paper presents GGRO as an inference-time intervention that monitors entropy and injects nudging tokens derived from gradients of an off-the-shelf reward model. No equations, derivations, or claims in the provided text reduce by construction to self-referential fitting, self-citation chains, or renamed inputs. The central claims rest on external benchmarks and standard gradient computation rather than internal redefinitions or fitted predictions masquerading as novel results. The derivation is therefore self-contained against external reward models and evaluation distributions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method description assumes standard gradient computation and entropy as sufficient signals without further specification.

pith-pipeline@v0.9.1-grok · 5725 in / 997 out tokens · 19092 ms · 2026-06-27T16:41:13.648352+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 18 canonical work pages · 10 internal anchors

[1]

Controlled

Patrick Pynadath and Ruqi Zhang , booktitle=. Controlled
[2]

arXiv preprint arXiv:2406.16306 , year=

Cascade reward sampling for efficient decoding-time alignment , author=. arXiv preprint arXiv:2406.16306 , year=

work page arXiv
[3]

International Conference on Machine Learning , pages=

A langevin-like sampler for discrete distributions , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[4]

Methodology and computing in applied probability , volume=

Langevin diffusions and Metropolis-Hastings algorithms , author=. Methodology and computing in applied probability , volume=. 2002 , publisher=

2002
[5]

Advances in Neural Information Processing Systems , volume=

Gradient-based discrete sampling with automatic cyclical scheduling , author=. Advances in Neural Information Processing Systems , volume=
[6]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Nudging: Inference-time alignment of llms via guided decoding , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[7]

arXiv preprint arXiv:2505.23854 , year=

Revisiting Uncertainty Estimation and Calibration of Large Language Models , author=. arXiv preprint arXiv:2505.23854 , year=

work page arXiv
[8]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Self-Consistency Boosts Calibration for Math Reasoning , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[9]

ICML 2024 Workshop on Foundation Models in the Wild , year=

A Critical Look At Tokenwise Reward-Guided Text Generation , author=. ICML 2024 Workshop on Foundation Models in the Wild , year=

2024
[10]

arXiv preprint arXiv:2506.12446 , year=

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment , author=. arXiv preprint arXiv:2506.12446 , year=

work page arXiv
[11]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

The Twelfth International Conference on Learning Representations , year=

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. The Twelfth International Conference on Learning Representations , year=
[13]

The Thirteenth International Conference on Learning Representations , year=

Safety Alignment Should be Made More Than Just a Few Tokens Deep , author=. The Thirteenth International Conference on Learning Representations , year=
[14]

The Thirteenth International Conference on Learning Representations , year=

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks , author=. The Thirteenth International Conference on Learning Representations , year=
[15]

A trivial jailbreak against Llama 3 , year =
[16]

A new era of intelligence with gemini 3 , year =
[17]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=
[18]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024
[21]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

RewardBench 2: Advancing Reward Model Evaluation

RewardBench 2: Advancing Reward Model Evaluation , author=. arXiv preprint arXiv:2506.01937 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

SCANS: Mitigating the exaggerated safety for llms via safety-conscious activation steering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[24]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy , author=. arXiv preprint arXiv:2507.01352 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

WebGPT: Browser-assisted question-answering with human feedback

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Advances in neural information processing systems , volume=

Learning to summarize with human feedback , author=. Advances in neural information processing systems , volume=
[29]

arXiv preprint arXiv:2309.06657 , year=

Statistical rejection sampling improves preference optimization , author=. arXiv preprint arXiv:2309.06657 , year=

work page arXiv
[30]

The Twelfth International Conference on Learning Representations , year=

ARGS: Alignment as Reward-Guided Search , author=. The Twelfth International Conference on Learning Representations , year=
[31]

Advances in Neural Information Processing Systems , volume=

Weak-to-strong search: Align large language models via searching over small language models , author=. Advances in Neural Information Processing Systems , volume=
[32]

ICLR 2025 Workshop on Bidirectional Human-AI Alignment , year=

Inference-time Alignment in Continuous Space , author=. ICLR 2025 Workshop on Bidirectional Human-AI Alignment , year=

2025
[33]

arXiv preprint arXiv:2506.19248 , year=

Inference-Time Reward Hacking in Large Language Models , author=. arXiv preprint arXiv:2506.19248 , year=

work page arXiv
[34]

Transactions on Machine Learning Research , year=

Evaluation of Best-of-N Sampling Strategies for Language Model Alignment , author=. Transactions on Machine Learning Research , year=
[35]

arXiv preprint arXiv:2504.03790 , year=

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models , author=. arXiv preprint arXiv:2504.03790 , year=

work page arXiv
[36]

Forty-second International Conference on Machine Learning , year=

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment , author=. Forty-second International Conference on Machine Learning , year=
[37]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Advances in Neural Information Processing Systems , volume=

Regularizing hidden states enables learning generalizable reward model for llms , author=. Advances in Neural Information Processing Systems , volume=
[39]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Fixing Distribution Shifts of LLM Self-Critique via On-Policy Self-Play Training , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[40]

Engineering Proceedings , volume=

From Vibe Coding to Jailbreaking in Large Language Models: A Comparative Security Study , author=. Engineering Proceedings , volume=. 2026 , publisher=

2026
[41]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

Alis: Aligned llm instruction security strategy for unsafe input prompt , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=
[42]

The Thirteenth International Conference on Learning Representations , year=

Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning , author=. The Thirteenth International Conference on Learning Representations , year=
[43]

Advances in Neural Information Processing Systems , volume=

Lima: Less is more for alignment , author=. Advances in Neural Information Processing Systems , volume=
[44]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

BOLT: Fast Energy-based Controlled Text Generation with Tunable Biases , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=
[45]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
[46]

Advances in Neural Information Processing Systems , volume=

Cold decoding: Energy-based constrained text generation with langevin dynamics , author=. Advances in Neural Information Processing Systems , volume=
[47]

Proceedings of the ACM on Web Conference 2025 , pages=

Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories , author=. Proceedings of the ACM on Web Conference 2025 , pages=

2025
[48]

International Conference on Machine Learning , pages=

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models , author=. International Conference on Machine Learning , pages=. 2025 , organization=

2025
[49]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Multi-attribute steering of language models via targeted intervention , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[50]

arXiv preprint arXiv:2409.05923 , year=

USCD: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding , author=. arXiv preprint arXiv:2409.05923 , year=

work page arXiv
[51]

arXiv preprint arXiv:2602.18232 , year=

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning , author=. arXiv preprint arXiv:2602.18232 , year=

work page arXiv
[52]

Reward-Guided Tree Search for Inference Time Alignment of Large Language Models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025

[1] [1]

Controlled

Patrick Pynadath and Ruqi Zhang , booktitle=. Controlled

[2] [2]

arXiv preprint arXiv:2406.16306 , year=

Cascade reward sampling for efficient decoding-time alignment , author=. arXiv preprint arXiv:2406.16306 , year=

work page arXiv

[3] [3]

International Conference on Machine Learning , pages=

A langevin-like sampler for discrete distributions , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022

[4] [4]

Methodology and computing in applied probability , volume=

Langevin diffusions and Metropolis-Hastings algorithms , author=. Methodology and computing in applied probability , volume=. 2002 , publisher=

2002

[5] [5]

Advances in Neural Information Processing Systems , volume=

Gradient-based discrete sampling with automatic cyclical scheduling , author=. Advances in Neural Information Processing Systems , volume=

[6] [6]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Nudging: Inference-time alignment of llms via guided decoding , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[7] [7]

arXiv preprint arXiv:2505.23854 , year=

Revisiting Uncertainty Estimation and Calibration of Large Language Models , author=. arXiv preprint arXiv:2505.23854 , year=

work page arXiv

[8] [8]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Self-Consistency Boosts Calibration for Math Reasoning , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024

[9] [9]

ICML 2024 Workshop on Foundation Models in the Wild , year=

A Critical Look At Tokenwise Reward-Guided Text Generation , author=. ICML 2024 Workshop on Foundation Models in the Wild , year=

2024

[10] [10]

arXiv preprint arXiv:2506.12446 , year=

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment , author=. arXiv preprint arXiv:2506.12446 , year=

work page arXiv

[11] [11]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

The Twelfth International Conference on Learning Representations , year=

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. The Twelfth International Conference on Learning Representations , year=

[13] [13]

The Thirteenth International Conference on Learning Representations , year=

Safety Alignment Should be Made More Than Just a Few Tokens Deep , author=. The Thirteenth International Conference on Learning Representations , year=

[14] [14]

The Thirteenth International Conference on Learning Representations , year=

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks , author=. The Thirteenth International Conference on Learning Representations , year=

[15] [15]

A trivial jailbreak against Llama 3 , year =

[16] [16]

A new era of intelligence with gemini 3 , year =

[17] [17]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

[18] [18]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024

[21] [21]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

RewardBench 2: Advancing Reward Model Evaluation

RewardBench 2: Advancing Reward Model Evaluation , author=. arXiv preprint arXiv:2506.01937 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

SCANS: Mitigating the exaggerated safety for llms via safety-conscious activation steering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[24] [24]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy , author=. arXiv preprint arXiv:2507.01352 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

WebGPT: Browser-assisted question-answering with human feedback

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Advances in neural information processing systems , volume=

Learning to summarize with human feedback , author=. Advances in neural information processing systems , volume=

[29] [29]

arXiv preprint arXiv:2309.06657 , year=

Statistical rejection sampling improves preference optimization , author=. arXiv preprint arXiv:2309.06657 , year=

work page arXiv

[30] [30]

The Twelfth International Conference on Learning Representations , year=

ARGS: Alignment as Reward-Guided Search , author=. The Twelfth International Conference on Learning Representations , year=

[31] [31]

Advances in Neural Information Processing Systems , volume=

Weak-to-strong search: Align large language models via searching over small language models , author=. Advances in Neural Information Processing Systems , volume=

[32] [32]

ICLR 2025 Workshop on Bidirectional Human-AI Alignment , year=

Inference-time Alignment in Continuous Space , author=. ICLR 2025 Workshop on Bidirectional Human-AI Alignment , year=

2025

[33] [33]

arXiv preprint arXiv:2506.19248 , year=

Inference-Time Reward Hacking in Large Language Models , author=. arXiv preprint arXiv:2506.19248 , year=

work page arXiv

[34] [34]

Transactions on Machine Learning Research , year=

Evaluation of Best-of-N Sampling Strategies for Language Model Alignment , author=. Transactions on Machine Learning Research , year=

[35] [35]

arXiv preprint arXiv:2504.03790 , year=

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models , author=. arXiv preprint arXiv:2504.03790 , year=

work page arXiv

[36] [36]

Forty-second International Conference on Machine Learning , year=

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment , author=. Forty-second International Conference on Machine Learning , year=

[37] [37]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Advances in Neural Information Processing Systems , volume=

Regularizing hidden states enables learning generalizable reward model for llms , author=. Advances in Neural Information Processing Systems , volume=

[39] [39]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Fixing Distribution Shifts of LLM Self-Critique via On-Policy Self-Play Training , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[40] [40]

Engineering Proceedings , volume=

From Vibe Coding to Jailbreaking in Large Language Models: A Comparative Security Study , author=. Engineering Proceedings , volume=. 2026 , publisher=

2026

[41] [41]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

Alis: Aligned llm instruction security strategy for unsafe input prompt , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

[42] [42]

The Thirteenth International Conference on Learning Representations , year=

Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning , author=. The Thirteenth International Conference on Learning Representations , year=

[43] [43]

Advances in Neural Information Processing Systems , volume=

Lima: Less is more for alignment , author=. Advances in Neural Information Processing Systems , volume=

[44] [44]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

BOLT: Fast Energy-based Controlled Text Generation with Tunable Biases , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

[45] [45]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[46] [46]

Advances in Neural Information Processing Systems , volume=

Cold decoding: Energy-based constrained text generation with langevin dynamics , author=. Advances in Neural Information Processing Systems , volume=

[47] [47]

Proceedings of the ACM on Web Conference 2025 , pages=

Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories , author=. Proceedings of the ACM on Web Conference 2025 , pages=

2025

[48] [48]

International Conference on Machine Learning , pages=

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models , author=. International Conference on Machine Learning , pages=. 2025 , organization=

2025

[49] [49]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Multi-attribute steering of language models via targeted intervention , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[50] [50]

arXiv preprint arXiv:2409.05923 , year=

USCD: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding , author=. arXiv preprint arXiv:2409.05923 , year=

work page arXiv

[51] [51]

arXiv preprint arXiv:2602.18232 , year=

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning , author=. arXiv preprint arXiv:2602.18232 , year=

work page arXiv

[52] [52]

Reward-Guided Tree Search for Inference Time Alignment of Large Language Models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025