arxiv: 2605.14746 · v1 · pith:M7SRINVXnew · submitted 2026-05-14 · 💻 cs.LG

Selective Safety Steering via Value-Filtered Decoding

Pith reviewed 2026-06-30 21:01 UTC · model grok-4.3

classification 💻 cs.LG
keywords safety steering · value-filtered decoding · LLM alignment · test-time intervention · false intervention bound · selective steering · decoding-time methods · reward-guided sampling

The pith

Value-filtered decoding steers LLMs to safer outputs while bounding the chance of altering safe generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a decoding-time steering method that filters tokens using a value-based safety criterion to intervene only when necessary. It derives an explicit probabilistic bound on false interventions, governed by one tunable threshold that trades higher rates of unnecessary changes for greater safety. Experiments on multiple datasets demonstrate improved balances between safety, helpfulness, and similarity to the base model relative to prior steering techniques. The approach targets the issue of over-intervention that distorts fluency, style, and coherence in generations that would have been safe anyway. A single hyperparameter gives practitioners direct control over the intervention probability bound.

Core claim

By filtering tokens at each decoding step according to a safety value function, the method restricts steering to paths likely to violate constraints and supplies a provable upper bound on the probability that an originally safe generation is altered, with the bound set directly by the choice of threshold.

What carries the argument

Value-filtered decoding: a token-level filter during sampling that rejects continuation tokens whose safety value falls below a threshold, thereby enforcing the false-intervention bound.
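The filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name `value_filter`, the renormalization step, and the fallback used when no token passes the threshold are all assumptions made here for the sake of a runnable example.

```python
from typing import List, Sequence


def value_filter(
    probs: Sequence[float], values: Sequence[float], threshold: float
) -> List[float]:
    """Zero out tokens whose safety value is below `threshold`, then renormalize.

    Assumption (not from the paper): if no token keeps positive mass, fall
    back to a point mass on the token with the highest safety value.
    """
    if len(probs) != len(values):
        raise ValueError("probs and values must have the same length")
    # Keep the base probability only for tokens that pass the safety filter.
    kept = [p if v >= threshold else 0.0 for p, v in zip(probs, values)]
    mass = sum(kept)
    if mass == 0.0:
        best = max(range(len(values)), key=lambda i: values[i])
        return [1.0 if i == best else 0.0 for i in range(len(values))]
    return [p / mass for p in kept]


# Hypothetical three-token vocabulary with made-up safety values.
base_probs = [0.5, 0.3, 0.2]
safety_values = [0.9, 0.2, 0.7]
filtered = value_filter(base_probs, safety_values, threshold=0.5)
```

In this sketch, a step where every token passes the threshold leaves the base distribution unchanged, which is the sense in which the filter intervenes only when needed.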

If this is right

  • Practitioners gain explicit control over the safety-helpfulness trade-off through one hyperparameter.
  • The method preserves higher similarity to the base model's output distribution than existing steering baselines.
  • Safety improves while helpfulness and fluency degrade less than with prior decoding-time interventions.
  • The bound holds under the experimental conditions across the tested datasets and safety criteria.
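One way to make "similarity to the base model's output distribution" concrete at a single decoding step is the total variation (TV) distance, which Figure 1 reports in its worst-case form. The sketch below computes the standard TV distance between two token distributions; the example distributions are made up, and how the paper aggregates per-step distances is not shown on this page.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """TV(p, q) = 0.5 * sum_i |p_i - q_i| for distributions on one vocabulary."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical base and steered next-token distributions.
base = [0.5, 0.3, 0.2]
steered = [0.7, 0.0, 0.3]
tv = total_variation(base, steered)
```

A TV distance of 0 means the steered step is identical to the base step; values near 1 mean the two distributions place their mass on almost disjoint tokens.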

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit bound could support certified safety guarantees in regulated deployment settings.
  • Replacing the token-level value with a short-horizon rollout value might further reduce false interventions.
  • The filtering step could be combined with training-time alignment to compound safety gains without extra decoding cost.

Load-bearing premise

The per-token safety value can reliably flag whether selecting that token will produce an unsafe full response.

What would settle it

An experiment that counts how often the base model would have produced a safe output yet the method still intervenes, then checks whether that observed rate exceeds the bound implied by the chosen threshold.
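The check described above can be sketched as a simple counting routine. Everything here is illustrative: the record format, the function name `false_intervention_rate`, and the value `alpha` standing in for the bound implied by the threshold are assumptions, and the sketch presumes each prompt has a paired base-model generation with a safety label.

```python
from typing import Iterable, Tuple


def false_intervention_rate(records: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of base-safe generations on which the method intervened.

    Each record is (base_was_safe, method_intervened) for one prompt.
    """
    flags = [intervened for base_safe, intervened in records if base_safe]
    if not flags:
        raise ValueError("no base-safe generations to evaluate")
    return sum(flags) / len(flags)


# Made-up outcomes for five prompts; the last one was unsafe under the base model.
records = [(True, False), (True, True), (True, False), (True, False), (False, True)]
rate = false_intervention_rate(records)

alpha = 0.3  # hypothetical bound implied by the chosen threshold
within_bound = rate <= alpha
```

Because the observed rate is an estimate from a finite sample, a fair comparison against the bound would also report a confidence interval rather than a single point value.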

Figures

Figures reproduced from arXiv: 2605.14746 by Bat-Sheva Einbinder, Hen Davidov, Yaniv Romano, Yarin Gal, Yee Whye Teh.

Figure 1: Worst-case TV distance (lower is better) for our policy …
Figure 2: Numerical validation of Proposition 3.8 on a bimodal-skewed base token distribution. Actual gap and …
Figure 3: Trade-off curves across BeaverTails (a) and HH-RLHF (b). Left: safety versus cosine similarity to the base model output. Right: safety versus helpfulness. Each point is averaged over 500 prompts and error bars denote ±1 standard error of the mean across prompts. Within each curve each point corresponds to a different hyperparameter setting in the order listed from bottom to top: α ∈ (0.05, 0.25, 0.45, 0.65…
Figure 4: Helpfulness and similarity vs. harmlessness trade-offs for the …
Figure 5: The five base distributions used for validation. Each panel shows the base policy …
Figure 6: Phase structure of Proposition 3.8 in the …
Figure 7: Tightness sweep over η at fixed c = 0.55. Left: actual gap under the sign-anti adversary, theoretical lower bound 2η(1 − M_t − P_t), and gap under random noise. Middle: condition components M_t (false-acceptance rate) and P_t (above-threshold Gibbs mass) and their sum, all under the sign-anti adversary. Right: empirical multiplier λ̂_t as a function of η, compared to the oracle λ_t (dotted).
Figure 8: Trade-off curves with PKU-SafeRLHF dataset. All other details are as in …
Figure 9: Computation-time comparison (tokens/sec) across hyperparameter settings for …
Figure 10: Computation-time comparison (tokens/sec) across hyperparameter settings for …
Figure 11: Computation-time comparison (tokens/sec) across hyperparameter settings for …
Figure 12: Rate of interventions in safe outputs as a function of …
Figure 13: Rate of interventions in safe outputs as a function of …
Figure 14: Rate of interventions in safe outputs as a function of …
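The Figure 3 caption mentions two quantities that are easy to state in code: cosine similarity to the base model output, and a per-prompt mean with ±1 standard error of the mean. The sketch below shows the textbook definitions only; it assumes outputs have already been mapped to fixed-length embedding vectors, and the embedding model and the example numbers are not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_and_sem(xs: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (uses the n - 1 variance)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values to estimate the SEM")
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, math.sqrt(var / n)


# Made-up embeddings and per-prompt scores, for illustration only.
sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
mean, sem = mean_and_sem([1.0, 2.0, 3.0])
```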
Original abstract

While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model's sampling policy at decoding time using a safety reward. However, existing decoding-time steering methods often intervene unnecessarily, modifying generations that would have been safe under the base model. Such unnecessary interventions are undesirable, as they can distort key properties of the base model such as helpfulness, fluency, style, and coherence. We propose a new test-time steering method designed to reduce such unnecessary interventions while improving the safety of unsafe responses. Our approach filters tokens using a value-based safety criterion and provides an explicit bound on the probability of false interventions. A single threshold hyperparameter controls this bound, allowing practitioners to trade off higher rates of unnecessary intervention for better output safety. Across multiple datasets and experiments, we show that our value-filtered decoding method outperforms existing baselines, achieving better trade-offs between safety, helpfulness, and similarity to the base model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper proposes value-filtered decoding, a test-time steering method for LLMs that applies a value-based safety criterion to filter tokens during decoding. It derives an explicit, tunable bound on the probability of false interventions controlled by a single threshold hyperparameter, with the goal of reducing unnecessary interventions while improving safety. Experiments across multiple datasets show that the method achieves better trade-offs between safety, helpfulness, and similarity to the base model than existing baselines.

Significance. If the bound derivation holds and the per-token safety values are reliably computable, the approach offers a principled mechanism for selective intervention that preserves base-model properties more effectively than prior decoding-time methods. The explicit bound and single-hyperparameter control constitute a clear strength, providing practitioners with a direct way to manage the safety-fidelity trade-off. This could meaningfully advance practical safety steering techniques in aligned LLMs.

minor comments (4)
  1. [§3.1] The definition of the value-based safety criterion should include a brief statement on how the value function is obtained or approximated at inference time, as this is needed to reproduce the token-level filtering step.
  2. [Figure 2] The caption and axis labels do not indicate whether the plotted curves are averaged over multiple random seeds or single runs; adding error bars or noting the number of trials would improve clarity of the trade-off results.
  3. [§5.3] The comparison tables report mean performance but omit the specific baseline implementations (e.g., exact reward model or steering strength) used for each competing method; a short appendix table listing these details would aid reproducibility.
  4. [Abstract, §1] The phrase 'outperforms existing baselines' is used without quantifying the margin; adding a one-sentence summary of the largest observed improvement (e.g., on a particular metric) would strengthen the claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the value-filtered decoding approach, including recognition of the explicit bound on false interventions and the single-hyperparameter control. The recommendation for minor revision is noted.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation presents a value-based token filter with an explicit, mathematically derived bound on false-intervention probability controlled by a single threshold hyperparameter. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, nor does any load-bearing step rely on a self-citation chain whose content is itself unverified. The bound and performance claims are presented as independent of the method's internal definitions and are evaluated against external baselines and datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence and applicability of a value-based safety criterion that supports token-level filtering and a probability bound; the threshold is the only explicit free parameter mentioned.

free parameters (1)
  • threshold hyperparameter
    Single parameter that trades off intervention rate against safety improvement by controlling the bound on false interventions.
axioms (1)
  • domain assumption: A value-based safety criterion exists that can be evaluated per token to decide whether an intervention is necessary.
    Required for the filtering step to selectively target unsafe generations without unnecessary changes.


