ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

Chirag Chawla; Pratinav Seth; Vinay Kumar Sankarapu

arXiv: 2606.12342 · v1 · pith:JOOMBDGLnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI · cs.ET · cs.LG

Pith reviewed 2026-06-27 10:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.ET · cs.LG

keywords safety alignment · inference-time methods · logit mixing · cross-vocabulary transfer · LLM defense · refusal enhancement · adversarial robustness

The pith

Safety alignment can be transferred between large language models at inference time even when they use different vocabularies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ALIGNBEAM transfers safety from an anchor model to a target model by translating the anchor's logits into the target's vocabulary at each decoding step. Multiple candidate continuations are generated through this mixing process, and a small judge model selects the safest one. The method requires no weight changes or retraining on either model. Experiments across cross-vocabulary and same-vocabulary pairs show increased refusal rates on adversarial prompts while task accuracy remains largely intact. The safety-utility balance can be adjusted at deployment by varying the number of candidates.

Core claim

ALIGNBEAM enables inference-time transfer of safety alignment between models with incompatible vocabularies by translating anchor logits token-by-token into the target vocabulary at each decoding step and using a small LLM judge to select the safest among K candidate continuations, without modifying any model weights.

What carries the argument

Cross-vocabulary logit mixing, which converts anchor model logits into the target model's vocabulary token-by-token during decoding before judge selection among K beams.
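The abstract does not spell out the translation procedure, so the following is only a minimal sketch of what one decoding step of cross-vocabulary mixing could look like. It assumes tokens are matched by exact string equality, that anchor mass on unmatched tokens is dropped and the rest renormalized, and that mixing happens in probability space with a hypothetical weight `alpha`. All function names are illustrative, not the authors'.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def translate_anchor_probs(anchor_probs, anchor_vocab, target_vocab):
    """Project anchor probabilities onto the target vocabulary.

    Assumption: tokens are matched by exact string equality. Anchor mass on
    tokens with no match is dropped and the remainder is renormalized.
    Returns None when the two vocabularies share no token with nonzero mass.
    """
    target_index = {tok: i for i, tok in enumerate(target_vocab)}
    projected = [0.0] * len(target_vocab)
    for tok, p in zip(anchor_vocab, anchor_probs):
        j = target_index.get(tok)
        if j is not None:
            projected[j] += p
    mass = sum(projected)
    if mass == 0.0:
        return None
    return [p / mass for p in projected]


def mix_step(target_logits, anchor_logits, target_vocab, anchor_vocab, alpha=0.5):
    """Return a mixed next-token distribution over the target vocabulary.

    alpha is a hypothetical mixing weight: 0 keeps the target model as is,
    1 follows the translated anchor distribution only.
    """
    target_probs = softmax(target_logits)
    translated = translate_anchor_probs(
        softmax(anchor_logits), anchor_vocab, target_vocab
    )
    if translated is None:
        # No overlap at this step: fall back to the target model.
        return target_probs
    return [
        (1.0 - alpha) * pt + alpha * pa
        for pt, pa in zip(target_probs, translated)
    ]
```

With toy vocabularies in which the target favors "Sure" and the anchor favors "Sorry", raising `alpha` moves probability mass toward the anchor's preferred token. Exact string matching is the simplest possible choice; a real system would also need to handle tokens that split the same text differently.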

If this is right

Domain-fine-tuned models can regain safety without additional training.
The safety-utility trade-off becomes tunable at deployment time.
The approach works for both cross-vocabulary and same-vocabulary model pairs.
No permanent changes to model weights are required.
Inference overhead stays within practical limits for the tested setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mixing and selection process could potentially transfer other behavioral properties beyond safety.
Maintaining a small set of specialized safe anchor models might suffice for protecting many downstream specialists.
The vocabulary translation step may introduce subtle biases that affect long-horizon generation in ways not captured by current benchmarks.
Combining ALIGNBEAM with other inference-time interventions could produce stronger composite defenses.

Load-bearing premise

A small LLM judge can reliably identify the safest continuation among the K candidates generated via cross-vocabulary logit mixing at each decoding step.
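To make the premise concrete, here is a minimal sketch of judge-based selection among K candidates. The judge is a keyword-matching stand-in, which is an assumption for illustration only; the paper uses a small LLM whose prompt and scoring are not described in the abstract. `best_of_k`, `keyword_judge`, and `REFUSAL_MARKERS` are hypothetical names.

```python
from typing import Callable, Sequence

# Hypothetical refusal markers. A real judge would be a small LLM,
# not a keyword match; this stand-in only keeps the sketch runnable.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def keyword_judge(text: str) -> float:
    """Toy safety score: 1.0 if the text looks like a refusal, else 0.0."""
    lowered = text.lower()
    return 1.0 if any(marker in lowered for marker in REFUSAL_MARKERS) else 0.0


def best_of_k(
    candidates: Sequence[str],
    judge_score: Callable[[str], float],
    k: int,
) -> str:
    """Return the highest-scoring candidate among the first k.

    Ties go to the earliest candidate, so k=1 reproduces plain decoding.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    pool = list(candidates)[:k]
    if not pool:
        raise ValueError("need at least one candidate")
    return max(pool, key=judge_score)
```

In this sketch `k` plays the role of the deployment-time knob: with `k=1` the judge has nothing to choose from, and larger `k` gives it more chances to find a safe continuation at higher inference cost. Everything downstream depends on `judge_score` being right, which is exactly the premise stated above.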

What would settle it

A consistent failure of the judge to select safe continuations on standard adversarial benchmarks, or a large drop in task accuracy below the baseline target model, would falsify the method's effectiveness.
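As a rough illustration of how those two outcomes could be measured, the sketch below compares a baseline run with a defended run using per-example boolean outcomes. The function names and the input format are assumptions; the paper's actual metrics and benchmarks are not described in the abstract.

```python
def rate(flags):
    """Fraction of True values in a non-empty sequence of booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one example")
    return sum(1 for f in flags if f) / len(flags)


def compare_runs(baseline_refused, defended_refused, baseline_correct, defended_correct):
    """Summarize the change in refusal rate and task accuracy.

    *_refused: per-prompt booleans on an adversarial benchmark.
    *_correct: per-question booleans on a task benchmark.
    """
    return {
        "refusal_gain": rate(defended_refused) - rate(baseline_refused),
        "accuracy_change": rate(defended_correct) - rate(baseline_correct),
    }
```

A refusal gain near zero, or a strongly negative accuracy change, would correspond to the two failure modes named above. What counts as "large" is left to the reader, since the source gives no threshold.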

Original abstract

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

ALIGNBEAM introduces token-by-token logit translation to enable cross-vocabulary safety mixing at inference time, but the abstract supplies no numbers or judge validation to back the claims.

ALIGNBEAM tries to fix safety loss after domain fine-tuning by pulling in logits from a safe anchor model during decoding, even when the vocabularies do not match. It translates the anchor logits token by token into the target vocabulary, generates K candidate continuations, and lets a small LLM judge pick the safest one at each step.

The actual novelty is the cross-vocabulary translation step. Earlier logit-mixing defenses were limited to models that already shared a vocabulary, which ruled them out for many cross-family specialist models where safety degrades most. Removing that restriction while staying training-free is the concrete advance.

The paper frames the practical problem clearly: fine-tuned models become too willing to answer harmful prompts when those prompts use domain-specific language. Keeping the method weight-free and tunable at deployment time is a reasonable design choice for real use.

The abstract states that refusal rates rise substantially on adversarial benchmarks while task accuracy and overhead stay acceptable, across both same-vocabulary and cross-vocabulary pairs. If the full experiments show clean ablations and reasonable baselines, that would be useful evidence.

The soft spot is the complete lack of quantitative support or experimental detail in the abstract. No refusal percentages, no comparison to prior methods, and no metrics on how often the small judge actually selects the safe continuation on adversarial inputs. The judge step is load-bearing; without accuracy numbers or human agreement checks, it is impossible to tell whether the reported gains come from the mixing or from the selection heuristic.

This work is aimed at people who deploy fine-tuned LLMs and need modular safety fixes. A reader already working on inference-time defenses would see the vocabulary-handling trick as worth examining.

The idea engages a real deployment constraint and deserves a serious referee to evaluate the experiments and the judge reliability once the full results are available.

Referee Report

2 major / 1 minor

Summary. The paper introduces ALIGNBEAM, a training-free inference-time method for transferring safety alignment from an anchor model to a target model across different vocabularies. It translates anchor logits token-by-token into the target vocabulary at each decoding step to generate K candidate continuations, then uses a small LLM judge to select the safest one. The method claims to raise refusal rates on adversarial benchmarks while preserving task accuracy, without modifying weights and with tunable safety-utility trade-off at deployment.

Significance. If the empirical claims hold with proper validation, the approach would demonstrate that safety alignment can be transferred between model families at inference time without retraining or weight access, addressing degradation from domain fine-tuning. The cross-vocabulary capability and lack of free parameters in the core mixing step are potential strengths.

major comments (2)

[Abstract / Method] Abstract and method description: the central claim that safety is transferred relies on the small LLM judge reliably selecting the aligned continuation, yet no accuracy metrics, inter-annotator agreement, comparison to human labels, or validation on adversarial inputs are reported for this selection step.
[Abstract] Abstract: the assertion of 'substantially raises refusal on adversarial benchmarks while keeping task accuracy... within practical bounds' is presented without any quantitative results, tables, baselines, or error analysis, preventing assessment of whether the evidence supports the claim.

minor comments (1)

[Method] Clarify the exact token-by-token translation procedure with an equation or pseudocode to make the cross-vocabulary mixing reproducible.
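One way the requested equation might look is sketched below. This is a hypothetical formalization under stated assumptions, not the paper's notation: $V_A$ and $V_T$ are the anchor and target vocabularies, $\phi : V_A \to V_T \cup \{\bot\}$ maps each anchor token to a matching target token (or to $\bot$ when none exists), $p_A$ and $p_T$ are the two models' next-token distributions given the shared text prefix $x_{<t}$, and $\alpha \in [0, 1]$ is an assumed mixing weight.

```latex
% Hypothetical formalization; symbols are assumptions, not the paper's notation.
\[
\tilde{p}_A(v \mid x_{<t})
  = \frac{1}{Z_t} \sum_{u \in V_A \,:\, \phi(u) = v} p_A(u \mid x_{<t}),
\qquad
Z_t = \sum_{u \in V_A \,:\, \phi(u) \neq \bot} p_A(u \mid x_{<t}),
\]
\[
p_{\mathrm{mix}}(v \mid x_{<t})
  = (1 - \alpha)\, p_T(v \mid x_{<t}) + \alpha\, \tilde{p}_A(v \mid x_{<t}),
\qquad v \in V_T .
\]
```

The first line projects the anchor distribution onto the target vocabulary and renormalizes by the matched mass $Z_t$ (assumed positive); the second mixes it with the target distribution. A revised manuscript would need to state the actual choice of $\phi$ and whether mixing is done on logits or probabilities.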

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications from the manuscript and indicate planned revisions where appropriate.

Referee: [Abstract / Method] Abstract and method description: the central claim that safety is transferred relies on the small LLM judge reliably selecting the aligned continuation, yet no accuracy metrics, inter-annotator agreement, comparison to human labels, or validation on adversarial inputs are reported for this selection step.

Authors: We acknowledge that the manuscript does not report dedicated accuracy metrics or human validation specifically for the LLM judge's selection decisions. The paper's empirical claims rest on the end-to-end results of ALIGNBEAM (Section 4), where the judge-based selection is shown to contribute to higher refusal rates across benchmarks. To address this, we will add an appendix containing a validation study of the judge on a subset of adversarial inputs, including agreement with human labels and basic accuracy metrics. This will be incorporated in the revision.

revision: yes

Referee: [Abstract] Abstract: the assertion of 'substantially raises refusal on adversarial benchmarks while keeping task accuracy... within practical bounds' is presented without any quantitative results, tables, baselines, or error analysis, preventing assessment of whether the evidence supports the claim.

Authors: The abstract is a high-level summary of the method and its outcomes. The supporting quantitative evidence (refusal rates on adversarial benchmarks, task accuracy preservation, baseline comparisons, and error analysis) is provided in full in Section 4 (Experiments) along with the associated tables and figures. These sections enable direct assessment of the claims. We can add one or two key quantitative highlights to the abstract if space permits, but we view the current structure as standard for the venue.

revision: partial
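The judge validation study promised in the first response could report numbers along the following lines. This is a minimal sketch, assuming each continuation has one categorical label from the judge and one from a human annotator; it computes raw agreement and Cohen's kappa. The function name and label values are illustrative, and nothing here comes from the manuscript.

```python
from collections import Counter


def agreement_stats(judge_labels, human_labels):
    """Raw agreement and Cohen's kappa between judge and human labels."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    n = len(judge_labels)
    if n == 0:
        raise ValueError("need at least one labeled example")

    # Observed agreement: fraction of examples where the labels match.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement from each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n) for label in labels
    )

    # Kappa is undefined when chance agreement is 1 (both raters use one
    # label throughout); report 1.0 in that case as a convention of this sketch.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return {"accuracy": observed, "kappa": kappa}
```

For example, with judge labels `["safe", "safe", "unsafe", "unsafe"]` and human labels `["safe", "unsafe", "unsafe", "unsafe"]`, raw agreement is 0.75 and kappa is 0.5, which shows why chance-corrected agreement is worth reporting alongside accuracy.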

Circularity Check

0 steps flagged

No significant circularity: method is empirical and self-contained

The paper introduces ALIGNBEAM as a training-free inference-time procedure using logit translation and an LLM judge. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. The central claim (safety transfer without weight updates) is evaluated on external benchmarks rather than being tautological. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. This is the normal case of an independent empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; no details on fitting, background assumptions, or new postulated components are given.

pith-pipeline@v0.9.1-grok · 5698 in / 1085 out tokens · 29418 ms · 2026-06-27T10:00:22.695112+00:00 · methodology

