ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
Pith reviewed 2026-06-27 10:00 UTC · model grok-4.3
The pith
Safety alignment can be transferred between large language models at inference time even when they use different vocabularies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ALIGNBEAM enables inference-time transfer of safety alignment between models with incompatible vocabularies by translating anchor logits token-by-token into the target vocabulary at each decoding step and using a small LLM judge to select the safest among K candidate continuations, without modifying any model weights.
What carries the argument
Cross-vocabulary logit mixing, which converts anchor model logits into the target model's vocabulary token-by-token during decoding before judge selection among K beams.
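The text above does not spell out the translation procedure, so the following is only a minimal sketch of one way to project anchor log-probabilities onto a different vocabulary. It assumes a crude prefix-matching rule: a target token receives the probability mass of every anchor token whose string starts with that target token's string. The function name `translate_logprobs`, the `floor` value, and the toy vocabularies are all hypothetical and are not taken from the paper.

```python
import math

def translate_logprobs(anchor_logprobs, anchor_vocab, target_vocab, floor=-30.0):
    """Project anchor log-probabilities onto the target vocabulary.

    anchor_logprobs: dict anchor_token_id -> log-probability
    anchor_vocab:    dict anchor_token_id -> token string
    target_vocab:    dict target_token_id -> token string
    Assumption (not from the paper): a target token collects the mass of
    every anchor token whose string starts with the target token's string.
    The result is an unnormalized score per target token.
    """
    translated = {}
    for t_id, t_str in target_vocab.items():
        mass = 0.0
        if t_str:  # an empty string would match every anchor token
            for a_id, lp in anchor_logprobs.items():
                if anchor_vocab[a_id].startswith(t_str):
                    mass += math.exp(lp)
        # Target tokens with no matching anchor token get a low floor score.
        translated[t_id] = math.log(mass) if mass > 0.0 else floor
    return translated

# Toy vocabularies (hypothetical, for illustration only).
anchor_vocab = {0: "Sorry", 1: "Sure", 2: " I"}
target_vocab = {10: "Sor", 11: "Sure", 12: "S", 13: "Yes"}
anchor_logprobs = {0: math.log(0.7), 1: math.log(0.2), 2: math.log(0.1)}

translated = translate_logprobs(anchor_logprobs, anchor_vocab, target_vocab)
```

In this toy case the target token "Sor" inherits the anchor's mass for "Sorry", while "S" collects the mass of both "Sorry" and "Sure". Because overlapping target tokens share mass, the scores would need renormalizing before being treated as a distribution.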
If this is right
- Domain-fine-tuned models can regain safety without additional training.
- The safety-utility trade-off becomes tunable at deployment time.
- The approach works for both cross-vocabulary and same-vocabulary model pairs.
- No permanent changes to model weights are required.
- Inference overhead stays within practical limits for the tested setups.
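The deployment-time trade-off mentioned in the list above can be illustrated with a small sketch. The source does not say which knob controls the trade-off (it might equally be K or a judge threshold), so this example assumes a hypothetical mixing weight `alpha` that interpolates between target and translated anchor log-probabilities. The function name `mix_logprobs` and the toy numbers are assumptions.

```python
import math

def mix_logprobs(target_logprobs, anchor_logprobs, alpha):
    """Interpolate two log-probability tables in log space and renormalize.

    Both dicts must share the same keys (target-vocabulary tokens).
    alpha = 0 reproduces the target model; alpha = 1 follows the anchor.
    `alpha` is a hypothetical knob, not a parameter named in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = {
        tok: (1.0 - alpha) * lp + alpha * anchor_logprobs[tok]
        for tok, lp in target_logprobs.items()
    }
    # Renormalize so the mixed scores form a proper distribution.
    log_z = math.log(sum(math.exp(v) for v in mixed.values()))
    return {tok: v - log_z for tok, v in mixed.items()}

# Toy next-token distributions (hypothetical): the specialist prefers to
# comply, the safe anchor prefers to refuse.
target_lp = {"Sure": math.log(0.8), "Sorry": math.log(0.2)}
anchor_lp = {"Sure": math.log(0.1), "Sorry": math.log(0.9)}

# Probability of the refusal token at three settings of the knob.
refusal_prob = {
    a: math.exp(mix_logprobs(target_lp, anchor_lp, a)["Sorry"])
    for a in (0.0, 0.5, 1.0)
}
```

With these toy numbers, raising `alpha` moves the probability of the refusal token from the target's value toward the anchor's, which is the kind of adjustment that needs no retraining.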
Where Pith is reading between the lines
- The same mixing and selection process could potentially transfer other behavioral properties beyond safety.
- Maintaining a small set of specialized safe anchor models might suffice for protecting many downstream specialists.
- The vocabulary translation step may introduce subtle biases that affect long-horizon generation in ways not captured by current benchmarks.
- Combining ALIGNBEAM with other inference-time interventions could produce stronger composite defenses.
Load-bearing premise
A small LLM judge can reliably identify the safest continuation among the K candidates generated via cross-vocabulary logit mixing at each decoding step.
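The selection step this premise depends on can be sketched as follows. The judge here is a keyword-matching stand-in, not an LLM, and the names `select_safest` and `keyword_judge`, the scoring scale, and the refusal markers are all hypothetical. A real deployment would replace `keyword_judge` with a call to the small judge model.

```python
from typing import Callable, List, Sequence, Tuple

def select_safest(
    prompt: str,
    candidates: Sequence[str],
    judge: Callable[[str, str], float],
) -> Tuple[str, List[float]]:
    """Return the candidate with the highest judge safety score.

    `judge(prompt, continuation)` is assumed to return a higher score for
    a safer continuation. Ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [judge(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores

def keyword_judge(prompt: str, continuation: str) -> float:
    """Stand-in for the small LLM judge: 1.0 if the text looks like a refusal."""
    refusal_markers = ("i can't", "i cannot", "i'm sorry")
    text = continuation.lower()
    return 1.0 if any(m in text for m in refusal_markers) else 0.0

# K = 3 hypothetical candidate continuations for one harmful prompt.
candidates = [
    "Sure, here is how",
    "I'm sorry, but I can't help with that.",
    "Step 1:",
]
chosen, scores = select_safest("<harmful prompt>", candidates, keyword_judge)
```

The sketch makes the dependency explicit: whatever the judge scores highest is what the user sees, so any systematic judge error passes straight through to the output.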
What would settle it
A consistent failure of the judge to select safe continuations on standard adversarial benchmarks, or a large drop in task accuracy below the baseline target model, would falsify the method's effectiveness.
Original abstract
Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ALIGNBEAM, a training-free inference-time method for transferring safety alignment from an anchor model to a target model across different vocabularies. It translates anchor logits token-by-token into the target vocabulary at each decoding step to generate K candidate continuations, then uses a small LLM judge to select the safest one. The authors claim that the method raises refusal rates on adversarial benchmarks while preserving task accuracy, without modifying weights, and with a safety-utility trade-off that is tunable at deployment.
Significance. If the empirical claims hold with proper validation, the approach would demonstrate that safety alignment can be transferred between model families at inference time without retraining or weight access, addressing degradation from domain fine-tuning. The cross-vocabulary capability and lack of free parameters in the core mixing step are potential strengths.
major comments (2)
- [Abstract / Method] Abstract and method description: the central claim that safety is transferred relies on the small LLM judge reliably selecting the aligned continuation, yet no accuracy metrics, inter-annotator agreement, comparison to human labels, or validation on adversarial inputs are reported for this selection step.
- [Abstract] Abstract: the assertion of 'substantially raises refusal on adversarial benchmarks while keeping task accuracy... within practical bounds' is presented without any quantitative results, tables, baselines, or error analysis, preventing assessment of whether the evidence supports the claim.
minor comments (1)
- [Method] Clarify the exact token-by-token translation procedure with an equation or pseudocode to make the cross-vocabulary mixing reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications from the manuscript and indicate planned revisions where appropriate.
Point-by-point responses
- Referee: [Abstract / Method] Abstract and method description: the central claim that safety is transferred relies on the small LLM judge reliably selecting the aligned continuation, yet no accuracy metrics, inter-annotator agreement, comparison to human labels, or validation on adversarial inputs are reported for this selection step.
  Authors: We acknowledge that the manuscript does not report dedicated accuracy metrics or human validation specifically for the LLM judge's selection decisions. The paper's empirical claims rest on the end-to-end results of ALIGNBEAM (Section 4), where the judge-based selection is shown to contribute to higher refusal rates across benchmarks. To address this, we will add an appendix containing a validation study of the judge on a subset of adversarial inputs, including agreement with human labels and basic accuracy metrics. This will be incorporated in the revision.
  Revision: yes
- Referee: [Abstract] Abstract: the assertion of 'substantially raises refusal on adversarial benchmarks while keeping task accuracy... within practical bounds' is presented without any quantitative results, tables, baselines, or error analysis, preventing assessment of whether the evidence supports the claim.
  Authors: The abstract is a high-level summary of the method and its outcomes. The supporting quantitative evidence (refusal rates on adversarial benchmarks, task accuracy preservation, baseline comparisons, and error analysis) is provided in full in Section 4 (Experiments) along with the associated tables and figures. These sections enable direct assessment of the claims. We can add one or two key quantitative highlights to the abstract if space permits, but we view the current structure as standard for the venue.
  Revision: partial
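The judge validation study promised in the first response could report numbers along these lines. This is a minimal sketch assuming one categorical label per example from both the judge and a human annotator. The function name `agreement_metrics` and the toy labels are hypothetical, and the paper does not state which agreement statistic it will use. Cohen's kappa is shown here as one common choice.

```python
from typing import Dict, Sequence

def agreement_metrics(
    judge_labels: Sequence[str], human_labels: Sequence[str]
) -> Dict[str, float]:
    """Raw agreement and Cohen's kappa between judge and human labels."""
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("label sequences must be non-empty and equal length")
    judge = list(judge_labels)
    human = list(human_labels)
    # Observed agreement: fraction of examples where both labels match.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if the two raters labeled independently.
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    # Kappa is undefined when expected == 1; this sketch returns 1.0 there.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return {"accuracy": observed, "cohen_kappa": kappa}

# Toy labels (hypothetical), treating the human labels as ground truth.
judge = ["safe", "safe", "unsafe", "unsafe"]
human = ["safe", "unsafe", "unsafe", "unsafe"]
metrics = agreement_metrics(judge, human)
```

Reporting kappa alongside raw accuracy matters when one class dominates, since a judge that always answers "unsafe" can reach high accuracy on an adversarial set while agreeing with humans no better than chance.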
Circularity Check
No significant circularity: method is empirical and self-contained
Full rationale
The paper introduces ALIGNBEAM as a training-free inference-time procedure using logit translation and an LLM judge. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. The central claim (safety transfer without weight updates) is evaluated on external benchmarks rather than being tautological. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. This is the normal case of an independent empirical method.