
arxiv: 2606.11201 · v1 · pith:SLPG2DKQnew · submitted 2026-04-22 · 💻 cs.LG · cs.AI · cs.CL

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

Pith reviewed 2026-07-05 04:36 UTC · model glm-5.2

classification 💻 cs.LG · cs.AI · cs.CL
keywords alignment · blendin · inference-time · model · models · guidance · guidances · interventions

The pith

Soft blending beats hard acceptance in LLM alignment

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper identifies and addresses a failure mode in inference-time alignment of language models, where an aligned 'guidance' model steers an unaligned 'base' model during generation. The authors discover an 'intervention paradox': model pairs that require more guidance interventions actually perform worse, not better. This happens because guidance models often give bad suggestions at exactly the positions where the base model is struggling, and existing methods blindly accept these suggestions. The authors propose BlendIn, which replaces binary accept-or-reject decisions with a soft blend of the two models' probability distributions. At each uncertain position, BlendIn computes a weighted average of the guidance and base model distributions, where the weight α adapts based on each model's confidence and whether the guidance model's top token already has support in the base model's distribution. The method achieves up to 50% improvement on challenging model pairs while preserving performance on pairs that already work well. The paper also establishes intervention rate as a diagnostic signal: pairs exceeding roughly 20% intervention tend to fail, and this can be detected on a small data subset without running full benchmarks.

Core claim

The central discovery is the intervention paradox: a statistically significant negative correlation between intervention rate and performance across multiple benchmarks (GSM8K r=-0.65, TruthfulQA r=-0.56, XSTest r=-0.76). This refutes the implicit assumption that all guidance is beneficial and reveals that existing inference-time alignment methods suffer from quality blindness: they cannot distinguish helpful from harmful guidance. The paper shows that the paradox is not explained by vocabulary overlap between models (no significant correlation found), and that it cannot be fixed by simply capping intervention rates (which degrades performance further by discarding both good and bad guidance). The proposed solution is soft blending of the two models' probability distributions in place of binary accept-or-reject decisions.
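To make the correlation figures above concrete, here is a minimal sketch of how a Pearson r between per-pair intervention rate and per-pair score could be computed. The function name `pearson_r` and the toy numbers are illustrative assumptions, not values or code from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Made-up illustrative values, NOT results from the paper:
# one (intervention rate, benchmark score) point per model pair.
toy_rates = [0.05, 0.10, 0.15, 0.22, 0.30]
toy_scores = [0.62, 0.58, 0.55, 0.31, 0.27]
r = pearson_r(toy_rates, toy_scores)
print(f"r = {r:.2f}")  # negative when higher rates go with lower scores
```

A negative r on real per-pair measurements is what the intervention paradox asserts; the sketch says nothing about significance, which would need a separate test.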

What carries the argument

The central mechanism is the blended distribution P_blend(w) = α·P_guidance(w) + (1-α)·P_base(w), where α = clip(p̂_g/(p̂_b + p̂_g) + λ·P_b(t_g), 0, 1). Here p̂_b and p̂_g are the top-1 probabilities of the base and guidance models, and λ·P_b(t_g) is an agreement bonus that increases guidance weight when the guidance model's top token already has probability mass in the base model's distribution. The blend is applied only when the base model's confidence falls below threshold τ=0.4, and the final token is selected greedily from the blended distribution.
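The mechanism described above can be sketched for a single generation position as follows. This is a minimal illustration, not the authors' implementation: it assumes both models expose next-token distributions over a shared vocabulary (represented here as plain dicts), and it assumes the base model's own top token is kept when its confidence is at or above τ. The names `blend_weight` and `blendin_step` are hypothetical.

```python
from typing import Dict


def blend_weight(p_hat_b: float, p_hat_g: float, p_b_of_tg: float,
                 lam: float = 0.1) -> float:
    """alpha = clip(p_hat_g / (p_hat_b + p_hat_g) + lam * P_b(t_g), 0, 1)."""
    ratio = p_hat_g / (p_hat_b + p_hat_g)
    return min(1.0, max(0.0, ratio + lam * p_b_of_tg))


def blendin_step(p_base: Dict[str, float], p_guid: Dict[str, float],
                 tau: float = 0.4, lam: float = 0.1) -> str:
    """Choose the next token for one position (shared vocabulary assumed)."""
    t_b = max(p_base, key=p_base.get)
    p_hat_b = p_base[t_b]
    if p_hat_b >= tau:
        # Assumption: a confident base model keeps its own greedy token.
        return t_b
    t_g = max(p_guid, key=p_guid.get)
    p_hat_g = p_guid[t_g]
    # Agreement bonus uses the base model's probability of the guidance top token.
    alpha = blend_weight(p_hat_b, p_hat_g, p_base.get(t_g, 0.0), lam)
    vocab = sorted(set(p_base) | set(p_guid))  # sorted for deterministic ties
    blended = {w: alpha * p_guid.get(w, 0.0) + (1.0 - alpha) * p_base.get(w, 0.0)
               for w in vocab}
    return max(vocab, key=blended.get)  # greedy pick from the blend


uncertain_base = {"maybe": 0.35, "yes": 0.33, "no": 0.32}
guidance = {"no": 0.90, "yes": 0.05, "maybe": 0.05}
print(blendin_step(uncertain_base, guidance))
```

In the toy call the base model's top probability (0.35) is below τ=0.4, so the blend is applied; the guidance weight comes out at 0.72 from the confidence ratio plus 0.032 from the agreement bonus.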

If this is right

  • Intervention rate can serve as a cheap pre-deployment diagnostic: practitioners can run a small 100-sample subset, measure intervention frequency, and predict whether a guidance-base model pair will succeed or fail without full benchmark evaluation.
  • The negative correlation between intervention rate and performance suggests that guidance model selection should be based on compatibility with the base model at difficult positions, not on the guidance model's standalone capability.
  • The failure of vocabulary overlap as a predictor implies that guidance quality depends on deeper semantic or representational alignment between models, not surface-level token agreement.
  • The task-dependent nature of improvements (safety tasks benefit more than reasoning tasks) suggests that inference-time alignment is most valuable for alignment-sensitive properties like safety and truthfulness, where base models retain capability but lack alignment.
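The pre-deployment diagnostic in the first bullet could be sketched as below. This assumes the base model's top-1 probability has been logged at every generated position during guided decoding on the small subset; the function names and the 0.20 warning level (taken from the "roughly 20%" figure in the summary) are illustrative choices, not a published API.

```python
from typing import Iterable, List, Tuple


def intervention_rate(base_top1_probs: Iterable[float], tau: float = 0.4) -> float:
    """Fraction of positions where the base model's top-1 probability is below tau."""
    probs = list(base_top1_probs)
    if not probs:
        raise ValueError("no positions to score")
    return sum(p < tau for p in probs) / len(probs)


def flag_pair(per_sample_top1: List[List[float]], tau: float = 0.4,
              warn_at: float = 0.20) -> Tuple[float, bool]:
    """Pool all positions from a small subset and flag high-intervention pairs."""
    pooled = [p for sample in per_sample_top1 for p in sample]
    rate = intervention_rate(pooled, tau)
    return rate, rate > warn_at


# Toy logs for two samples (hypothetical numbers).
rate, risky = flag_pair([[0.9, 0.3, 0.8, 0.7], [0.2, 0.95, 0.6, 0.5]])
print(rate, risky)
```

Pooling positions across samples weights long outputs more heavily; averaging per-sample rates instead is an equally plausible reading of "intervention rate" and may give a different number.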

Load-bearing premise

The adaptive blend weight uses each model's maximum softmax probability as a proxy for that model's quality at each generation position. If a model is confidently wrong — which language models frequently are — the blending weight will allocate trust to the wrong model, and the improvements over binary acceptance may not hold.
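A worked instance with made-up numbers shows how this premise bites. Assume the base model is uncertain with p̂_b = 0.30, the guidance model is confident with p̂_g = 0.90, the base model gives the guidance top token probability P_b(t_g) = 0.05, and λ = 0.1 as in the paper:

```latex
\begin{align*}
\alpha &= \operatorname{clip}\!\left(\frac{\hat{p}_g}{\hat{p}_b + \hat{p}_g} + \lambda\, P_b(t_g),\ 0,\ 1\right)
        = \operatorname{clip}\!\left(\frac{0.90}{0.30 + 0.90} + 0.1 \times 0.05,\ 0,\ 1\right)
        = 0.755, \\
P_{\text{blend}}(t_g) &= \alpha\, P_{\text{guidance}}(t_g) + (1-\alpha)\, P_{\text{base}}(t_g)
        = 0.755 \times 0.90 + 0.245 \times 0.05 \approx 0.692, \\
P_{\text{blend}}(t_b) &\le 0.755 \times 0.10 + 0.245 \times 0.30 = 0.149 .
\end{align*}
```

The bound on the base model's top token t_b uses the fact that the guidance model has at most 0.10 probability left for any token other than t_g. The guidance token therefore wins the greedy pick whether or not it is correct: nothing in α distinguishes a confidently right guidance model from a confidently wrong one.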

What would settle it

If model calibration were measured directly and the correlation between top-1 probability and correctness were near zero for the specific model pairs and tasks tested, the adaptive blend weight α would reduce to an arbitrary mixing parameter, and BlendIn's advantage over fixed-α blending or binary acceptance would disappear.
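One way to measure calibration directly is Expected Calibration Error (ECE). The sketch below is a standard equal-width-bin ECE in plain Python; it assumes the caller can supply a correctness label for each confidence value, which for open-ended generation is itself a design decision the source does not specify. The function name is hypothetical.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1
    ece = 0.0
    for s_conf, s_hit, k in zip(conf_sum, hit_sum, count):
        if k:
            ece += (k / n) * abs(s_hit / k - s_conf / k)
    return ece


# Toy example: four predictions at 0.9 confidence, only one correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

A low ECE alone would not settle the question posed above, since BlendIn depends on the relative confidence of the two models at uncertain positions; a per-position comparison restricted to those positions would be the more direct check.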

Figures

Figures reproduced from arXiv: 2606.11201 by Jin Gan, Jun Luo, Xin Li.

Figure 1: (a) Overview of quality blindness in inference-time alignment and our solution.
Figure 2: Difference between inference-time aligned ...
Figure 3: Intervention paradox: Higher intervention rates correlate with worse performance. Each point represents ...
Figure 4: Vocabulary overlap does not predict performance. We measure top-50 vocabulary overlap, which is the ...
Figure 5: BlendIn for Quality-Aware Inference-Time Alignment. At positions where the base model is uncertain ...
Original abstract

The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidances) only during output generation. Existing proposals apply guidances extracted from certain aligned models without properly assessing their reliability. Nonetheless, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidances lead to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing works, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent and up to 50% performance improvement on challenging model pairs. Our code is available at: https://github.com/DecayingSeart/BlendIn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 7 minor

Summary. The paper introduces BlendIn, an inference-time alignment method that replaces binary accept/reject guidance decisions with soft distribution blending. The method computes an adaptive blend weight α based on model confidence ratios and token agreement, then selects each token greedily from a hybrid distribution P_blend = α·P_g + (1-α)·P_b. The paper also documents an 'intervention paradox': higher intervention rates correlate with worse performance across model pairs. The method is evaluated on 100-sample subsets (Table 1) and full test sets (Table 2) across three benchmarks and several model pairs. The 100-sample results show improvements up to 50% on challenging high-intervention pairs, while full-test-set results show modest 0-4 percentage point improvements that mostly fall within 95% confidence intervals. The paper is transparent about this gap and recommends task-specific α tuning.

Significance. The intervention paradox (Figure 3) is a genuinely useful empirical observation for the inference-time alignment community, and the framing of 'quality blindness' in existing methods is well-motivated. The method is simple and clearly described, with code provided. The distinction from confidence-based ensembling (Section 4.2) is appropriate. However, the significance of the method contribution is substantially undercut by the discrepancy between the headline results (Table 1, 100-sample subsets) and the full-test-set evaluation (Table 2), which I discuss in detail below. The vocabulary overlap null result (Figure 4) is a useful negative finding but is peripheral to the method contribution.

major comments (4)
  1. Tables 1 and 2: The absolute baseline numbers between the 100-sample subset (Table 1) and the full test set (Table 2) are inconsistent in ways that sampling variance alone cannot explain. Most strikingly, for Q→L XSTest, NUDGING scores 0.03 in Table 1 but 0.52 in Table 2 — a 17× difference. For Q→L GSM8K, NUDGING is 0.27 in Table 1 but 0.29 in Table 2 (closer, but the BlendIn result is 0.31 in Table 1 and 0.29 in Table 2, i.e., the improvement reverses direction). For Q→G GSM8K, NUDGING is 0.54 in Table 1 but 0.50 in Table 2. If both tables evaluate the same model pair on the same benchmark with the same method, a 100-sample subset should approximate the full-set mean, not diverge by an order of magnitude. This discrepancy suggests either (a) the 100-sample subset is catastrophically unrepresentative, or (b) there is an undocumented methodological difference between the two evaluations.
  2. Abstract and Section 1: The headline claim of 'up to 50% performance improvement' is drawn from Table 1 (100-sample subsets), but the full-test-set evaluation (Table 2) shows improvements of 0-4 percentage points that do not consistently reach statistical significance (95% CIs overlap in all cases). The abstract does not contextualize this. The paper should either reconcile the two tables or reframe the headline claim to reflect the full-test-set results, which are the more reliable evaluation.
  3. Table 1, Q→L XSTest row: The NUDGING baseline of 0.03 on a 100-sample subset, compared to 0.52 on the full test set, is the single most extreme discrepancy and also the source of the '33% improvement' (0.03→0.04) cited in the text. If the true baseline is closer to 0.52, the 0.02 absolute improvement (0.52→0.54 from Table 2) is within noise. The paper should clarify what happened in this specific cell: whether the 100-sample subset was drawn from a particularly adversarial stratum, whether there was a prompt formatting difference, or whether some other methodological factor explains the divergence.
  4. Section 4.1, Eq. (3): The adaptive blend weight α = clip(p̂_g/(p̂_b + p̂_g) + λ·P_b(t_g), 0, 1) uses top-1 softmax probability as a proxy for model quality. The paper does not validate the calibration assumption — that max softmax probability tracks correctness for the specific model pairs and tasks studied. Given that LLMs are known to be poorly calibrated, this is a load-bearing assumption: if confidence does not track correctness, α misallocates trust and the method's advantages over binary acceptance may not hold. A brief calibration analysis (e.g., reliability diagrams or ECE for the model pairs used) would substantially strengthen the paper.
minor comments (7)
  1. Table 1: The 'Alig.' column (aligned base model via fine-tuning) is included as an upper bound but is never discussed in the results text. Either add a sentence interpreting it or note that it is for reference only.
  2. Table 2 caption: 'Int.%' column is described as intervention rate but the values differ slightly from Table 1 for the same pairs (e.g., Q→L GSM8K: 22.2% in Table 1 vs. 22.4% in Table 2). Clarify whether these are from different sample sets or rounding.
  3. Section 5: The paper states 'We report results on a fixed random subset of 100 samples per benchmark to balance computational cost with statistical reliability.' Given the discrepancies noted above, the claim of 'statistical reliability' for n=100 should be softened or justified with power analysis.
  4. Figure 2: The color scale and cell values are difficult to read in the rendered figure. Consider using a diverging colormap with clearer labels.
  5. Section 4.1: The statement 'Full distributions are substitutable with top k to save computation, where k is an arbitrary large value' is vague. The experiments use k=100; clarify what 'arbitrary large' means and whether results are sensitive to k.
  6. Table 5 (Appendix A.2): The sensitivity analysis for τ and α is only on Qwen→Llama. Extending to at least one low-intervention pair would help readers assess whether the tuning recommendations generalize.
  7. References: The Gemma Team citations (2025, 2024) appear to have author list formatting issues ('and 197 others', 'and 179 others'). Standardize.
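The sampling-variance argument in major comment 1 and minor comment 3 can be checked with a rough interval estimate. The sketch below computes a Wilson score interval for a proportion; it assumes the reported score is the fraction of binary per-sample outcomes (so 0.03 on 100 samples means 3 successes), which the report does not state explicitly, and it ignores the fact that the subset is drawn from a finite test set.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Q->L XSTest, NUDGING on the 100-sample subset: 0.03, read as 3 of 100.
lo, hi = wilson_interval(3, 100)
print(f"95% interval: [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the upper end of the interval stays below 0.10, far from the full-set value of 0.52, which is consistent with the referee's point that sampling variance alone is an unlikely explanation.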

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for a careful and substantive review. The referee correctly identifies that the discrepancy between Tables 1 and 2 is the most serious issue in the paper, and we agree it must be addressed. We provide point-by-point responses below. We accept the need to reframe headline claims and to add a calibration analysis. We are unable to fully explain the most extreme discrepancy (Q→L XSTest NUDGING 0.03 vs 0.52), and we state this honestly.

Point-by-point responses
  1. Referee: Tables 1 and 2 inconsistency: NUDGING scores diverge between 100-sample subsets and full test sets in ways sampling variance cannot explain (e.g., Q→L XSTest 0.03 vs 0.52, Q→L GSM8K BlendIn improvement reverses direction).

    Authors: The referee is correct that these discrepancies are too large to be explained by sampling variance alone, and we acknowledge this as a genuine weakness in the current manuscript. After investigating, we can account for some but not all of the divergence. For Q→L GSM8K, the 100-sample subset (Table 1) happens to over-represent harder problems where NUDGING performs poorly (0.27) relative to the full set (0.29); BlendIn's improvement on the subset (0.31) does not replicate on the full set (0.29), and we agree this reversal means the subset was not representative for that pair. For Q→G GSM8K, the difference (0.54 vs 0.50) is within plausible sampling variance given the high intervention rate and bimodal per-problem outcomes on GSM8K. However, for Q→L XSTest, the NUDGING baseline of 0.03 on the 100-sample subset versus 0.52 on the full test set is a discrepancy we cannot fully explain. We have re-examined our logs and confirmed that the same model pair, prompt template, decoding parameters, and evaluation script were used in both cases. The 100-sample subset was drawn as a fixed random partition (seed documented in code), not an adversarial stratum. Our best hypothesis is that XSTest's small full size (250 samples) combined with the binary safety-scoring metric creates high variance for any subset, and that the Q→L pair on XSTest is particularly sensitive because the guidance model's interventions on safety-refusal tokens can flip entire outputs. But we acknowledge this does not fully account for a 17× difference, and we cannot rule out an undocumented methodological factor in the earlier 100-sample runs. We will revise the paper to (1) clearly flag this discrepancy and our inability to fully resolve it, (2) de-emphasize Table 1 as a headline result and present Table 2 as ... revision: partial

  2. Referee: Abstract and Section 1: 'up to 50% improvement' headline claim is drawn from Table 1 (100-sample subsets), while full-test-set results show 0-4 pp improvements within CIs. The abstract should reframe the claim.

    Authors: We agree. The abstract and introduction overstate the method's contribution by anchoring to the 100-sample subset results. In the revision, we will reframe the headline claim to reflect the full-test-set evaluation: BlendIn shows consistent but modest improvements (0-4 percentage points) on high-intervention pairs, with improvements that do not consistently reach statistical significance at the 95% CI level. We will retain the 100-sample results as supplementary evidence of the method's behavior on challenging subsets, but will explicitly note their limited reliability and the discrepancy with full-set results. The 'up to 50%' figure will be removed from the abstract and replaced with an accurate characterization of the full-test-set findings. revision: yes

  3. Referee: Table 1, Q→L XSTest row: The 0.03 NUDGING baseline is the source of the '33% improvement' claim. If the true baseline is ~0.52, the 0.02 absolute improvement (0.52→0.54) is within noise. The paper should clarify what happened in this specific cell.

    Authors: We agree this is the most damaging specific case. As noted in our response to the first comment, we cannot fully explain the 0.03 vs 0.52 divergence. We re-ran the 100-sample subset evaluation and confirmed the 0.03 result is reproducible on that specific subset, suggesting it is not a transient bug but rather a genuine (and extreme) instance of subset non-representativeness on a small, high-variance benchmark. The referee is correct that the '33% improvement' claim is an artifact of this unrepresentative baseline: on the full test set, the improvement is 0.02 (0.52→0.54), which is within the 95% CI. We will remove the '33% improvement' claim from the paper and add an explicit discussion of this cell as a cautionary example of subset variance. We will also add a note recommending that future evaluations on small benchmarks like XSTest use the full test set rather than subsets. revision: yes

  4. Referee: Section 4.1, Eq. (3): The adaptive blend weight uses top-1 softmax probability as a proxy for model quality, but the paper does not validate the calibration assumption. LLMs are known to be poorly calibrated. A calibration analysis (ECE, reliability diagrams) would strengthen the paper.

    Authors: This is a fair and important point. The referee is correct that the blending weight α depends on the assumption that max softmax probability is a meaningful proxy for prediction quality, and we do not currently validate this assumption. We will add a calibration analysis in the revision. Specifically, we plan to report Expected Calibration Error (ECE) and reliability diagrams for the model pairs used in our experiments (at minimum Q→L, Q→G, and L→G), measured on the benchmarks where we evaluate. We expect this analysis to show that while the models are not perfectly calibrated, the relative confidence ordering between base and guidance models at uncertain positions is informative enough for the blending mechanism to provide benefit over binary acceptance. If the analysis reveals that calibration is too poor for the adaptive α to be effective, we will report this honestly and note that the manual α tuning results (Table 5) provide a fallback. We will also add discussion of this limitation in the method section. revision: yes

standing simulated objections not resolved
  • We are unable to provide a complete methodological explanation for the Q→L XSTest NUDGING discrepancy (0.03 on 100-sample subset vs 0.52 on full test set). We have verified that the same code, model pair, prompt template, and decoding parameters were used, and the subset result is reproducible. Our best explanation is extreme subset non-representativeness on a small, high-variance benchmark, but we cannot fully rule out an undocumented factor in the earlier runs.

Circularity Check

0 steps flagged

No circularity found: the blend weight formula is a heuristic computed from model outputs, not fitted to target performance or derived from self-cited premises.

full rationale

The paper's derivation chain is self-contained and does not exhibit circularity. The core methodological contribution — the adaptive blend weight α = clip(p̂_g/(p̂_b + p̂_g) + λ·P_b(t_g), 0, 1) in Eq. 3 — is a heuristic formula computed from observable model properties (top-1 probabilities and token agreement), not fitted to the performance metrics it claims to improve. The parameter λ=0.1 is a hand-set constant, not estimated from evaluation data. The paper does not claim to derive this formula from first principles or from a self-cited theorem; it presents it as a principled default with optional task-specific tuning. The intervention paradox (Figure 3) is an independent empirical observation, not a consequence of the method's design. The cited prior work (Fei et al., 2025 for NUDGING and τ=0.4) is by different authors, so there is no self-citation chain. The discrepancy between Table 1 (100-sample subsets) and Table 2 (full test sets) in absolute baseline numbers is a correctness and empirical validity concern, not a circularity issue — it does not involve any step where an output is defined in terms of the quantity it claims to predict. The paper's claims may be empirically overstated, but the derivation itself is not circular by construction.

Axiom & Free-Parameter Ledger

3 free parameters · 3 axioms · 0 invented entities

BlendIn introduces no new entities, particles, or postulated objects. The method operates entirely on existing model probability distributions. The free parameters (τ, λ, k) are standard hyperparameters, not fitted constants masquerading as derived quantities. The domain assumptions about confidence-as-quality-proxy are the most structurally important axioms and are unvalidated.

free parameters (3)
  • τ (uncertainty threshold) = 0.4
    Threshold below which base model is considered uncertain and guidance is triggered. Inherited from prior work (Fei et al., 2025) but used as a tunable parameter. Sensitivity analysis in Table 5 shows task-dependent effects.
  • λ (agreement bonus weight) = 0.1
    Controls the magnitude of the agreement bonus in the blend weight formula (Eq. 3). Chosen by hand without systematic optimization. No sensitivity analysis provided for this parameter specifically.
  • k (top-k tokens for blending) = 100
    Number of top tokens from each model's distribution used for blending. Stated as 'arbitrary large value' but set to 100 in experiments. No sensitivity analysis provided.
axioms (3)
  • domain assumption Model confidence (max softmax probability) is a reliable proxy for model quality/correctness at a given generation position.
    This is the foundational assumption behind the adaptive blend weight α in Eq. 3. The confidence ratio p̂_g/(p̂_b + p̂_g) only makes sense as a quality indicator if confidence tracks correctness. Invoked in §4.1, Step 2. Not directly validated in the paper.
  • domain assumption Token-level agreement between models (guidance top token appearing in base distribution) indicates guidance quality.
    The agreement bonus λ·P_b(t_g) in Eq. 3 assumes that when the guidance model's top token has support in the base distribution, the guidance is more likely to be beneficial. Invoked in §4.1, Step 2. Not directly validated.
  • domain assumption Greedy decoding from the blended distribution produces coherent multi-token sequences.
    BlendIn selects argmax from the blended distribution at each position independently (Eq. 5, Algorithm 1). This assumes that locally optimal token selection from a blended distribution leads to coherent generation, which may not hold for long sequences. Invoked in §4.1, Step 3.

pith-pipeline@v1.1.0-glm · 17998 in / 2815 out tokens · 226039 ms · 2026-07-05T04:36:55.449573+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    Alfred V. Aho and Jeffrey D. Ullman. 1972.

  2. [2]

    Publications Manual. 1983.

  3. [3]

    Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243

  4. [4]

    Galen Andrew and Jianfeng Gao. Scalable training of ...

  5. [5]

    Dan Gusfield. 1997.

  6. [6]

    Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository.

  7. [7]

    Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.

  8. [14]

    Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).

  9. [15]

    Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR).

  10. [17]

    Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning.

  11. [18]

    Llama 3 Model Card. 2024.

  12. [19]

    The Llama 3 Herd of Models. 2024.

  13. [20]

    Gemma Team. Gemma 3.

  14. [21]

    Gemma 3 Technical Report. 2025.

  15. [22]

    Gemma Team. Gemma. doi:10.34740/KAGGLE/M/3301

  16. [23]

    Gemma 2: Improving Open Language Models at a Practical Size. 2024.

  17. [24]

    Qwen2.5 Technical Report. 2025.

  18. [25]

    Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, R...

  19. [26]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Proceedings of the 37th International Conference on Neural Information Processing Systems.

  20. [27]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Proceedings of the 40th International Conference on Machine Learning.

  21. [28]

    Accelerating Large Language Model Decoding with Speculative Sampling. 2023.

  22. [29]

    Qwen3 Technical Report. 2025.

  23. [30]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.

  24. [31]

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. https://arxiv.org/abs/2302.01318 Accelerating large language model decoding with speculative sampling. Preprint, arXiv:2302.01318

  25. [32]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1

  26. [33]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  27. [34]

    Yu Fei, Yasaman Razeghi, and Sameer Singh. 2025. https://doi.org/10.18653/v1/2025.acl-long.623 Nudging: Inference-time alignment of LLMs via guided decoding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12702--12739, Vienna, Austria. Association for Computational Linguistics

  28. [35]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  29. [36]

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR)

  30. [37]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR)

  31. [38]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  32. [39]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org

  33. [40]

    Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. 2023. The unlocking spell on base llms: Rethinking alignment via in-context learning. ArXiv preprint

  34. [41]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. https://doi.org/10.18653/v1/2022.acl-long.229 TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics

  35. [42]

    Zhixuan Liu, Zhanhui Zhou, Yuanfu Wang, Chao Yang, and Yu Qiao. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.242 Inference-time language model alignment via integrated value guidance. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4181--4195, Miami, Florida, USA. Association for Computational Linguistics

  36. [43]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human fee...

  37. [44]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc

  38. [45]

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.naacl-long.301 XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Ling...

  39. [46]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

  40. [47]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, and 179 others. 2024. https://arxiv.org/abs/2408.00118 Gemma 2: ...

  41. [48]

    Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Mozhi Zhang, Ke Ren, Botian Jiang, and Xipeng Qiu. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.585 InferAligner: Inference-time alignment for harmlessness through cross-model guidance. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1...

  42. [49]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report. Preprint, arXiv:2505.09388