BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

Christiana Chamon Garcia; Muhammad Zeeshan Karamat; Sadman Saif

arXiv: 2512.22174 · v2 · submitted 2025-12-18 · 💻 cs.DC · cs.AI · cs.AR · cs.CR · cs.LG

Pith reviewed 2026-05-16 20:47 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.AR · cs.CR · cs.LG

keywords bit-flip faults · fault localization · large language models · transformer architectures · fault recovery · differential analysis · loss sensitivity profiling

The pith

BitFlipScope localizes bit-flip corruptions in LLMs by comparing outputs and activations or profiling loss sensitivity to enable recovery without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents BitFlipScope, a framework for detecting and locating bit-flip faults in large language models caused by hardware faults or attacks. It operates in two modes. When a clean reference model is available, it compares outputs and internal states against that reference. When none is available, it perturbs residual paths and profiles loss sensitivity to locate the fault from the corrupted model alone. The localization then supports lightweight fixes that restore performance. Readers should care because corrupted LLMs can behave unpredictably in real-world deployments, and this offers a way to diagnose and repair them without expensive retraining.
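To make the fault model concrete, here is a minimal Python sketch of what one flipped bit does to a weight stored as an IEEE-754 float32. The helper name `flip_bit_float32` and the example weight 0.5 are illustrative assumptions and do not come from the paper.

```python
import struct

def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, stored as IEEE-754 float32, with one bit inverted (bit 0 = LSB)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the chosen bit and reinterpret the result as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

weight = 0.5  # hypothetical weight value
mantissa_flip = flip_bit_float32(weight, 0)    # lowest mantissa bit: tiny change
exponent_flip = flip_bit_float32(weight, 30)   # top exponent bit: 0.5 becomes 2**127
sign_flip = flip_bit_float32(weight, 31)       # sign bit: 0.5 becomes -0.5

print(mantissa_flip, exponent_flip, sign_flip)
```

The point of the sketch is the asymmetry: a mantissa flip is nearly invisible, while a flip in the high exponent bit turns an ordinary weight into a value around 1.7e38, which is the kind of corruption that can degrade a model badly.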

Core claim

BitFlipScope identifies fault-affected regions in transformer architectures by performing differential analysis of outputs, hidden states, and internal activations when a reference model is available, or by residual-path perturbation and loss-sensitivity profiling when no reference exists. This localization supports lightweight performance recovery without fine-tuning in both cases.
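The reference-based mode can be illustrated with a toy sketch. It assumes a tiny residual stack in place of a transformer, a cosine-similarity threshold of 0.999, and hypothetical helper names (`make_blocks`, `forward`, `first_divergent_block`). None of these are taken from the paper, whose actual statistics and thresholds are not given in this summary.

```python
import math

def make_blocks(n_blocks=5, dim=4):
    """Deterministic small weights so every healthy residual branch stays tiny."""
    return [[[0.01 * ((i + 2 * j + b) % 5 - 2) for j in range(dim)]
             for i in range(dim)] for b in range(n_blocks)]

def forward(blocks, x):
    """Run the toy residual stack and return the hidden state after every block."""
    states = []
    for W in blocks:
        branch = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
        x = [a + b for a, b in zip(x, branch)]
        states.append(x)
    return states

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def first_divergent_block(clean_states, faulty_states, threshold=0.999):
    """Index of the first block whose hidden state departs from the reference."""
    for index, (c, f) in enumerate(zip(clean_states, faulty_states)):
        if cosine(c, f) < threshold:
            return index
    return None

clean = make_blocks()
faulty = [[row[:] for row in W] for W in clean]
faulty[2][0][0] = 2.0 ** 127  # assumed corruption: one weight blown up in block 2

probe = [1.0, -1.0, 0.5, 2.0]  # hypothetical probe input
located = first_divergent_block(forward(clean, probe), forward(faulty, probe))
print(located)
```

Because every block before the corrupted one computes exactly the same values in both models, the first block whose hidden state diverges from the reference is the natural suspect in this simplified setting.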

What carries the argument

Differential analysis of model outputs and hidden states, or residual-path perturbation with loss-sensitivity profiling, to isolate corrupted parameter regions.
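The reference-free mode can be sketched in the same toy setting: scale one block's residual branch by a factor α, record the change in loss, and rank blocks by how strongly the loss reacts. Everything below is an assumption made for illustration: the toy residual stack, the squared-error loss, the α grid (chosen to lie inside the [0.6, 1.4] interval mentioned in the Figure 2 caption), and the scoring rule (mean absolute loss change), which is not the paper's Block Sensitivity Score.

```python
import math

# Assumed grid of residual scaling factors inside [0.6, 1.4]; not the paper's exact set.
ALPHAS = (0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.3, 1.4)

def make_blocks(n_blocks=5, dim=4):
    """Deterministic small weights so every healthy residual branch stays tiny."""
    return [[[0.01 * ((i + 2 * j + b) % 5 - 2) for j in range(dim)]
             for i in range(dim)] for b in range(n_blocks)]

def loss(blocks, x, target, scaled_block=None, alpha=1.0):
    """Squared error of the toy stack, with one residual branch optionally scaled."""
    for index, W in enumerate(blocks):
        scale = alpha if index == scaled_block else 1.0
        branch = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
        x = [a + scale * b for a, b in zip(x, branch)]
    return sum((a - t) ** 2 for a, t in zip(x, target))

def sensitivity_scores(blocks, x, target):
    """Mean |delta loss| per block over the alpha grid (a hypothetical score)."""
    base = loss(blocks, x, target)
    return [
        sum(abs(loss(blocks, x, target, b, a) - base) for a in ALPHAS) / len(ALPHAS)
        for b in range(len(blocks))
    ]

blocks = make_blocks()
blocks[3][0][0] = 2.0 ** 127   # assumed corruption in block 3; no clean copy is used below

probe = [1.0, -1.0, 0.5, 2.0]   # hypothetical input
target = [1.0, -1.0, 0.5, 2.0]  # hypothetical label for the toy loss

scores = sensitivity_scores(blocks, probe, target)
suspect = max(range(len(scores)), key=scores.__getitem__)
print(suspect)
```

In this toy, healthy branches contribute very little, so scaling them barely moves the loss, while scaling the corrupted branch moves it a lot. Whether real transformer blocks separate this cleanly is exactly the empirical question the review raises.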

If this is right

Localized faults allow targeted corrections instead of full model retraining.
Models can be restored in environments without access to clean references.
Deployment in hardware-vulnerable settings becomes more feasible.
Adversarial fault injections like Rowhammer can be countered more effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might apply to other neural network architectures beyond transformers.
Combining it with runtime monitoring could enable automatic self-repair in deployed systems.
Further work could test its effectiveness against multiple simultaneous bit-flips.

Load-bearing premise

Bit-flip corruptions always produce distinct and detectable changes in outputs, hidden states, or loss sensitivity that stand out from normal model variations.

What would settle it

Observing a set of injected bit-flip faults that cause output changes overlapping completely with those from clean models under varied inputs, resulting in inability to distinguish faults reliably.
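One way to make this test operational is to compare fault statistics against a null distribution collected from clean runs. The sketch below is an assumption for illustration only: the mean-plus-k-standard-deviations threshold, the function names, and all numbers are hypothetical and do not come from the paper.

```python
import statistics

def detection_threshold(null_scores, k=3.0):
    """Threshold set k sample standard deviations above the clean-run mean."""
    return statistics.fmean(null_scores) + k * statistics.stdev(null_scores)

def faults_are_separable(null_scores, fault_scores, k=3.0):
    """True only if every fault statistic exceeds the clean-run threshold."""
    threshold = detection_threshold(null_scores, k)
    return all(score > threshold for score in fault_scores)

# Hypothetical statistics: clean-run variation versus injected faults.
clean_runs = [0.020, 0.030, 0.025, 0.035, 0.030]
clear_faults = [0.9, 1.4]
overlapping_faults = [0.03, 0.9]

print(faults_are_separable(clean_runs, clear_faults))        # True
print(faults_are_separable(clean_runs, overlapping_faults))  # False
```

If a realistic set of injected faults lands in the second case, with statistics inside the clean-run range, the load-bearing premise fails for those faults.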

Figures

Figures reproduced from arXiv: 2512.22174 by Christiana Chamon Garcia, Muhammad Zeeshan Karamat, Sadman Saif.

**Figure 1.** Overview of the BitFlipScope framework for detecting and mitigating bit-flip faults in LLMs. (a) A single bit-flip arising from hardware faults or attack corrupts a transformer block and degrades the model’s output. (b) Fault localization is performed using two approaches: self-referential analysis (left), which identifies abnormal loss sensitivity under residual scaling, and differential analysis (right),…

**Figure 2.** Loss change ∆Loss across a broad range of scaling values α ∈ [0.2, 1.8] for a representative block. Diagnostic sensitivity is highest in the interval [0.6, 1.4], which motivates the selection of scaling values used in subsequent experiments.

**Figure 3.** Hidden-state comparison between clean and bit-flipped…

**Figure 4.** Cosine similarity of attention vs. MLP layer activations

**Figure 5.** Heatmaps of ∆Loss for LLaMA 3.2 3B across the four injected faults. The corrupted block in each case shows a pronounced asymmetric loss pattern under scaling.

**Figure 6.** Block Sensitivity Scores (BSS) for LLaMA 3.2 3B. In all cases, the corrupted block exhibits the highest sensitivity.

**Figure 7.** Heatmaps of ∆Loss for LLaMA 3.1 8B across injected faults.

Original abstract

Large Language Models (LLMs) deployed in practical and safety-critical settings are increasingly susceptible to bit-flip faults caused by hardware degradation, cosmic radiation, or deliberate fault-injection attacks such as Rowhammer. These faults silently corrupt internal parameters and can lead to unpredictable or dangerous model behavior. Localizing these corruptions is essential: without identifying the affected region, it is impossible to diagnose the source of degradation, apply targeted corrective measures, or restore model functionality without resorting to costly fine-tuning or full retraining. This work introduces BitFlipScope, a scalable, software-based framework for identifying fault-affected regions within transformer architectures under two deployment scenarios. When a clean reference model is available, BitFlipScope performs differential analysis of outputs, hidden states, and internal activations for detecting anomalous behavior indicative of corruption to pinpoint or localize faults. When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model. In both settings, the framework not only enables effective fault diagnosis but also supports lightweight performance recovery without fine-tuning, offering a practical path to restoring corrupted models. Together, these capabilities make BitFlipScope an important step toward trustworthy, fault-resilient LLM deployment in hardware-prone and adversarial environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BitFlipScope sketches a software-only way to localize bit-flip faults in LLMs via differential checks or residual perturbation, but the localization step looks fragile because faults spread through attention and residuals.

The paper's main contribution is BitFlipScope, a framework that tries to find corrupted weights in a transformer without retraining. When a clean reference model is on hand it compares outputs, hidden states, and activations to flag anomalies. When no reference exists it perturbs residual paths and profiles loss sensitivity to guess which regions are hit, then claims a lightweight recovery step that avoids fine-tuning. That second mode is the freshest part; most prior fault work either assumes a golden model or falls back to full retraining. The framing around radiation, Rowhammer, and safety-critical deployments is also straightforward and useful. The authors correctly note that identifying the bad region is a prerequisite for any targeted fix.

The soft spot is the central assumption that single-bit flips leave detectable, spatially confined signatures. In a transformer a weight flip immediately mixes into many token representations through matrix multiplies and attention, so the deviation is distributed rather than pinned to one layer or head. The abstract gives no precision-recall numbers, no layer-wise maps, and no ablation against normal run-time variance or benign weight noise, so it is unclear whether the proposed statistics actually separate faults from background. If the full paper contains controlled injection experiments on real models with those metrics, the claim strengthens; otherwise the method risks high false positives.

The work is aimed at people building reliable inference stacks or studying hardware faults in ML. A reader who needs concrete ideas for fault diagnosis will find usable starting points even if the validation is still light. I would send it to peer review because the problem is timely and the two-mode design is worth testing, though the referees will need to press hard on experimental evidence for the localization accuracy.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces BitFlipScope, a software-based framework for localizing bit-flip faults in transformer-based LLMs. In the presence of a clean reference model, it performs differential analysis on outputs, hidden states, and activations. Without a reference, it employs residual-path perturbation and loss-sensitivity profiling to infer fault locations from the corrupted model alone. The framework also claims to support lightweight recovery without fine-tuning.

Significance. If the localization and recovery methods prove reliable, the work would address a practical need for diagnosing hardware-induced faults in deployed LLMs without full retraining, which is relevant for safety-critical and adversarial settings. The dual reference/no-reference design is a pragmatic contribution. However, the complete absence of experimental results, datasets, metrics, or validation details in the manuscript makes it impossible to assess whether the claimed accuracy or effectiveness holds.

major comments (2)

[Abstract] The central claim that bit-flip corruptions produce detectable, spatially localized signatures in outputs, hidden states, or loss sensitivity (distinguishable from normal run-time variation) is presented without any supporting quantitative evidence such as precision-recall curves, layer-wise sensitivity maps, or ablation studies on false-positive rates. This is load-bearing for both the differential-analysis and perturbation-based localization claims.
[Abstract] The assertion that the framework enables 'lightweight performance recovery without fine-tuning' lacks any description of the recovery mechanism, success criteria, or comparison to baselines, leaving the recovery contribution unsupported.
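For readers unfamiliar with the metrics requested above, the following minimal sketch shows how precision and recall could be computed for block-level localization. The function name and the block indices are hypothetical; they are not results from the paper.

```python
def precision_recall(predicted_blocks, corrupted_blocks):
    """Precision and recall of flagged blocks against the truly corrupted ones."""
    predicted = set(predicted_blocks)
    actual = set(corrupted_blocks)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical trial: the localizer flags three blocks, four were actually corrupted.
precision, recall = precision_recall([3, 7, 12], [3, 12, 20, 25])
print(precision, recall)
```

Low precision would correspond to the false-positive concern raised in the first comment, and low recall to faults whose signatures are too diffuse to localize.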

minor comments (1)

[Abstract] The abstract would benefit from a concise statement of the evaluation methodology, datasets used, and key quantitative results to allow readers to immediately gauge the strength of the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical relevance of addressing bit-flip faults in deployed LLMs. We agree that the current manuscript version lacks the quantitative experimental support needed to substantiate the claims, and we will revise accordingly to include detailed validation.

Referee: [Abstract] The central claim that bit-flip corruptions produce detectable, spatially localized signatures in outputs, hidden states, or loss sensitivity (distinguishable from normal run-time variation) is presented without any supporting quantitative evidence such as precision-recall curves, layer-wise sensitivity maps, or ablation studies on false-positive rates. This is load-bearing for both the differential-analysis and perturbation-based localization claims.

Authors: We agree that the abstract presents the localization claims without accompanying quantitative evidence. The manuscript describes the differential analysis and residual-path perturbation methods, but we acknowledge the absence of empirical results. In the revision we will add a full experimental section reporting precision-recall curves, layer-wise sensitivity maps, and false-positive-rate ablations obtained from controlled bit-flip injection experiments on transformer models. These results will demonstrate that the observed signatures are distinguishable from normal run-time variation.

revision: yes

Referee: [Abstract] The assertion that the framework enables 'lightweight performance recovery without fine-tuning' lacks any description of the recovery mechanism, success criteria, or comparison to baselines, leaving the recovery contribution unsupported.

Authors: We agree that the recovery claim is currently unsupported by description or evidence. The manuscript mentions lightweight recovery based on localized fault information, but provides no mechanism details. In the revision we will expand the relevant section with a precise description of the recovery procedure, explicit success criteria (e.g., accuracy restoration thresholds), and direct comparisons against baselines such as full fine-tuning and other fault-tolerance techniques.

revision: yes

Circularity Check

0 steps flagged

No circularity detected; framework described as direct behavioral analysis without self-referential derivations

The provided abstract and summary describe BitFlipScope as performing differential analysis of outputs/hidden states or residual-path perturbation and loss-sensitivity profiling. No equations, parameter-fitting steps, predictions derived from fitted inputs, self-citations, uniqueness theorems, or ansatzes are present that would reduce any claim to its own inputs by construction. The method is framed as empirical analysis of model behavior, with no load-bearing derivation chain visible. This is the expected non-finding for a descriptive systems paper without mathematical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] M. Raza, Z. Jahangir, M. B. Riaz, M. J. Saeed, and M. A. Sattar, “Industrial applications of large language models,” Scientific Reports, vol. 15, Apr. 2025.
[2] B. C. Das, M. H. Amini, and Y. Wu, “Security and privacy challenges of large language models: A survey,” ACM Comput. Surv., vol. 57, Feb. 2025.
[3] A. Almalky, S. Ahmed, R. Zhou, M. A. Nahian, A. A. Arafat, S. Angizi, and A. S. Rakin, “Llwra: Large language models weight replacement attack,” in 2025 International Conference on Control, Automation and Diagnosis (ICCAD), pp. 1–6, 2025.
[4] F. Yao, A. S. Rakin, and D. Fan, “DeepHammer: Depleting the intelligence of deep neural networks through targeted chain of bit flips,” in 29th USENIX Security Symposium (USENIX Security 20), pp. 1463–1480, USENIX Association, Aug. 2020.
[5] A. S. Rakin, Z. He, and D. Fan, “Bit-flip attack: Crushing neural network with progressive bit search,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1211–1220, 2019.
[6] S. Das, S. Bhattacharya, S. Kundu, S. Kundu, A. Menon, A. Raha, and K. Basu, “Genbfa: An evolutionary optimization approach to bit-flip attacks on llms,” 2025.
[7] Q. Liu, W. Wen, and Y. Wang, “Concurrent weight encoding-based detection for bit-flip attack on neural network accelerators,” in Proceedings of the 39th International Conference on Computer-Aided Design, ICCAD ’20, (New York, NY, USA), Association for Computing Machinery, 2020.
[8] N. Nazari, H. M. Makrani, C. Fang, H. Sayadi, S. Rafatirad, K. N. Khasawneh, and H. Homayoun, “Forget and rewire: Enhancing the resilience of transformer-based models against Bit-Flip attacks,” in 33rd USENIX Security Symposium (USENIX Security 24), (Philadelphia, PA), pp. 1349–1366, USENIX Association, Aug. 2024.
[9] Q. Liu, J. Yin, W. Wen, C. Yang, and S. Sha, “NeuroPots: Realtime proactive defense against Bit-Flip attacks in neural networks,” in 32nd USENIX Security Symposium (USENIX Security 23), (Anaheim, CA), pp. 6347–6364, USENIX Association, Aug. 2023.
[10] F. Baradaran, M. Raji, A. Baradaran, A. Baradaran, and R. Akbarifard, “Zero memory overhead approach for protecting vision transformer parameters against bit-flip faults,” in 2025 29th International Computer Conference, Computer Society of Iran (CSICC), pp. 1–5, 2025.
[11] Meta AI, “Introducing Meta Llama 3: The most capable openly available llm to date.” https://ai.meta.com/research/publications/introducing-meta-llama-3, 2024. Accessed: 2025-02-15.
[12] T. Feraco and E. Toffalini, “Sembeddings: how to evaluate model misfit before data collection using large-language models,” Frontiers in Psychology, vol. 15, 2025.
[13] R. Zhuang, T. Wu, Z. Wen, A. Li, J. Jiao, and K. Ramchandran, “Embedllm: Learning compact representations of large language models,” Oct. 2024.
[14] C. Qian, M. Zhang, Y. Nie, S. Lu, and H. Cao, “A survey of bit-flip attacks on deep neural network and corresponding defense methods,” Electronics, vol. 12, no. 4, 2023.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[16] D. Hendrycks, C. Basart, S. Kadavath, M. Mazeika, A. Arora, E. He, N. Carlini, J. Schulman, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in International Conference on Learning Representations (ICLR), 2021.
[17] Y. Fan, Y. Hong, Q. Wang, J. Bao, H. Jiang, and Y. Song, “Preference-oriented supervised fine-tuning: Favoring target model over aligned large language models,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 23859–23867, Apr. 2025.
