Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer

Anna Korhonen; Binxu Wang; Shay B. Cohen; Shun Shao; Yonatan Belinkov

arxiv: 2604.24302 · v1 · submitted 2026-04-27 · 💻 cs.CL

Pith reviewed 2026-05-08 03:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords: differentiable faithfulness alignment · cross-model circuit transfer · mechanistic interpretability · language model circuits · node importance alignment · faithfulness objective · circuit discovery · model scaling

The pith

A differentiable alignment transfers node importance scores from smaller language models to larger ones by optimizing a soft faithfulness objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Differentiable Faithfulness Alignment as a way to move circuit information between language models without running full circuit discovery on the larger target. It learns a mapping that takes importance scores from a small source model and projects them onto the target, then tunes the mapping so the transferred circuits keep the target model's behavior on tasks like factual retrieval and arithmetic. This avoids the high cost of direct methods on big models and uses the source as a prior. The evaluations show the approach works best when the source and target are close in size and architecture, such as Llama-3 1B to 3B. It beats simple baselines and sometimes matches or exceeds direct attribution in faithfulness.

Core claim

We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B to 3B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective.

What carries the argument

Differentiable Faithfulness Alignment (DFA), a learned mapping from source node importance scores to the target model that is optimized by a soft faithfulness loss to preserve task performance.
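
One way to write this mapping down, reading off the Figure 1 caption, is sketched below. It is a reconstruction under assumptions, not the paper's stated formulation: the source names the alignment matrix $W$, the soft mask, the clean/corrupted interpolation, the faithfulness loss, and an L1 sparsity penalty, but the symbols $s^{\text{src}}$, $\tau$, $\lambda$, $h^{\text{clean}}$, $h^{\text{corr}}$, the sigmoid squashing, and the choice to place the L1 penalty on the mask rather than on $W$ are illustrative guesses.

```latex
\begin{aligned}
\tilde{s} &= W\, s^{\text{src}}, \qquad W \in \mathbb{R}^{n_{\text{tgt}} \times n_{\text{src}}} \\
m &= \sigma\!\left(\tilde{s} / \tau\right) \in (0, 1)^{n_{\text{tgt}}} \\
h &= m \odot h^{\text{clean}} + (1 - m) \odot h^{\text{corr}} \\
\mathcal{L}(W) &= \mathcal{L}_{\text{faith}}\!\left(p_{\text{intervened}},\, p_{\text{clean}}\right) + \lambda \lVert m \rVert_1
\end{aligned}
```

Under this reading, a mask entry near 1 keeps a target node's clean activation (the node is in the circuit) and an entry near 0 replaces it with the corrupted activation. Only $W$ is trained; the source scores $s^{\text{src}}$ are fixed inputs.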

If this is right

On Llama-3 1B to 3B, DFA circuits are competitive with or stronger than direct node attribution in faithfulness for the tested tasks.
Zero-shot transfer of the learned alignment works without additional training in some settings.
Transfer performance drops for larger size gaps and is substantially weaker on Qwen-2.5 than on Llama-3.
DFA outperforms simple baselines across factual retrieval, reasoning, and arithmetic tasks.
In some cases the method recovers target-model circuits whose faithfulness matches or exceeds that of direct attribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could lower the cost of circuit discovery for very large models by reusing smaller ones as starting points.
It suggests mechanistic priors might be chained across successive model scales as training progresses.
Architecture differences appear to limit transfer, pointing to a need for alignment methods that account for model family structure.
The soft faithfulness objective might extend to transferring other interpretability signals such as activation patterns beyond node importance.

Load-bearing premise

Node importance scores from a smaller source model contain information that can be aligned differentiably to serve as useful mechanistic priors for a larger target without full circuit discovery on the target.

What would settle it

If DFA-aligned circuits on a new source-target pair produce substantially lower faithfulness scores or task accuracy than circuits found by direct attribution on the target, the transfer method would not succeed.
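
A comparison of this kind can be made concrete by summarizing each faithfulness curve (faithfulness f against the proportion of retained nodes k, as in Figure 2) with a single number. The sketch below uses the trapezoidal area under each curve and reports the aligned-to-gold ratio. This is an assumed summary statistic chosen for illustration; it is not claimed to be the paper's definition of CPR or of its faithfulness recovery ratio, and the curve values are made up.

```python
from typing import Sequence


def area_under_curve(ks: Sequence[float], fs: Sequence[float]) -> float:
    """Trapezoidal area under a faithfulness curve f(k)."""
    if len(ks) != len(fs) or len(ks) < 2:
        raise ValueError("need at least two matching (k, f) points")
    area = 0.0
    for i in range(1, len(ks)):
        width = ks[i] - ks[i - 1]
        if width <= 0:
            raise ValueError("k values must be strictly increasing")
        area += 0.5 * width * (fs[i] + fs[i - 1])
    return area


def recovery_ratio(ks: Sequence[float],
                   aligned_f: Sequence[float],
                   gold_f: Sequence[float]) -> float:
    """Area under the aligned curve divided by area under the gold curve."""
    gold_area = area_under_curve(ks, gold_f)
    if gold_area == 0:
        raise ValueError("gold curve has zero area")
    return area_under_curve(ks, aligned_f) / gold_area


# Hypothetical curves, for illustration only (not values from the paper).
ks = [0.0, 0.5, 1.0]
gold = [0.0, 0.8, 1.0]
aligned = [0.0, 0.6, 1.0]
ratio = recovery_ratio(ks, aligned, gold)
```

A ratio well below 1 on a new source-target pair would count against the transfer claim; a ratio near or above 1 would support it.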

Figures

Figures reproduced from arXiv: 2604.24302 by Anna Korhonen, Binxu Wang, Shay B. Cohen, Shun Shao, Yonatan Belinkov.

**Figure 1.** Overview of Differentiable Faithfulness Alignment (DFA). Source-model node importance scores are projected into the target model through a learned alignment matrix W and converted into a soft mask mL. This mask interpolates between clean and corrupted target-model activations, producing an intervened output distribution. The alignment matrix is then optimized with a faithfulness loss and an L1 sparsity pen…

**Figure 2.** Direct target-model circuits (Gold) and DFA-predicted circuits (Aligned) on ARITHMETIC-SUBTRACTION for three Llama-3 source–target pairs. The x-axis shows the proportion of retained nodes k, and the y-axis shows faithfulness f. Aligned circuits closely track the gold faithfulness curves and often achieve higher CPR.

**Figure 3.** Mean faithfulness recovery ratio across tasks for each attribution method and…

**Figure 4.** Cross-task transfer matrix for NAP-IG-INPUTS on Llama-3, evaluated with CPR. Rows denote training tasks and columns evaluation tasks. Entries report faithfulness of transferred circuits; strong off-diagonal values indicate cross-task generalization.

**Figure 5.** Best aligned circuits (Aligned) versus direct target-model circuits (Gold) across tasks for NAP-IG-ACTIVATIONS on Llama-3 (1B→3B). Across tasks, aligned circuits recover similar faithfulness curves and often achieve comparable or higher CPR.

**Figure 6.** Per-task comparison of direct target-model circuits (…

**Figure 7.** Per-task comparison of direct target-model circuits (…

**Figure 8.** Per-task comparison of direct target-model circuits (…

**Figure 9.** Per-task comparison of direct target-model circuits (…

**Figure 10.** Per-task comparison of direct target-model circuits (…

**Figure 11.** Mean faithfulness recovery ratio across tasks for each attribution method and…

**Figure 12.** Mean faithfulness recovery ratio across tasks for each attribution method and…

**Figure 13.** Cross-task transfer matrix for EAP-IG-ACTIVATIONS on LLaMA-3, evaluated using CPR. Each row denotes the task used to train the alignment, and each column denotes the evaluation task. Entries report the faithfulness of the transferred circuit in the target model. Diagonal entries correspond to matched-task transfer, while off-diagonal entries measure cross-task generalization. The matrix is not symmetric, …

**Figure 14.** Cross-task transfer matrix for EAP on LLaMA-3, evaluated using CPR. Each row denotes the task used to train the alignment, and each column denotes the evaluation task. Entries report the faithfulness of the transferred circuit in the target model. Diagonal entries correspond to matched-task transfer, while off-diagonal entries measure cross-task generalization. The matrix is not symmetric, indicating that…

**Figure 15.** Cross-task transfer matrix for EAP-IG-INPUTS on Qwen-2.5, evaluated using CPR. Each row denotes the task used to train the alignment, and each column denotes the evaluation task. Entries report the faithfulness of the transferred circuit in the target model. Diagonal entries correspond to matched-task transfer, while off-diagonal entries measure cross-task generalization. The matrix is not symmetric, indi…

**Figure 16.** Cross-task transfer matrix for EAP-IG-ACTIVATIONS on Qwen-2.5, evaluated using CPR. Each row denotes the task used to train the alignment, and each column denotes the evaluation task. Entries report the faithfulness of the transferred circuit in the target model. Diagonal entries correspond to matched-task transfer, while off-diagonal entries measure cross-task generalization. The matrix is not symmetric,…

**Figure 17.** Cross-task transfer matrix for EAP on Qwen-2.5, evaluated using CPR. Each row denotes the task used to train the alignment, and each column denotes the evaluation task. Entries report the faithfulness of the transferred circuit in the target model. Diagonal entries correspond to matched-task transfer, while off-diagonal entries measure cross-task generalization. The matrix is not symmetric, indicating tha…

Original abstract

Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B→3B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens for larger source–target gaps and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scaling differences increase. Overall, DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution. These results suggest that smaller models can provide useful mechanistic priors for larger ones, while highlighting both the promise and the limits of node-level cross-model circuit alignment.

Code is available at https://github.com/jasonshaoshun/dfa-circuits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DFA gives a workable differentiable route for aligning node scores across model sizes, but the transfer story rests on unshown details of how much the source actually constrains the solution.

The paper's main contribution is a learned mapping that takes node importance scores from a smaller source model and projects them into a larger target, then optimizes the mapping parameters with a soft faithfulness objective on the target. This is meant to supply mechanistic priors without running full circuit discovery on the big model. They test it on Llama-3 1B-to-3B and Qwen-2.5 pairs across factual, reasoning, and arithmetic tasks, with code released at the GitHub link in the abstract.

Referee Report

3 major / 2 minor

Summary. The paper introduces Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source LM to a larger target LM by projecting source node importance scores into the target via a learned differentiable mapping trained against a soft faithfulness objective. This avoids full circuit discovery on the target. Evaluations on Llama-3 (1B→3B) and Qwen-2.5 models across factual retrieval, multiple-choice reasoning, and arithmetic tasks show DFA often competitive with or exceeding direct node attribution on smaller gaps, outperforming simple baselines, with zero-shot transfer effective in some cases; performance degrades for larger gaps and on Qwen-2.5.

Significance. If the transfer interpretation holds, DFA offers a scalable route to mechanistic interpretability by using smaller models as mechanistic priors for larger ones, reducing the cost of circuit discovery. The empirical results on Llama-3 pairs and the code release are concrete strengths; the observed limits with scale/architectural gaps provide useful boundary conditions for the approach.

major comments (3)

The central transfer claim requires that source node scores remain load-bearing after alignment. The manuscript does not specify the parameterization or capacity of the learned mapping (e.g., whether it is a low-rank or tied function) nor any auxiliary loss that penalizes deviation from source relative importances. Without such constraints, gradient descent on the faithfulness objective alone can reassign importances to maximize the target objective independently of the source, turning DFA into target-only optimization. This must be clarified with an explicit definition of the mapping and a control experiment replacing source scores with random or uniform values.
Recovery is reported as weaker for larger source-target gaps and substantially lower on Qwen-2.5. The manuscript should quantify this with per-task faithfulness scores, statistical significance tests, and an ablation showing whether the drop is due to architectural mismatch or failure of the alignment to preserve source structure (e.g., correlation between source and aligned scores before/after training).
The soft faithfulness objective is central to training but its precise formulation, including any post-hoc choices in evaluation metrics or circuit extraction thresholds, is not detailed enough to assess whether the reported competitiveness with direct attribution is robust or sensitive to those choices.
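
The control requested in the first comment can be sketched in a few lines. The control names below mirror rows that appear in this page's extracted ablation tables (Random W, Scrambled inputs, Permuted W columns), but the exact procedures behind those rows are not described here, so every function is an assumed reading. The scores and the matrix `W` are hypothetical toy values.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(W: Matrix, s: Vector) -> Vector:
    """Map source scores s into target-node space: returns W @ s."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]


def scrambled_inputs(s: Vector, rng: random.Random) -> Vector:
    """Control (assumed reading): shuffle which source node each score belongs to."""
    out = list(s)
    rng.shuffle(out)
    return out


def permuted_columns(W: Matrix, rng: random.Random) -> Matrix:
    """Control (assumed reading): apply one shared column permutation to every row."""
    perm = list(range(len(W[0])))
    rng.shuffle(perm)
    return [[row[j] for j in perm] for row in W]


def random_matrix(n_rows: int, n_cols: int, rng: random.Random) -> Matrix:
    """Control (assumed reading): replace the learned W with Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
s_src = [0.9, 0.1, 0.4]                       # hypothetical source scores
W = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]        # hypothetical learned alignment

controls = {
    "learned": project(W, s_src),
    "scrambled_inputs": project(W, scrambled_inputs(s_src, rng)),
    "permuted_columns": project(permuted_columns(W, rng), s_src),
    "random_W": project(random_matrix(2, 3, rng), s_src),
}
```

If the source scores are load-bearing, target faithfulness computed from the `learned` projection should clearly exceed faithfulness computed from each control. Under this particular reading, scrambling the inputs and permuting the columns of `W` are the same operation up to the choice of permutation, so the paper's two controls presumably differ in a detail that is not recoverable from this page.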

minor comments (2)

The abstract and introduction would benefit from a clearer statement of the exact node sets being aligned (e.g., attention heads, MLPs, or residual streams) and how the projection handles differing layer counts between source and target.
Table or figure captions should explicitly state the number of runs, random seeds, and error bars for all reported faithfulness numbers to allow assessment of variability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the transfer mechanism and improve the empirical rigor of the work. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: The central transfer claim requires that source node scores remain load-bearing after alignment. The manuscript does not specify the parameterization or capacity of the learned mapping (e.g., whether it is a low-rank or tied function) nor any auxiliary loss that penalizes deviation from source relative importances. Without such constraints, gradient descent on the faithfulness objective alone can reassign importances to maximize the target objective independently of the source, turning DFA into target-only optimization. This must be clarified with an explicit definition of the mapping and a control experiment replacing source scores with random or uniform values.

Authors: We agree that an explicit definition of the mapping and a control experiment are necessary to substantiate that source scores remain load-bearing. The current manuscript describes the mapping as a learned differentiable projection of source node importance scores but does not detail its parameterization. In the revision we will add a precise definition (including capacity and any weight-tying) in the Methods section. We will also include the requested control experiment that replaces source scores with random or uniform values and reports the resulting drop in target faithfulness, thereby demonstrating that the alignment depends on source-derived information rather than independent target optimization. revision: yes
Referee: Recovery is reported as weaker for larger source-target gaps and substantially lower on Qwen-2.5. The manuscript should quantify this with per-task faithfulness scores, statistical significance tests, and an ablation showing whether the drop is due to architectural mismatch or failure of the alignment to preserve source structure (e.g., correlation between source and aligned scores before/after training).

Authors: We will expand the results section to report per-task faithfulness scores for all model pairs and tasks. Where multiple runs are available we will add statistical significance tests. We will further include an ablation that computes the Pearson correlation between source node scores and the aligned scores both before and after training; this will help distinguish whether performance degradation stems from architectural mismatch or from the alignment failing to preserve source structure. revision: yes
Referee: The soft faithfulness objective is central to training but its precise formulation, including any post-hoc choices in evaluation metrics or circuit extraction thresholds, is not detailed enough to assess whether the reported competitiveness with direct attribution is robust or sensitive to those choices.

Authors: We will provide the complete mathematical formulation of the soft faithfulness objective in the Methods section, together with the exact post-hoc choices used for evaluation metrics and any circuit extraction thresholds. This added detail will allow readers to assess robustness and will be accompanied by a brief sensitivity analysis where feasible. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external faithfulness objective and empirical comparison.

full rationale

The paper defines DFA as a learned projection of source node importance scores trained via an independent soft faithfulness objective on the target model, then evaluates the resulting circuits against direct attribution baselines. No equations or steps reduce the output to the input by construction, no self-definitional mappings appear, and no load-bearing claims rest on self-citations or renamed known results. The method is presented as an optimization procedure whose success is measured externally rather than tautologically, making the derivation self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the existence of transferable node importance structure across models and the effectiveness of a learned differentiable mapping; no explicit free parameters, axioms, or invented entities are detailed in the abstract beyond the introduced DFA method itself.

free parameters (1)

alignment mapping parameters
Parameters of the learned differentiable projection are fitted during training on the faithfulness objective.
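
To make "fitted during training" concrete, here is a toy fit of an alignment matrix by gradient descent. It is a minimal sketch under loud assumptions, not the paper's procedure: the target "model" is a fixed linear readout over three scalar node activations, the faithfulness term is the squared gap to the clean output, the L1 penalty is applied to the mask, and the gradient is derived by hand instead of by automatic differentiation. All names and numbers are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(W: Matrix, s: Vector, clean: Vector, corrupt: Vector,
            readout: Vector, lam: float) -> Tuple[float, Vector, float]:
    """Return (loss, mask, gap) for the toy soft-masked forward pass."""
    z = [sum(w * x for w, x in zip(row, s)) for row in W]        # W @ s
    mask = [sigmoid(v) for v in z]                                # soft mask in (0, 1)
    mixed = [m * c + (1.0 - m) * x for m, c, x in zip(mask, clean, corrupt)]
    y = sum(r * h for r, h in zip(readout, mixed))                # intervened output
    y_clean = sum(r * c for r, c in zip(readout, clean))          # clean output
    gap = y - y_clean
    loss = gap * gap + lam * sum(mask)     # assumed faithfulness term + L1 on mask
    return loss, mask, gap


def grad_step(W: Matrix, s: Vector, clean: Vector, corrupt: Vector,
              readout: Vector, lam: float, lr: float) -> Matrix:
    """One gradient-descent update of W using the hand-derived gradient."""
    _, mask, gap = forward(W, s, clean, corrupt, readout, lam)
    new_W = []
    for j, row in enumerate(W):
        dL_dm = 2.0 * gap * readout[j] * (clean[j] - corrupt[j]) + lam
        dL_dz = dL_dm * mask[j] * (1.0 - mask[j])                 # sigmoid derivative
        new_W.append([w - lr * dL_dz * x for w, x in zip(row, s)])
    return new_W


s_src = [1.0, 0.2]               # hypothetical source node scores
clean = [1.0, 1.0, 1.0]          # clean target activations
corrupt = [0.0, 0.0, 1.0]        # third node is unaffected by corruption
readout = [1.0, 0.5, 1.0]
lam, lr = 0.01, 0.5

W = [[0.0, 0.0] for _ in clean]  # all masks start at 0.5
initial_loss, _, _ = forward(W, s_src, clean, corrupt, readout, lam)
for _ in range(200):
    W = grad_step(W, s_src, clean, corrupt, readout, lam, lr)
final_loss, final_mask, _ = forward(W, s_src, clean, corrupt, readout, lam)
```

In this toy setting the masks for the two nodes that matter rise above 0.5, while the sparsity term slowly pushes the mask of the irrelevant third node below 0.5. Note that the same fit would succeed for almost any nonzero `s_src`, which is the concern behind the referee's first major comment.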

invented entities (1)

Differentiable Faithfulness Alignment (DFA) · no independent evidence
purpose: Framework for projecting and aligning source-model node importance into target models
New method introduced in the paper; no independent evidence provided beyond the reported experiments.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1] Random W (Lower Bound) 0.25 0.25 0.25 0.25 0.27 0.28
[2] Scrambled inputs 0.27 0.25 0.25 0.25 0.25 0.32
[3] Permuted W columns 0.29 0.27 0.25 0.25 0.41 0.31
[4] Heuristic Depth Mean 0.25 0.25 0.25 0.25 0.25 0.25 · DFA (Zero-shot) 0.27 0.46 0.28 0.26 0.44 0.48 · DFA (Best) 0.32 0.49 0.39 0.26 0.46 0.52 · NAP-IG-Activations
[5] Random W (Lower Bound) 0.25 0.25 0.25 0.25 0.26 0.25
[6] Scrambled inputs 0.29 0.31 0.33 0.25 0.34 0.37
[7] Permuted W columns 0.26 0.26 0.25 0.25 0.26 0.27
[8] We report the zero-shot faithfulness drop under various structural corruptions using CPR
Heuristic Depth Mean 0.25 0.25 0.25 0.25 0.25 0.25 · DFA (Zero-shot) 0.33 0.46 0.82 0.28 0.34 0.51 · DFA (Best) 0.33 0.46 0.82 0.45 0.41 0.51 · Table 11: Validation and Ablation Study for target model llama3-3b (source llama3-1b) by NAP and NAP-IG-Activations. We report the zero-shot faithfulness drop under various structural corruptions using CPR. Method Settin...
[10] Scrambled inputs 0.33 0.26 0.27 0.25 0.34 0.27
[11] Permuted W columns 0.29 0.27 0.25 0.25 0.26 0.28
[12] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.33 0.51 0.67 0.39 0.47 0.38 · DFA (Best) 0.34 0.51 0.67 0.39 0.47 0.38 · NAP
[14] Scrambled inputs 0.39 0.25 0.26 0.25 0.25 0.30
[15] Permuted W columns 0.31 0.25 0.25 0.25 0.33 0.31
[16] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.27 0.40 0.40 0.26 0.30 0.29 · DFA (Best) 0.33 0.46 0.40 0.33 0.49 0.53 · NAP-IG-Activations
[18] Scrambled inputs 0.56 0.34 0.31 0.25 0.33 0.26
[19] Permuted W columns 0.30 0.26 0.25 0.25 0.26 0.27
[20] We report the zero-shot faithfulness drop under various structural corruptions using CPR
Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.36 0.40 0.41 0.28 0.35 0.40 · DFA (Best) 0.43 0.40 0.41 0.28 0.35 0.40 · Table 12: Validation and Ablation Study for target model llama3-8b (source llama3-1b). We report the zero-shot faithfulness drop under various structural corruptions using CPR. B Validating the Alignment Mechanism Tables 11–16 test whether ...
[22] Scrambled inputs 0.28 0.28 0.25 0.27 0.29 0.49
[23] Permuted W columns 0.25 0.26 0.25 0.25 0.27 0.26
[24] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.57 0.46 0.27 0.46 0.27 0.25 · DFA (Best) 0.57 0.46 0.45 0.67 0.34 0.34 · NAP
[25] Random W (Lower Bound) 0.25 0.25 0.25 0.25 0.25 0.25
[26] Scrambled inputs 0.26 0.26 0.25 0.26 0.25 0.35
[27] Permuted W columns 0.25 0.26 0.25 0.25 0.26 0.26
[28] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.36 0.32 0.27 0.26 0.29 0.27 · DFA (Best) 0.36 0.32 0.40 0.40 0.29 0.29 · NAP-IG-Activations
[29] Random W (Lower Bound) 0.25 0.26 0.25 0.25 0.26 0.26
[30] Scrambled inputs 0.25 0.31 0.25 0.42 0.25 0.29
[31] Permuted W columns 0.29 0.26 0.25 0.25 0.26 0.32
[32] We report the zero-shot faithfulness drop under various structural corruptions using CPR
Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.44 0.50 0.28 0.59 0.36 0.43 · DFA (Best) 0.50 0.54 0.46 0.59 0.56 0.56 · Table 13: Validation and Ablation Study for target model llama3-8b (source llama3-3b). We report the zero-shot faithfulness drop under various structural corruptions using CPR. Method Setting / Control IOI MCQA Arith + Arith - ARC-E ARC-C ...
[34] Scrambled inputs 0.25 0.27 - 0.25 0.25 0.27
[35] Permuted W columns 0.25 0.32 - 0.25 0.29 0.29
[36] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.26 0.25 - 0.39 0.27 0.26 · DFA (Best) 0.30 0.30 - 0.39 0.29 0.29 · NAP
[37] Random W (Lower Bound) 0.25 0.25 - 0.25 0.27 0.27
[38] Scrambled inputs 0.25 0.44 - 0.41 0.28 0.25
[39] Permuted W columns 0.24 0.28 - 0.25 0.27 0.26
[40] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.25 0.27 - 0.25 0.38 0.36 · DFA (Best) 0.28 0.28 - 0.40 0.38 0.36 · NAP-IG-Activations
[41] Random W (Lower Bound) 0.25 0.25 - 0.25 0.28 0.30
[42] Scrambled inputs 0.25 0.43 - 0.25 0.33 0.25
[43] Permuted W columns 0.25 0.25 - 0.25 0.26 0.25
[44] We report the zero-shot faithfulness drop under various structural corruptions using CPR
Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.25 0.28 - 0.40 0.42 0.44 · DFA (Best) 0.29 0.34 - 0.40 0.42 0.44 · Table 14: Validation and Ablation Study for target model qwen2.5-1.5b (source qwen2.5-0.5b). We report the zero-shot faithfulness drop under various structural corruptions using CPR. Method Setting / Control IOI MCQA Ar...
[45] Random W (Lower Bound) 0.25 0.25 - 0.25 0.25 0.25
[46] Scrambled inputs 0.25 0.36 - 0.25 0.25 0.29
[47] Permuted W columns 0.27 0.38 - 0.25 0.35 0.34
[48] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.30 0.32 - 0.25 0.28 0.28 · DFA (Best) 0.30 0.37 - 0.26 0.35 0.34 · NAP
[50] Scrambled inputs 0.25 0.57 - 0.25 0.29 0.25
[51] Permuted W columns 0.29 0.30 - 0.25 0.30 0.29
[52] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.32 0.27 - 0.24 0.32 0.35 · DFA (Best) 0.32 0.37 - 0.26 0.37 0.42 · NAP-IG-Activations
[53] Random W (Lower Bound) 0.25 0.27 - 0.25 0.25 0.25
[54] Scrambled inputs 0.25 0.60 - 0.25 0.24 0.25
[55] Permuted W columns 0.30 0.30 - 0.25 0.31 0.31
[56] We report the zero-shot faithfulness drop under various structural corruptions using CPR
Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.25 0.34 - 0.25 0.34 0.48 · DFA (Best) 0.25 0.36 - 0.26 0.34 0.48 · Table 15: Validation and Ablation Study for target model qwen2.5-3b (source qwen2.5-0.5b). We report the zero-shot faithfulness drop under various structural corruptions using CPR. Method Setting / Control IOI MCQA Arith + Arith - ARC-E ARC-C NAP...
[57] Random W (Lower Bound) 0.25 0.28 - 0.25 0.27 0.27
[58] Scrambled inputs 0.26 0.38 - 0.25 0.28 0.34
[59] Permuted W columns 0.29 0.27 - 0.28 0.31 0.30
[60] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.36 0.33 - 0.27 0.29 0.29 · DFA (Best) 0.36 0.33 - 0.27 0.31 0.31 · NAP
[61] Random W (Lower Bound) 0.25 0.34 - 0.25 0.29 0.25
[62] Scrambled inputs 0.25 0.41 - 0.26 0.25 0.34
[63] Permuted W columns 0.24 0.29 - 0.26 0.27 0.28
[64] Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.23 0.28 - 0.26 0.40 0.38 · DFA (Best) 0.28 0.28 - 0.27 0.40 0.38 · NAP-IG-Activations
[65] Random W (Lower Bound) 0.25 0.39 - 0.26 0.28 0.25
[66] Scrambled inputs 0.25 0.34 - 0.27 0.26 0.29
[67] Permuted W columns 0.23 0.28 - 0.27 0.27 0.27
[68] We report the zero-shot faithfulness drop under various structural corruptions using CPR
Heuristic Depth Mean - - - - - - · DFA (Zero-shot) 0.23 0.30 - 0.26 0.35 0.34 · DFA (Best) 0.29 0.30 - 0.27 0.35 0.34 · Table 16: Validation and Ablation Study for target model qwen2.5-3b (source qwen2.5-1.5b). We report the zero-shot faithfulness drop under various structural corruptions using CPR. Method Target Model Pair Baseline DFA...
