Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer
Pith reviewed 2026-05-08 03:38 UTC · model grok-4.3
The pith
A differentiable alignment transfers node importance scores from smaller language models to larger ones by optimizing a soft faithfulness objective.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B to 3B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective.
What carries the argument
Differentiable Faithfulness Alignment (DFA), a learned mapping from source node importance scores to the target model that is optimized by a soft faithfulness loss to preserve task performance.
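The mechanism described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the mapping is taken to be a plain linear matrix `W`, the soft mask is a sigmoid of the projected scores, the "target model" is a frozen sum of per-node contributions, and the loss is a squared output gap plus a sparsity penalty `lam`. The paper's actual parameterization and objective are not specified in this summary.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aligned_mask(W, source_scores):
    """Project source scores through W, then squash to a soft mask in (0, 1)."""
    return [sigmoid(sum(w * s for w, s in zip(row, source_scores))) for row in W]


def soft_faithfulness_loss(mask, clean, ablated, lam):
    """Toy objective (assumed form): squared gap to the full output plus sparsity.

    Returns (loss, gap) so the caller can reuse gap in the gradient.
    """
    full = sum(clean)
    masked = sum(m * c + (1.0 - m) * a for m, c, a in zip(mask, clean, ablated))
    gap = masked - full
    return gap * gap + lam * sum(mask), gap


def train_alignment(source_scores, clean, ablated, lam=0.01, lr=0.5, steps=300, seed=0):
    """Gradient descent on W only; the toy target model stays frozen."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in source_scores] for _ in clean]
    history = []
    for _ in range(steps):
        mask = aligned_mask(W, source_scores)
        loss, gap = soft_faithfulness_loss(mask, clean, ablated, lam)
        history.append(loss)
        for j, m in enumerate(mask):
            d_mask = 2.0 * gap * (clean[j] - ablated[j]) + lam  # dL/dm_j
            d_logit = d_mask * m * (1.0 - m)  # chain rule through the sigmoid
            for k, s in enumerate(source_scores):
                W[j][k] -= lr * d_logit * s  # dL/dW_jk
    return W, history


# Hypothetical toy data: 3 source nodes, 5 target nodes.
source_scores = [0.9, 0.1, 0.5]
clean = [2.0, 0.0, 1.0, 0.0, 0.5]      # per-node contribution when kept
ablated = [0.0, 0.0, 0.0, 0.0, 0.0]    # per-node contribution when ablated
W, history = train_alignment(source_scores, clean, ablated)
final_mask = aligned_mask(W, source_scores)
```

In this toy, target nodes that contribute to the output are pushed toward mask values near 1, while nodes with no contribution drift downward under the sparsity term. Note that `W` here is unconstrained, so the sketch shares the weakness raised in the referee report: nothing forces the mask to depend on the source scores in a meaningful way.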
If this is right
- On Llama-3 1B to 3B, DFA circuits are competitive with or stronger than direct node attribution in faithfulness for the tested tasks.
- Zero-shot transfer of the learned alignment works without additional training in some settings.
- Transfer performance drops for larger size gaps and is substantially weaker on Qwen-2.5 than on Llama-3.
- DFA outperforms simple baselines across factual retrieval, reasoning, and arithmetic tasks.
- In some cases the method recovers target-model circuits whose faithfulness matches or exceeds that of direct attribution.
Where Pith is reading between the lines
- The method could lower the cost of circuit discovery for very large models by reusing smaller ones as starting points.
- It suggests mechanistic priors might be chained across successive model scales as training progresses.
- Architecture differences appear to limit transfer, pointing to a need for alignment methods that account for model family structure.
- The soft faithfulness objective might extend to transferring other interpretability signals such as activation patterns beyond node importance.
Load-bearing premise
Node importance scores from a smaller source model contain information that can be aligned differentiably to serve as useful mechanistic priors for a larger target without full circuit discovery on the target.
What would settle it
If DFA-aligned circuits on a new source-target pair produce substantially lower faithfulness scores or task accuracy than circuits found by direct attribution on the target, the transfer claim would fail for that pair.
Original abstract
Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B→3B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens for larger source-target gaps and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scaling differences increase. Overall, DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution. These results suggest that smaller models can provide useful mechanistic priors for larger ones, while highlighting both the promise and the limits of node-level cross-model circuit alignment. Code is available at https://github.com/jasonshaoshun/dfa-circuits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source LM to a larger target LM by projecting source node importance scores into the target via a learned differentiable mapping trained against a soft faithfulness objective. This avoids full circuit discovery on the target. Evaluations on Llama-3 (1B→3B) and Qwen-2.5 models across factual retrieval, multiple-choice reasoning, and arithmetic tasks show that DFA is often competitive with, or exceeds, direct node attribution on smaller gaps, outperforms simple baselines, and transfers zero-shot in some cases. Performance degrades for larger gaps and on Qwen-2.5.
Significance. If the transfer interpretation holds, DFA offers a scalable route to mechanistic interpretability by using smaller models as mechanistic priors for larger ones, reducing the cost of circuit discovery. The empirical results on Llama-3 pairs and the code release are concrete strengths; the observed limits with scale/architectural gaps provide useful boundary conditions for the approach.
major comments (3)
- The central transfer claim requires that source node scores remain load-bearing after alignment. The manuscript does not specify the parameterization or capacity of the learned mapping (e.g., whether it is a low-rank or tied function) nor any auxiliary loss that penalizes deviation from source relative importances. Without such constraints, gradient descent on the faithfulness objective alone can reassign importances to maximize the target objective independently of the source, turning DFA into target-only optimization. This must be clarified with an explicit definition of the mapping and a control experiment replacing source scores with random or uniform values.
- Recovery is reported as weaker for larger source-target gaps and substantially lower on Qwen-2.5. The manuscript should quantify this with per-task faithfulness scores, statistical significance tests, and an ablation showing whether the drop is due to architectural mismatch or failure of the alignment to preserve source structure (e.g., correlation between source and aligned scores before/after training).
- The soft faithfulness objective is central to training but its precise formulation, including any post-hoc choices in evaluation metrics or circuit extraction thresholds, is not detailed enough to assess whether the reported competitiveness with direct attribution is robust or sensitive to those choices.
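The control experiment requested in the first major comment could be set up along the following lines. This is a minimal sketch under stated assumptions: the function name `control_scores` and the three control kinds are hypothetical, and the manuscript may define its controls differently. The idea is to train the same mapping once per control vector and compare the resulting target faithfulness against the run that uses real source scores.

```python
import random


def control_scores(source_scores, kind, seed=0):
    """Return a stand-in for the source scores that removes source information.

    kind (all hypothetical names):
      "uniform"  - every node gets the mean score (no ranking information)
      "shuffled" - same values, random assignment to nodes
      "random"   - fresh draws from the observed score range
    """
    rng = random.Random(seed)
    if kind == "uniform":
        mean = sum(source_scores) / len(source_scores)
        return [mean] * len(source_scores)
    if kind == "shuffled":
        shuffled = list(source_scores)
        rng.shuffle(shuffled)
        return shuffled
    if kind == "random":
        lo, hi = min(source_scores), max(source_scores)
        return [rng.uniform(lo, hi) for _ in source_scores]
    raise ValueError(f"unknown control kind: {kind!r}")
```

If faithfulness under these controls matches faithfulness under the real source scores, the mapping is effectively performing target-only optimization, which is the failure mode the referee describes.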
minor comments (2)
- The abstract and introduction would benefit from a clearer statement of the exact node sets being aligned (e.g., attention heads, MLPs, or residual streams) and how the projection handles differing layer counts between source and target.
- Table or figure captions should explicitly state the number of runs, random seeds, and error bars for all reported faithfulness numbers to allow assessment of variability.
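The reporting asked for in the second minor comment amounts to summarizing each faithfulness number over repeated runs. A minimal sketch, assuming one scalar faithfulness score per random seed; the function name `summarize_runs` and the example values are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Summarize per-seed scores as mean, sample standard deviation, and run count."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to report a standard deviation")
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample (n - 1) standard deviation
        "n": len(scores),
    }


# Placeholder values for three hypothetical seeds.
summary = summarize_runs([0.30, 0.34, 0.32])
```

A caption could then state the number of runs alongside each mean and standard deviation, which makes small differences between methods easier to judge.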
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the transfer mechanism and improve the empirical rigor of the work. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: The central transfer claim requires that source node scores remain load-bearing after alignment. The manuscript does not specify the parameterization or capacity of the learned mapping (e.g., whether it is a low-rank or tied function) nor any auxiliary loss that penalizes deviation from source relative importances. Without such constraints, gradient descent on the faithfulness objective alone can reassign importances to maximize the target objective independently of the source, turning DFA into target-only optimization. This must be clarified with an explicit definition of the mapping and a control experiment replacing source scores with random or uniform values.
  Authors: We agree that an explicit definition of the mapping and a control experiment are necessary to substantiate that source scores remain load-bearing. The current manuscript describes the mapping as a learned differentiable projection of source node importance scores but does not detail its parameterization. In the revision we will add a precise definition (including capacity and any weight-tying) in the Methods section. We will also include the requested control experiment that replaces source scores with random or uniform values and reports the resulting drop in target faithfulness, thereby demonstrating that the alignment depends on source-derived information rather than independent target optimization.
  Revision: yes
- Referee: Recovery is reported as weaker for larger source-target gaps and substantially lower on Qwen-2.5. The manuscript should quantify this with per-task faithfulness scores, statistical significance tests, and an ablation showing whether the drop is due to architectural mismatch or failure of the alignment to preserve source structure (e.g., correlation between source and aligned scores before/after training).
  Authors: We will expand the results section to report per-task faithfulness scores for all model pairs and tasks. Where multiple runs are available we will add statistical significance tests. We will further include an ablation that computes the Pearson correlation between source node scores and the aligned scores both before and after training; this will help distinguish whether performance degradation stems from architectural mismatch or from the alignment failing to preserve source structure.
  Revision: yes
- Referee: The soft faithfulness objective is central to training but its precise formulation, including any post-hoc choices in evaluation metrics or circuit extraction thresholds, is not detailed enough to assess whether the reported competitiveness with direct attribution is robust or sensitive to those choices.
  Authors: We will provide the complete mathematical formulation of the soft faithfulness objective in the Methods section, together with the exact post-hoc choices used for evaluation metrics and any circuit extraction thresholds. This added detail will allow readers to assess robustness and will be accompanied by a brief sensitivity analysis where feasible.
  Revision: yes
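The Pearson-correlation ablation promised in the second response can be sketched as follows. One assumption is worth stating loudly: source and target models have different node counts, so the two score vectors must first be brought onto a shared index (for example by pooling scores per relative-depth bin). That pooling step is hypothetical and is not described in this summary; the function below only covers the correlation itself.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

Computing this once for the zero-shot mapping and once after training would show how far optimization moves the aligned scores away from the source structure. A large drop would point toward the alignment discarding source information rather than toward architectural mismatch alone.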
Circularity Check
No significant circularity; derivation relies on external faithfulness objective and empirical comparison.
Full rationale
The paper defines DFA as a learned projection of source node importance scores trained via an independent soft faithfulness objective on the target model, then evaluates the resulting circuits against direct attribution baselines. No equations or steps reduce the output to the input by construction, no self-definitional mappings appear, and no load-bearing claims rest on self-citations or renamed known results. The method is presented as an optimization procedure whose success is measured externally rather than tautologically, making the derivation self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- alignment mapping parameters
invented entities (1)
- Differentiable Faithfulness Alignment (DFA): no independent evidence
Appendix tables
[Tables 11-16 (Appendix B, Validating the Alignment Mechanism): Validation and Ablation Study. Each table reports the zero-shot faithfulness drop under various structural corruptions using CPR. Columns: IOI, MCQA, Arith +, Arith -, ARC-E, ARC-C. Rows: Random W (Lower Bound), Scrambled inputs, Permuted W columns, Heuristic Depth Mean, DFA (Zero-shot), DFA (Best), grouped by attribution method (including NAP and NAP-IG-Activations). Cell values were not recoverable from the extraction.]
- Table 11: target llama3-3b, source llama3-1b
- Table 12: target llama3-8b, source llama3-1b
- Table 13: target llama3-8b, source llama3-3b
- Table 14: target qwen2.5-1.5b, source qwen2.5-0.5b
- Table 15: target qwen2.5-3b, source qwen2.5-0.5b
- Table 16: target qwen2.5-3b, source qwen2.5-1.5b