pith. sign in

arxiv: 2605.16023 · v1 · pith:S7E7XSAJnew · submitted 2026-05-15 · 💻 cs.CL · cs.LG

Judge Circuits

Pith reviewed 2026-05-20 19:26 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LLM-as-a-judgemechanistic interpretabilitycircuit discoveryformat inconsistencyPEAPLatent EvaluatorMLP sub-graphcausal intervention
0
0 comments X

The pith

LLM judges share a sparse Latent Evaluator sub-graph in mid-to-late MLP layers that computes evaluation before format-specific output mapping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why large language models used as judges give inconsistent scores when the output format changes, such as switching from a numeric rating to a true/false label. It applies a causal tracing technique called Position-aware Edge Attribution Patching to locate the responsible internal components inside three open models. The central result is that a small set of connections in the middle and later MLP layers forms a generalized sub-graph that handles judgments across both structured and open-ended tasks. Removing this sub-graph stops the model from judging while leaving factual knowledge and other abilities intact in models with modular structure. The account separates the core judgment computation, which runs in a shared trunk, from fragile terminal branches that handle the requested output format.

Core claim

Judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that bench

What carries the argument

The Latent Evaluator sub-graph: a sparse collection of mid-to-late MLP connections identified by PEAP that computes a continuous judgment signal shared across task types before any output formatting occurs.

If this is right

  • Format-induced score differences arise after the judgment signal is formed, from mapping it through format-specific terminal branches rather than from changes in the evaluation itself.
  • Benchmark comparisons of judge reliability across output formats partly measure differences in formatter geometry instead of differences in evaluation quality.
  • Format-independent preference information can be read out downstream of the requested output format once the shared judgment signal is isolated.
  • In models with clear architectural modularity, factual world knowledge survives removal of the judgment sub-graph.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted editing of the identified sub-graph could improve judgment consistency across formats without full retraining.
  • The same causal patching approach might locate other functional sub-graphs that handle distinct capabilities such as reasoning or refusal.
  • Judge models could be architected with an explicit shared evaluation trunk and replaceable format heads to reduce inconsistency by design.

Load-bearing premise

The PEAP intervention isolates a causally responsible sub-graph for judgment rather than a correlated or downstream effect, and the three studied models represent the broader class of decoder-only transformers used as judges.

What would settle it

Zero-ablating the identified sub-graph in an additional decoder-only model and checking whether judgment performance on both structured and preference tasks drops sharply while factual recall and non-judgment capabilities remain largely intact.

Figures

Figures reproduced from arXiv: 2605.16023 by Aaron Louis Eidt, Christopher Ebert, Elena Golimblevskaia, Jing Yang, Nils Feldhus, Qianli Wang, Sebastian M\"oller, Simon Ostermann, Tanja Baeumel, Van Bach Nguyen, Vera Schmitt, Wojciech Samek.

Figure 1
Figure 1. Figure 1: Overview of our pipeline on an MNLI mini￾mal pair: (1) PEAP (Haklay et al., 2025) traces cross￾token causal edges from the differential input tokens into a shared Latent Evaluator sub-circuit (CLE := Crate ∩ Cclass). (2) We validate this circuit three ways: zero-ablation (red ×) isolates evaluation from world knowledge; BDAS (Wu et al., 2023) identifies a 1D judgment direction in the LE’s activation space;… view at source ↗
Figure 2
Figure 2. Figure 2: Sparse circuit faithfulness across the five evaluated models and five rating tasks. Each curve traces median [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cross-task Edge IoU on Gemma-3-12B across [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-task circuit overlap at top-200 across all four architecturally modular models. The shared trunk is recoverable on every model, but the per-pair magnitude reflects each model’s circuit redundancy (§3.2, App. L): smaller models route through fewer equivalent paths, so their top-200 edges are more conserved across tasks, while larger modular models distribute attribution across many equivalent sub-path… view at source ↗
Figure 5
Figure 5. Figure 5: Cross-task Node IoU on Gemma-3-12B across [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cross-domain causal steering of the Latent [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Steering probability heatmap for STS-B → MNLI across intervention strength (x-axis) and pre￾dicted rating tokens (1–5) (y-axis) for Gemma-3-12B. As α increases, probability mass shifts monotonically from lower ratings toward 5, demonstrating smooth, continuous control over the judgment output via a sin￾gle geometric direction. tially weaker across all models (e.g., Qwen2.5-14B MNLI→ RewardBench: 3.15 ± 0.8… view at source ↗
Figure 8
Figure 8. Figure 8: Pairwise cosine similarity between the P C1 direction of the Latent Evaluator’s clean/corrupt activa￾tion difference matrix across CoLA, MNLI, and STS-B (Gemma-3-12B). rule this out via a Haar-uniform random-rotation control on Gemma-3-12B: at α = 2.0 on the same target hooks, the trained rotation moves mean tar￾get EV by −0.42 on CoLA → MNLI and by −0.49 on MNLI→ STS-B, while ten random orthogo￾nal rotati… view at source ↗
Figure 9
Figure 9. Figure 9: Timeline of the geometric token intersection overlap between ordinal rating (1-5) and categorical [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Late-layer Task Formatter attractor geometry [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation phase-transition study (Gemma-3-12B). [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Ablation phase-transition study (Gemma-3-27B). [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Ablation phase-transition study (Qwen2.5-7B-Instruct). [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Ablation phase-transition study (Qwen2.5-14B-Instruct). [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Global judge circuit for MNLI on Gemma-3-27B. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Global judge circuit for CoLA on Gemma-3-27B. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Global judge circuit for STS-B on Gemma-3-12B. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Global judge circuit for MNLI on Qwen2.5-14B [PITH_FULL_IMAGE:figures/full_fig_p031_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Cross-task structural overlap on Gemma-3-12B: LRPEAP (solid) overlaid on PEAP (dashed, faded) at [PITH_FULL_IMAGE:figures/full_fig_p031_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Layer-pair attribution density of the top- [PITH_FULL_IMAGE:figures/full_fig_p032_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Sparse-circuit faithfulness with PEAP (blue) and LRPEAP (green) on Qwen2.5-7B and Gemma-3-12B [PITH_FULL_IMAGE:figures/full_fig_p032_21.png] view at source ↗
read the original abstract

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript uses Position-aware Edge Attribution Patching (PEAP) to causally probe the internal mechanisms of LLM-as-a-judge on Gemma-3, Qwen2.5, and Llama-3. It reports a sparse, shared Latent Evaluator sub-graph in mid-to-late MLPs that supports judgments across structured understanding and open-ended preference tasks; zero-ablating this sub-graph collapses observable judgment while preserving world knowledge in modular architectures. The authors conclude that format-induced score inconsistencies arise because a continuous judgment signal computed in a shared trunk is routed through fragile, format-specific terminal branches.

Significance. If the causal isolation holds, the work supplies a mechanistic account of format sensitivity that could guide more reliable LLM judges by separating core evaluation from output formatting. The multi-model, multi-task design and emphasis on empirical interventions rather than fitted parameters are positive features that increase the potential impact within mechanistic interpretability of evaluation systems.

major comments (2)
  1. [Abstract] Abstract: the claim that zero-ablating the identified sub-graph removes a generalized judgment computation (rather than a downstream formatter or router) is central to the interpretation, yet the manuscript does not report controls confirming that the same ablation leaves other mid-to-late MLP behaviors intact (e.g., chain-of-thought reasoning or factual completion without an explicit judgment token). Without these controls, the distinction between upstream evaluator and correlated routing effect remains untested.
  2. [Abstract] Abstract / Results: the reported ablation results are described as consistent across three models and two task types, but no quantitative effect sizes, confidence intervals, or details on the total number of circuits examined before selecting the final sub-graph are provided. This information is required to assess the claimed sparsity and to evaluate selection bias in the identification of a 'generalized' circuit.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'architecturally modular models' is used without a brief parenthetical gloss or citation clarifying which architectural properties (e.g., specific layer separation or residual-stream modularity) enable the observed preservation of world knowledge.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the current manuscript can be strengthened through revision while defending the existing evidence where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that zero-ablating the identified sub-graph removes a generalized judgment computation (rather than a downstream formatter or router) is central to the interpretation, yet the manuscript does not report controls confirming that the same ablation leaves other mid-to-late MLP behaviors intact (e.g., chain-of-thought reasoning or factual completion without an explicit judgment token). Without these controls, the distinction between upstream evaluator and correlated routing effect remains untested.

    Authors: We agree that additional controls would strengthen the distinction between a judgment-specific computation and a more general routing or formatting effect. The manuscript already reports that zero-ablation preserves world knowledge in modular architectures, providing evidence against a broad disruption of mid-to-late MLP function. However, this does not fully address behaviors such as chain-of-thought reasoning or factual completion without judgment tokens. In the revised manuscript we will add targeted ablation experiments on these tasks to further test specificity. revision: yes

  2. Referee: [Abstract] Abstract / Results: the reported ablation results are described as consistent across three models and two task types, but no quantitative effect sizes, confidence intervals, or details on the total number of circuits examined before selecting the final sub-graph are provided. This information is required to assess the claimed sparsity and to evaluate selection bias in the identification of a 'generalized' circuit.

    Authors: We acknowledge that quantitative details on effect sizes, confidence intervals, and the circuit search process would allow better evaluation of sparsity and potential selection bias. The revised manuscript will include these elements: effect sizes for judgment degradation, confidence intervals from repeated runs, and a description of the total number of candidate circuits examined along with the selection criteria applied. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical causal interventions on external models

full rationale

The paper applies Position-aware Edge Attribution Patching (PEAP) as an intervention method to pre-trained external models (Gemma-3, Qwen2.5, Llama-3) to locate a sparse sub-graph in mid-to-late MLPs. Zero-ablation is reported to collapse judgment scores while preserving world knowledge, with the account framed as a structural decoupling of judging from formatting. No equations, fitted parameters, or self-citations are shown to reduce the central claim to the paper's own inputs by construction; the interventions operate on fixed model weights without self-referential fitting loops or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the validity of the PEAP causal intervention method and on the assumption that the observed ablation effects are specific to judgment rather than general capability disruption.

axioms (1)
  • domain assumption PEAP correctly identifies causally relevant edges for the judgment behavior
    The entire analysis depends on the patching method producing faithful causal attributions.
invented entities (1)
  • Latent Evaluator sub-graph no independent evidence
    purpose: A sparse set of MLP connections hypothesized to compute abstract judgment independent of output format
    Postulated on the basis of ablation results; no independent verification outside the studied models is provided.

pith-pipeline@v0.9.0 · 5766 in / 1300 out tokens · 38943 ms · 2026-05-20T19:26:07.075917+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, and Sebastian Lapuschkin

    Reference-free rating of llm responses via latent information.arXiv, abs/2509.24678. Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, and Sebastian Lapuschkin

  2. [2]

    The Llama 3 Herd of Models

    Circuit insights: Towards interpretability be- yond activations. InThe Fourteenth International Conference on Learning Representations. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models.arXiv, abs/240...

  3. [3]

    Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov

    Are formal and functional linguistic mecha- nisms dissociated in language models?Computa- tional Linguistics, pages 1–41. Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov

  4. [4]

    In First Conference on Language Modeling

    Have faith in faithfulness: Going beyond cir- cuit overlap when finding model mechanisms. In First Conference on Language Modeling. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

  5. [5]

    InInternational Conference on Learning Representations

    Measuring massive multitask language under- standing. InInternational Conference on Learning Representations. Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, and Neel Nanda. 2025. Relp: Faithful and efficient circuit discovery in language models via relevance patching.arXiv, abs/2508.21258. 10 Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ M...

  6. [6]

    InFirst Conference on Language Modeling

    Uncovering intermediate variables in trans- formers using circuit probing. InFirst Conference on Language Modeling. Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yu- jia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu

  7. [7]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Llms-as-judges: A comprehensive sur- vey on llm-based evaluation methods.Preprint, arXiv:2412.05579. Maxime Méloux, Silviu Maniu, François Portet, and Maxime Peyrard. 2025. Everything, everywhere, all at once: Is mechanistic interpretability identifiable? InThe Thirteenth International Conference on Learn- ing Representations. Jack Merullo, Carsten Eickho...

  8. [8]

    Qwen2.5 Technical Report

    Transformer circuit evaluation metrics are not robust. InFirst Conference on Language Modeling. Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiud- ing Sun, Eric Todd, David Bau, and Yonatan Be- linkov. 2026. The quest for the right mediator: Sur- veying mechan...

  9. [9]

    Binary → Rating

    A broad-coverage challenge corpus for sen- tence understanding through inference. InProceed- ings of the 2018 Conference of the North American Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguis- tics. Zhengxuan Wu, A...

  10. [10]

    CoLA_CLASS:...Is this sentence grammati- cally acceptable? Answer: 2.MNLI_CLASS:...The relationship is:

  11. [11]

    STS-B_CLASS:...Are these sentences seman- tically similar? Answer:

  12. [12]

    RewardBench_CLASS:...Is this response helpful and aligned? Answer:

  13. [13]

    scale”, “how

    Yelp_CLASS:...Is this review positive? An- swer: The selection spans meaningfully different label structures – binary (CoLA), three-class (MNLI), ordinal (STS-B,Y elp), and pairwise preference (RewardBench) – and this heterogeneity is essen- tial to the cross-task overlap claim in §3.2: a shared computational trunk that recurs across distinct label spaces...