
arxiv: 2606.28050 · v1 · pith:O652YWPRnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI

Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA

Pith reviewed 2026-06-29 04:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM-as-a-Judge · task asymmetry · self-evaluation · in-context QA · attention analysis · LoRA fine-tuning · mechanistic interpretability

The pith

LLMs generate answers more accurately than they self-evaluate on three of four in-context QA benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that evaluation is easier than generation by having two models judge their own answers in a setting where only the provided context can be used. Generation accuracy exceeds self-evaluation accuracy on SQuAD 2.0, DROP, and HotpotQA, but the pattern reverses on the multi-hop MuSiQue benchmark. Attention analysis shows that evaluation attends to the context three to five times less than generation does and almost ignores the candidate answer. LoRA fine-tuning on each task separately indicates that the asymmetry is not a training artifact: generation fine-tuning induces over-acceptance, and evaluation fine-tuning degrades generation.

Core claim

In a controlled in-context QA setting where each model judges its own answer, generated from the supplied passage, generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, while the reverse holds on the multi-hop MuSiQue benchmark. Attention analysis shows that evaluation attends far less to the context than generation does and barely attends to the candidate answer, and LoRA fine-tuning for one task produces negative transfer to the other.

What carries the argument

The controlled self-evaluation pipeline in which the model judges its own generated answer using only the provided context passage.
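A minimal sketch of that pipeline may make the setup concrete. It is an illustration under stated assumptions, not the paper's code: the prompt wording, the `Instance` type, the `stub_llm` stand-in, and the exact-match oracle are all hypothetical (the paper's oracle is itself a model, L∗).

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Instance:
    context: str   # c: the only permitted information source
    question: str  # q
    gold: str      # a*: gold answer, seen only by the oracle


def gen_prompt(inst: Instance) -> str:
    # Hypothetical generation prompt (T_gen): answer from (c, q).
    return (
        "Answer using only the passage.\n"
        f"Passage: {inst.context}\nQuestion: {inst.question}\nAnswer:"
    )


def eval_prompt(inst: Instance, answer: str) -> str:
    # Hypothetical self-evaluation prompt (T_eval): judge a given (c, q, a).
    return (
        "Using only the passage, is the candidate answer correct? "
        "Reply yes or no.\n"
        f"Passage: {inst.context}\nQuestion: {inst.question}\n"
        f"Candidate answer: {answer}\nVerdict:"
    )


def run_instance(
    inst: Instance,
    llm: Callable[[str], str],
    oracle: Callable[[str, str], bool],
) -> Tuple[str, bool, bool]:
    """Return (answer, self-verdict, oracle label y*) for one instance."""
    answer = llm(gen_prompt(inst)).strip()
    # The same model judges the answer it just produced.
    verdict = llm(eval_prompt(inst, answer)).strip().lower().startswith("yes")
    y_star = oracle(answer, inst.gold)
    return answer, verdict, y_star


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to exercise the control flow.
    return "yes" if "Candidate answer:" in prompt else "Paris"


def exact_match(answer: str, gold: str) -> bool:
    # Simplified oracle; an assumption replacing the paper's oracle model.
    return answer.strip().lower() == gold.strip().lower()
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library; a real run would replace `stub_llm` with an actual model call.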

If this is right

  • Self-evaluation pipelines cannot be assumed to be easier or more reliable than the generation they are meant to improve.
  • Multi-hop reasoning tasks can invert the observed asymmetry between generation and evaluation.
  • Attention allocation during evaluation differs sharply from attention during generation.
  • Fine-tuning one capability produces negative transfer to the other rather than neutral or positive transfer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Self-improvement loops that alternate generation and self-judgment may accumulate errors faster than expected on single-hop tasks.
  • Separate specialized heads or modules for judgment may be required rather than relying on the same weights used for generation.
  • The attention deficit during evaluation suggests that adding explicit context-retrieval steps inside the judge could reduce the asymmetry.

Load-bearing premise

That restricting the model to the supplied context and its own generated answer removes all confounds and isolates a genuine difference between the two tasks.

What would settle it

Running the same four-benchmark protocol on additional models and finding that self-evaluation accuracy exceeds generation accuracy on every benchmark would falsify the reported asymmetry.

Figures

Figures reproduced from arXiv: 2606.28050 by Sambaran Bandyopadhyay.

Figure 1: Core task-asymmetry pipeline. Model L is tested on two tasks per instance: generation (Tgen) produces answer a from (c, q); self-evaluation (Teval) judges whether a is correct given (c, q, a). Oracle L∗ scores a against gold a∗, yielding y∗ as ground truth for both metrics. Dashed arrows are data-passing operations with no LLM call. ∆ = EA − GA is the primary asymmetry measure. The mechanistic and tr…
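The asymmetry measure in the caption can be sketched in a few lines. This assumes GA is the fraction of generated answers the oracle marks correct and EA is the fraction of self-evaluation verdicts that agree with the oracle label y∗; the caption states only that y∗ is the ground truth for both metrics, so these definitions and the function name `asymmetry` are assumptions.

```python
from typing import Sequence, Tuple


def asymmetry(
    y_star: Sequence[bool], verdicts: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (GA, EA, delta) with delta = EA - GA.

    y_star[i]   : oracle label, True if the generated answer is correct.
    verdicts[i] : the model's own judgement of that same answer.
    """
    if len(y_star) != len(verdicts) or not y_star:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(y_star)
    ga = sum(y_star) / n                                       # generation accuracy
    ea = sum(v == y for v, y in zip(verdicts, y_star)) / n     # evaluation accuracy
    return ga, ea, ea - ga
```

Under these definitions a negative delta means the model generates correct answers more often than it judges its own answers correctly, which is the pattern reported on three of the four benchmarks.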
Figure 2: Mean last-token attention fraction directed to context (c) and candidate answer (a) for Tgen and Teval, averaged over layers 24–31 and 184 jointly verified Llama-3.1-8B-Instruct samples. Teval consistently de-attends to context by 3–5× relative to Tgen and allocates negligible attention (0.3–0.5%) to the candidate answer it is judging.
the evaluation task: rather than re-reading the passage to verify the …
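One way to compute a last-token attention fraction like the one in Figure 2 is sketched below. It assumes the attention weights for a single sample are already stacked into one array of shape (layers, heads, seq_len, seq_len) with softmax-normalised rows, and that heads are averaged before layers; the averaging order, the function name, and the span convention are assumptions, not details from the paper.

```python
import numpy as np


def last_token_fraction(attn: np.ndarray, span: tuple, layers) -> float:
    """Fraction of the last token's attention that lands on span [start, end).

    attn   : array of shape (num_layers, num_heads, seq_len, seq_len).
    span   : (start, end) token positions of the context or candidate answer.
    layers : iterable of layer indices to average over, e.g. range(24, 32).
    """
    selected = attn[list(layers)]              # (L, H, seq, seq)
    last_row = selected[:, :, -1, :]           # attention paid by the final token
    per_layer = last_row.mean(axis=1)          # average over heads -> (L, seq)
    start, end = span
    mass = per_layer[:, start:end].sum(axis=1) # attention mass on the span, per layer
    return float(mass.mean())                  # average over the selected layers
```

Calling this once with the context span and once with the candidate-answer span, for both the generation and evaluation prompts, would give the four quantities the figure compares.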
Figure 3: Per-token attention ratio on answer-mention tokens vs. non-mention context tokens. Tgen exhibits sharp needle-in-haystack lookup behaviour; Teval does not.
22.66) where the answer span must be located within long, distractor-heavy contexts. Teval collapses this ratio to 1.0–2.1×. Interpretation. Tgen, not Teval, is the task that performs the answer-locating lookup. This asymmetry provides a complementary …
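The per-token ratio in Figure 3 can be read as mean attention per answer-mention token divided by mean attention per non-mention context token. The sketch below implements that reading for one attention row (for example, the head- and layer-averaged last-token row); the function name and input layout are assumptions made for illustration.

```python
import numpy as np


def mention_ratio(attn_row, context_span: tuple, mention_positions) -> float:
    """Mean attention on answer-mention tokens over mean attention on the
    remaining context tokens.

    attn_row          : 1-D attention weights over all token positions.
    context_span      : (start, end) positions of the context passage.
    mention_positions : positions inside the context where the answer appears.
    """
    row = np.asarray(attn_row, dtype=float)
    start, end = context_span
    ctx = np.arange(start, end)
    is_mention = np.isin(ctx, list(mention_positions))
    if not is_mention.any() or is_mention.all():
        raise ValueError("need both mention and non-mention context tokens")
    on_mention = row[ctx][is_mention].mean()
    off_mention = row[ctx][~is_mention].mean()
    return float(on_mention / off_mention)
```

A ratio near 1 means the answer tokens receive no more attention per token than the rest of the passage; a large ratio corresponds to the lookup behaviour the caption attributes to Tgen.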
Original abstract

LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model judges the answer it generated, removing the parametric-knowledge confound of open-domain comparisons. Across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, evaluation is not uniformly easier: generation accuracy exceeds self-evaluation on three of four, with multi-hop MuSiQue the exception. Attention analysis reveals why: evaluation attends to context 3–5× less than generation does and barely reads the candidate answer. LoRA fine-tuning confirms the asymmetry is not a training artifact: generation fine-tuning induces over-acceptance and evaluation fine-tuning degrades generation. These findings challenge core assumptions in self-evaluation pipelines.
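For readers unfamiliar with LoRA, the following sketch shows the low-rank update it trains: the frozen weight W is augmented with a product of two small matrices, scaled by alpha / r. The rank, alpha, and initialisation values here are illustrative assumptions; the abstract does not give the paper's LoRA settings, and the class name `LoRALinear` is hypothetical.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (bias omitted)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pretrained W
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + (alpha / r) * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self) -> int:
        # Only A and B are updated during fine-tuning.
        return self.A.size + self.B.size
```

Because B starts at zero, the adapted layer initially reproduces the base layer exactly, so any later change in generation or evaluation behaviour comes from the task-specific update alone.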

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims that in a controlled in-context QA setting, where a context passage is the sole information source and each model judges the answer it generated, evaluation is not uniformly easier than generation. Across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, generation accuracy exceeds self-evaluation on three of four, with multi-hop MuSiQue the exception. Attention analysis shows evaluation attends to context 3–5× less than generation and barely reads the candidate answer. LoRA fine-tuning confirms the asymmetry is not a training artifact: generation fine-tuning induces over-acceptance while evaluation fine-tuning degrades generation.

Significance. If the results hold, they challenge core assumptions underlying LLM-as-a-Judge and self-evaluation pipelines. The controlled in-context design removes the parametric-knowledge confound, directly isolating task asymmetry. Credit is due for the multi-benchmark empirical design, mechanistic attention analysis, and LoRA transfer experiments that provide explanatory and falsifiable support. The work has clear implications for improving evaluation reliability in LLMs, particularly for multi-hop reasoning.

minor comments (1)
  1. Abstract: the two models are not named; specifying them (and any relevant hyperparameters) in the methods section would aid reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its significance for LLM-as-a-Judge pipelines, and the recommendation for minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper reports purely empirical results from controlled in-context QA experiments across four benchmarks, attention analysis, and LoRA fine-tuning ablations. No equations, derivations, first-principles predictions, or fitted parameters are presented as outputs; the central claims rest on direct accuracy comparisons that isolate the generation-vs-evaluation asymmetry without reducing to self-definitional inputs or self-citation chains. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on experimental outcomes from four standard QA benchmarks and two LLMs rather than on mathematical derivations, fitted parameters, or new postulated entities.

axioms (1)
  • domain assumption Attention weights in transformer models reflect the relative importance the model assigns to different input tokens
    Invoked to interpret why evaluation performance differs from generation based on the attention analysis.


