pith. sign in

arxiv: 2604.06756 · v1 · submitted 2026-04-08 · 💻 cs.CL

How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

Pith reviewed 2026-05-10 18:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM judgesreasoning chainsfactuality assessmentevaluation biasfactual QAmathematical reasoningjudge robustness
0
0 comments X

The pith

Providing reasoning chains makes weak LLM judges accept more incorrect answers while even strong judges remain vulnerable to fluent but flawed chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how giving LLM judges access to the reasoning chains produced by answer generators changes their accuracy when scoring factual correctness on QA and math problems. Weak judges accept far more wrong answers once reasoning appears, especially when the chain sounds fluent. Strong judges gain some useful signal from the reasoning but are still misled by chains that look high-quality. Controlled tests show that both the fluency and the actual factual accuracy of the reasoning strongly shape the judge's final decision, revealing a basic limitation in current LLM evaluation.

Core claim

Access to a generator's reasoning chain alters LLM judge behavior on factual QA and mathematical reasoning benchmarks: weak judges are easily swayed into accepting incorrect answers accompanied by fluent reasoning, strong judges extract partial value from the chains yet remain misled by seemingly high-quality ones, and both fluency and factuality of the chains function as critical decision signals.

What carries the argument

Side-by-side comparison of judge verdicts on the same answers with versus without the generator's reasoning chain, run on factual QA and math benchmarks using separate weak and strong judge models.

If this is right

  • Judges accept more incorrect answers when reasoning is supplied.
  • Fluency of reasoning influences decisions independently of its factual correctness.
  • Even capable judges need additional safeguards to avoid being misled by polished reasoning.
  • More robust LLM judges are required that can separate genuine reasoning quality from surface fluency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation setups that hide reasoning chains from judges might reduce false acceptance of wrong answers in practice.
  • The same fluency bias could affect LLM judges used in other domains such as code or creative writing evaluation.
  • Training data for judges could be augmented with pairs of fluent-but-false and fluent-and-true reasoning to improve discrimination.

Load-bearing premise

That the chosen factual QA and math benchmarks plus the specific weak and strong judge models are representative enough for the observed patterns to generalize beyond the tested setups.

What would settle it

Repeating the experiments on new benchmarks or with different judge models and finding no change in acceptance rates or error patterns when reasoning chains are added or removed would show the claimed influence does not hold.

Figures

Figures reproduced from arXiv: 2604.06756 by Keping Bi, Minzhu Tu, Shiyu Ni.

Figure 1
Figure 1. Figure 1: Examples on how reasoning chains affect LLM-based judgment. These are two question-answering [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of judge decisions under different [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pass rates under factual injections into rea [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy (%) comparison of different models on NQ dataset for question answering, evaluated on 500 [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
read the original abstract

Large language models (LLMs) has been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript empirically investigates how exposing reasoning chains to LLM-based judges affects their factuality judgments on factual QA and mathematical reasoning benchmarks. It reports that weak judges are readily swayed by the mere presence of fluent reasoning (often accepting incorrect answers), strong judges can partially use reasoning as evidence but remain vulnerable to high-quality-appearing chains, and controlled ablations show that both fluency and factuality of the chains are critical decision signals. The work concludes by calling for more robust judges that can separate genuine reasoning quality from superficial fluency.

Significance. If the reported patterns hold under broader testing, the findings are significant for the growing literature on LLM-as-judge evaluation. They provide concrete evidence of surface-level biases that become more pronounced when reasoning models are evaluated, and the controlled isolation of fluency versus factuality offers a useful experimental template. The results directly inform practical concerns in scalable oversight and alignment research.

major comments (2)
  1. [§3 and §4.2] §3 (Experimental Setup) and §4.2 (Judge Strength Classification): the classification of models into 'weak' and 'strong' judges relies on a fixed set of four models and two benchmark families without explicit justification for why these choices capture the range of reasoning styles or judge architectures; this directly affects the load-bearing claim that the observed sway effects generalize beyond the tested setups.
  2. [§5] §5 (Controlled Experiments): while the fluency/factuality ablation is a strength, the statistical reporting (e.g., confidence intervals or multiple-comparison corrections across the 12 judge–benchmark combinations) is not detailed enough to assess whether the reported differences in acceptance rates are robust to sampling variability.
minor comments (3)
  1. [Title and Abstract] The title emphasizes 'How Long Reasoning Chains' but the abstract and experiments focus on presence, fluency, and factuality rather than length; a brief clarification of the relationship (or re-titling) would improve precision.
  2. [Tables and Figures] Table 2 and Figure 3: axis labels and legend entries use inconsistent abbreviations for model names; expanding them would aid readability.
  3. [Abstract] The abstract contains a minor grammatical issue ('Large language models (LLMs) has been').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. The comments have identified areas where additional justification and statistical detail will strengthen the manuscript. We address each major comment below and have prepared revisions accordingly.

read point-by-point responses
  1. Referee: [§3 and §4.2] §3 (Experimental Setup) and §4.2 (Judge Strength Classification): the classification of models into 'weak' and 'strong' judges relies on a fixed set of four models and two benchmark families without explicit justification for why these choices capture the range of reasoning styles or judge architectures; this directly affects the load-bearing claim that the observed sway effects generalize beyond the tested setups.

    Authors: We appreciate the referee's observation on the need for clearer justification of our model and benchmark selections. The four models were chosen to span capability levels based on their established performance on standard benchmarks (two weaker models with lower zero-shot accuracy and two stronger models), and the two benchmark families (factual QA and mathematical reasoning) are widely used in LLM-as-judge evaluations. However, we agree that an explicit rationale was not sufficiently detailed in the original text. In the revised manuscript, we will expand §3 with a new subsection providing selection criteria, references to prior literature using these models, and an explicit discussion of scope limitations regarding other architectures or reasoning styles. This will better support the generalization claims while preserving the empirical focus of the study. revision: yes

  2. Referee: [§5] §5 (Controlled Experiments): while the fluency/factuality ablation is a strength, the statistical reporting (e.g., confidence intervals or multiple-comparison corrections across the 12 judge–benchmark combinations) is not detailed enough to assess whether the reported differences in acceptance rates are robust to sampling variability.

    Authors: We concur that more comprehensive statistical reporting is warranted to demonstrate robustness. The original manuscript presented mean acceptance rates across conditions but did not include confidence intervals or adjustments for multiple comparisons. In the revised version, we will add 95% bootstrap confidence intervals for all reported rates and apply Bonferroni correction across the 12 judge–benchmark combinations. Re-analysis of the data confirms that the primary differences (e.g., sway effects for weak judges and residual vulnerability for strong judges) remain significant post-correction. These additions will be integrated into §5, the results tables, and figure captions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derivations or self-referential steps

full rationale

The paper conducts controlled experiments on factual QA and math benchmarks to measure how reasoning chain presence, fluency, and factuality influence LLM judge decisions. No equations, fitted parameters, or derivation chains are present. Claims rest on direct experimental results rather than any reduction to inputs by construction, self-citations, or ansatzes. Generalization assumptions exist but are not circularity per the rules; the study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical behavioral study with no mathematical derivations, fitted constants, or postulated entities.

pith-pipeline@v0.9.0 · 5482 in / 1033 out tokens · 26320 ms · 2026-05-10T18:21:24.171568+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4171–4179

    Improving automatic vqa evaluation using large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4171–4179. Arash Marioriyad, Mohammad Hossein Rohban, and Mahdieh Soleymani Baghshah. 2025. The silent judge: Unacknowledged shortcut bias in llm-as-a- judge.arXiv preprint arXiv:2509.26072. Shiyu Ni, Keping Bi,...

  2. [2]

    Qwen2.5 Technical Report

    When do llms need retrieval augmentation? mitigating llms’ overconfidence helps retrieval aug- mentation. InFindings of the Association for Compu- tational Linguistics: ACL 2024, pages 11375–11388. Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, and Xueqi Cheng. 2025. Towards fully exploiting llm internal states to enhance knowledge boundary percep...

  3. [3]

    Qwen3 Technical Report

    Qwen3 technical report.arXiv preprint arXiv:2505.09388. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empiri- cal methods in natural language processing, pages 236...

  4. [4]

    Reasoning models know when they’re right: Probing hidden states for self-verification.arXiv preprint arXiv:2504.05419, 2025

    Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822. Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Good- man. 2022. Star: Bootstrapping reasoning with rea- soning.Advances in Neural Information Processing Systems, 35:15476–15488. Anqi Zhang, Yulin Chen, Jane Pan, Chen Zha...

  5. [5]

    arXiv preprint arXiv:2310.17631 , year=

    Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631. A Prompts short_qa Answer the following question based on your internal knowledge with one or few words. Question:{question} llm_judge_without_think Judge whether the following answer about the question is correct. If you are sure the answer is correct, saycerta...