How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Pith reviewed 2026-05-10 18:21 UTC · model grok-4.3
The pith
Providing reasoning chains makes weak LLM judges accept more incorrect answers while even strong judges remain vulnerable to fluent but flawed chains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Access to a generator's reasoning chain alters LLM judge behavior on factual QA and mathematical reasoning benchmarks: weak judges are easily swayed into accepting incorrect answers accompanied by fluent reasoning, strong judges extract partial value from the chains yet remain misled by seemingly high-quality ones, and both fluency and factuality of the chains function as critical decision signals.
What carries the argument
Side-by-side comparison of judge verdicts on the same answers with versus without the generator's reasoning chain, run on factual QA and math benchmarks using separate weak and strong judge models.
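The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Record` fields and the `false_acceptance_rates` helper are hypothetical names, and it assumes each answer carries a ground-truth label plus one accept/reject verdict per condition.

```python
from dataclasses import dataclass

@dataclass
class Record:
    # Hypothetical schema; the paper's actual data format is not given here.
    answer_is_correct: bool      # ground-truth label for the answer
    accept_without_chain: bool   # judge verdict when shown the answer only
    accept_with_chain: bool      # judge verdict when shown answer + reasoning chain

def false_acceptance_rates(records):
    """Return (rate_without_chain, rate_with_chain) over incorrect answers only."""
    wrong = [r for r in records if not r.answer_is_correct]
    if not wrong:
        raise ValueError("no incorrect answers to evaluate")
    n = len(wrong)
    without = sum(r.accept_without_chain for r in wrong) / n
    with_chain = sum(r.accept_with_chain for r in wrong) / n
    return without, with_chain
```

Restricting to incorrect answers isolates the failure mode the review highlights: a gap between the two rates measures how often supplying the chain flips a rejection into an acceptance.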
If this is right
- Judges accept more incorrect answers when reasoning is supplied.
- Fluency of reasoning influences decisions independently of its factual correctness.
- Even capable judges need additional safeguards to avoid being misled by polished reasoning.
- More robust LLM judges are required that can separate genuine reasoning quality from surface fluency.
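One way to read the claim that fluency matters independently of factual correctness is as a main effect in a two-by-two grid of chain conditions. The sketch below assumes such a grid (fluent or not, factual or not); the review does not describe the paper's actual ablation design, and the `main_effects` helper is hypothetical.

```python
def main_effects(rates):
    """Compute (fluency_effect, factuality_effect) from a 2x2 grid.

    `rates` maps (is_fluent, is_factual) -> acceptance rate in [0, 1].
    Each effect is the average change in acceptance when that factor
    switches from False to True, averaged over the other factor.
    """
    fluency = (rates[(True, True)] + rates[(True, False)]
               - rates[(False, True)] - rates[(False, False)]) / 2
    factuality = (rates[(True, True)] + rates[(False, True)]
                  - rates[(True, False)] - rates[(False, False)]) / 2
    return fluency, factuality
```

A nonzero fluency effect while factuality is held fixed is what "independently" would mean under this assumed design.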
Where Pith is reading between the lines
- Evaluation setups that hide reasoning chains from judges might reduce false acceptance of wrong answers in practice.
- The same fluency bias could affect LLM judges used in other domains such as code or creative writing evaluation.
- Training data for judges could be augmented with pairs of fluent-but-false and fluent-and-true reasoning to improve discrimination.
Load-bearing premise
That the chosen factual QA and math benchmarks plus the specific weak and strong judge models are representative enough for the observed patterns to generalize beyond the tested setups.
What would settle it
Repeating the experiments on new benchmarks or with different judge models and finding no change in acceptance rates or error patterns when reasoning chains are added or removed would show the claimed influence does not hold.
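Because the same answers are judged under both conditions, "no change in acceptance rates" can be checked with a paired test. The sketch below uses an exact McNemar test as one reasonable choice; the source does not say which test, if any, the authors used, and the function name is hypothetical.

```python
from math import comb

def mcnemar_exact_p(only_with_chain, only_without_chain):
    """Two-sided exact McNemar p-value for paired accept/reject verdicts.

    only_with_chain:    answers accepted only when the chain is shown
    only_without_chain: answers accepted only when the chain is hidden
    Answers judged the same way in both conditions do not enter the test.
    """
    n = only_with_chain + only_without_chain
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of any change
    k = min(only_with_chain, only_without_chain)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A large p-value on new benchmarks or judges would be consistent with the "no influence" outcome described above, though it would not by itself prove the absence of an effect.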
Original abstract
Large language models (LLMs) has been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically investigates how exposing reasoning chains to LLM-based judges affects their factuality judgments on factual QA and mathematical reasoning benchmarks. It reports that weak judges are readily swayed by the mere presence of fluent reasoning (often accepting incorrect answers), strong judges can partially use reasoning as evidence but remain vulnerable to high-quality-appearing chains, and controlled ablations show that both fluency and factuality of the chains are critical decision signals. The work concludes by calling for more robust judges that can separate genuine reasoning quality from superficial fluency.
Significance. If the reported patterns hold under broader testing, the findings are significant for the growing literature on LLM-as-judge evaluation. They provide concrete evidence of surface-level biases that become more pronounced when reasoning models are evaluated, and the controlled isolation of fluency versus factuality offers a useful experimental template. The results directly inform practical concerns in scalable oversight and alignment research.
major comments (2)
- [§3 and §4.2] §3 (Experimental Setup) and §4.2 (Judge Strength Classification): the classification of models into 'weak' and 'strong' judges relies on a fixed set of four models and two benchmark families without explicit justification for why these choices capture the range of reasoning styles or judge architectures; this directly affects the load-bearing claim that the observed sway effects generalize beyond the tested setups.
- [§5] §5 (Controlled Experiments): while the fluency/factuality ablation is a strength, the statistical reporting (e.g., confidence intervals or multiple-comparison corrections across the 12 judge–benchmark combinations) is not detailed enough to assess whether the reported differences in acceptance rates are robust to sampling variability.
minor comments (3)
- [Title and Abstract] The title emphasizes 'How Long Reasoning Chains' but the abstract and experiments focus on presence, fluency, and factuality rather than length; a brief clarification of the relationship (or re-titling) would improve precision.
- [Tables and Figures] Table 2 and Figure 3: axis labels and legend entries use inconsistent abbreviations for model names; expanding them would aid readability.
- [Abstract] The abstract contains a minor grammatical issue ('Large language models (LLMs) has been').
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. The comments have identified areas where additional justification and statistical detail will strengthen the manuscript. We address each major comment below and have prepared revisions accordingly.
- Referee: [§3 and §4.2] §3 (Experimental Setup) and §4.2 (Judge Strength Classification): the classification of models into 'weak' and 'strong' judges relies on a fixed set of four models and two benchmark families without explicit justification for why these choices capture the range of reasoning styles or judge architectures; this directly affects the load-bearing claim that the observed sway effects generalize beyond the tested setups.
  Authors: We appreciate the referee's observation on the need for clearer justification of our model and benchmark selections. The four models were chosen to span capability levels based on their established performance on standard benchmarks (two weaker models with lower zero-shot accuracy and two stronger models), and the two benchmark families (factual QA and mathematical reasoning) are widely used in LLM-as-judge evaluations. However, we agree that an explicit rationale was not sufficiently detailed in the original text. In the revised manuscript, we will expand §3 with a new subsection providing selection criteria, references to prior literature using these models, and an explicit discussion of scope limitations regarding other architectures or reasoning styles. This will better support the generalization claims while preserving the empirical focus of the study.
  Revision: yes
- Referee: [§5] §5 (Controlled Experiments): while the fluency/factuality ablation is a strength, the statistical reporting (e.g., confidence intervals or multiple-comparison corrections across the 12 judge–benchmark combinations) is not detailed enough to assess whether the reported differences in acceptance rates are robust to sampling variability.
  Authors: We concur that more comprehensive statistical reporting is warranted to demonstrate robustness. The original manuscript presented mean acceptance rates across conditions but did not include confidence intervals or adjustments for multiple comparisons. In the revised version, we will add 95% bootstrap confidence intervals for all reported rates and apply Bonferroni correction across the 12 judge–benchmark combinations. Re-analysis of the data confirms that the primary differences (e.g., sway effects for weak judges and residual vulnerability for strong judges) remain significant post-correction. These additions will be integrated into §5, the results tables, and figure captions.
  Revision: yes
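The two statistical additions promised in the rebuttal can be sketched as follows. This is an illustrative, standard-library-only version and not the authors' analysis code: it assumes a percentile bootstrap over 0/1 verdicts, and the function names, resample count, and seed are hypothetical defaults.

```python
import random

def bootstrap_ci(verdicts, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an acceptance rate.

    `verdicts` is a list of 0/1 judge decisions (1 = accepted).
    """
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample with replacement and record the acceptance rate each time.
    means = sorted(
        sum(rng.choices(verdicts, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni-adjusted threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With the 12 judge–benchmark combinations mentioned in the report, the adjusted threshold is 0.05 / 12, roughly 0.0042, so a per-comparison p-value of 0.01 would no longer count as significant.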
Circularity Check
No circularity: purely empirical observations with no derivations or self-referential steps
The paper conducts controlled experiments on factual QA and math benchmarks to measure how reasoning chain presence, fluency, and factuality influence LLM judge decisions. No equations, fitted parameters, or derivation chains are present. Claims rest on direct experimental results rather than on self-citations, ansatzes, or any reduction to the inputs by construction. The paper does make generalization assumptions, but under the check's rules these do not count as circularity; the study is self-contained and evaluated against external benchmarks.