A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Pith reviewed 2026-05-18 13:40 UTC · model grok-4.3
The pith
Single-pass LLMs face a hard accuracy upper bound in multi-hop QA that collapses once task complexity exceeds their finite capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that single-pass LLM reasoning in multi-hop QA obeys a Fano-style accuracy upper bound derived by modeling the generation as a noisy communication channel whose capacity is finite. Accuracy therefore inevitably declines once the amount of interdependent evidence that must be combined exceeds this capacity, regardless of prompt engineering.
What carries the argument
The Fano-style accuracy upper bound obtained by casting single-pass LLM output as transmission through a noisy channel with limited capacity for reliable evidence integration.
Load-bearing premise
Single-pass LLM reasoning can be modeled as a noisy communication channel with finite capacity to which Fano's inequality applies directly.
What would settle it
A controlled experiment in which single-pass accuracy stays high while the number of required evidence hops is increased past the estimated capacity point would falsify the predicted collapse.
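One way to make "stays high" operational is a one-sided statistical check: the prediction is contradicted at a given hop count only if the observed accuracy is confidently above the predicted ceiling, not merely above it by sampling noise. The sketch below is an assumption of this summary rather than a procedure from the paper; the function names are hypothetical, and the Wilson score interval with z = 1.96 is one conventional choice among several.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def violates_ceiling(correct: int, total: int, ceiling: float, z: float = 1.96) -> bool:
    """True if observed accuracy is confidently above the predicted ceiling.

    `ceiling` is the predicted accuracy upper bound at one hop count
    (hypothetical input; how it is estimated is outside this sketch).
    """
    return wilson_lower_bound(correct, total, z) > ceiling
```

A run that returns True at hop counts past the estimated capacity point would count as evidence against the predicted collapse; a run that returns False is merely consistent with it.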
Original abstract
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods. InfoQA is available on GitHub: https://github.com/KaiyangWan/InfoQA.
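The multi-call idea in the abstract (one call per hop, with earlier reasoning traces pruned) can be illustrated with a minimal sketch. Everything here is hypothetical: `multi_call_answer`, the `{prev}` placeholder, and the `llm` callable are illustrative names, the decomposition into sub-questions is assumed to be given by hand, and the actual InfoQA implementation in the linked repository may differ substantially.

```python
from typing import Callable, List


def multi_call_answer(
    sub_questions: List[str],
    evidence: str,
    llm: Callable[[str], str],
) -> str:
    """Answer a chain of sub-questions with one LLM call per hop.

    Each sub-question may contain the placeholder '{prev}', which is
    filled with the previous hop's answer. Only that short answer is
    carried forward; earlier prompts and reasoning traces are dropped,
    so the prompt size stays roughly constant across hops.
    """
    prev = ""
    for template in sub_questions:
        question = template.replace("{prev}", prev)
        prompt = f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer briefly:"
        prev = llm(prompt).strip()
    return prev
```

Any function that maps a prompt string to a reply string can be passed as `llm`, which makes the loop easy to exercise with a stub before wiring in a real model client.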
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to derive a Fano-style accuracy upper bound for single-pass LLM reasoning in multi-hop QA, showing that accuracy collapses when task complexity exceeds finite model capacity. It introduces the InfoQA multi-call framework with capacity-aware decomposition and pruning, and validates both the bound and framework on a new noise-rich benchmark where model behavior aligns with predicted capacity curves.
Significance. If the bound is rigorously derived and the channel model holds, the work supplies a theoretical ceiling that explains single-pass failures in MHQA and motivates capacity-aware multi-step methods. The open-source InfoQA implementation and the new benchmark are concrete contributions that could be reused even if the bound requires refinement.
major comments (2)
- [§3] §3 (Fano-style bound derivation): the mapping of autoregressive single-pass LLM generation to a memoryless noisy channel with finite capacity C is asserted without explicit justification that attention patterns or token-by-token conditioning preserve the standard Fano assumptions; the error-probability form P_e ≥ (H(X|Y)−1)/log|X| therefore rests on an unverified modeling step that is load-bearing for the central collapse claim.
- [§4] §4 (experimental validation): the reported alignment between observed accuracy and the predicted capacity curves is presented without ablation of the effective-capacity parameter or controls that isolate the bound from prompt-engineering effects; this leaves open whether the curves confirm the theoretical ceiling or merely reflect empirical scaling.
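For reference, the error-probability form quoted in the first major comment is the usual weakening of the standard Fano inequality. The derivation below is textbook material, stated for a Markov chain X → Y → X̂ with logarithms in base 2; it is not taken from the paper's §3, and the symbols follow the referee's notation.

```latex
\begin{align}
H(X \mid Y) &\le h(P_e) + P_e \log\bigl(|\mathcal{X}| - 1\bigr) \\
            &\le 1 + P_e \log |\mathcal{X}| \\
\Longrightarrow \quad P_e &\ge \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}
\end{align}
```

Here P_e = Pr[X̂ ≠ X] and h is the binary entropy function. Writing P_e = 1 − Acc and using h(P_e) = h(Acc), the first line has the same shape as the left-hand side of the Theorem 1 passage quoted further down this page.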
minor comments (2)
- Notation for the effective capacity C is introduced without a clear operational definition or estimation procedure from model internals.
- The benchmark construction paragraph should include explicit statistics on hop count, evidence dispersion, and injected noise levels to allow reproducibility.
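The appendix excerpt quoted in the reference graph describes C as the effective single-pass capacity H(Y|Q, C) realized by the decoding policy. One hypothetical way to attach a number to that quantity, offered only as a sketch and not as the authors' procedure, is a plug-in entropy estimate over repeated samples of the model output for a fixed input. The function name is an assumption of this summary.

```python
import math
from collections import Counter
from typing import Iterable


def empirical_entropy_bits(samples: Iterable[str]) -> float:
    """Plug-in (maximum-likelihood) entropy estimate in bits.

    `samples` are repeated model outputs for one fixed input; averaging
    this value over many inputs would approximate a conditional entropy.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sample")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

The plug-in estimator is biased downward when the number of samples is small relative to the number of distinct outputs, so values obtained this way are better read as rough lower estimates.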
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. These have highlighted important areas where additional clarification and controls can strengthen the presentation of both the theoretical bound and its empirical validation. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.
Point-by-point responses
- Referee: [§3] §3 (Fano-style bound derivation): the mapping of autoregressive single-pass LLM generation to a memoryless noisy channel with finite capacity C is asserted without explicit justification that attention patterns or token-by-token conditioning preserve the standard Fano assumptions; the error-probability form P_e ≥ (H(X|Y)−1)/log|X| therefore rests on an unverified modeling step that is load-bearing for the central collapse claim.
  Authors: We agree that the modeling step merits more explicit justification. In the revised §3 we will expand the derivation to clarify that we treat the full single-pass input-to-output mapping (question plus evidence to generated answer) as a discrete memoryless channel whose capacity C is bounded by the model’s finite parameters and context length. Under this abstraction the standard Fano inequality applies directly to the overall channel, independent of the internal autoregressive conditioning or attention patterns, which are subsumed into the channel transition probabilities. We will add a short paragraph with supporting references from the information-theoretic literature on neural channels to make this modeling choice transparent while preserving the collapse claim.
  Revision: yes
- Referee: [§4] §4 (experimental validation): the reported alignment between observed accuracy and the predicted capacity curves is presented without ablation of the effective-capacity parameter or controls that isolate the bound from prompt-engineering effects; this leaves open whether the curves confirm the theoretical ceiling or merely reflect empirical scaling.
  Authors: We acknowledge that stronger controls are needed to isolate the capacity effect. In the revised manuscript we will augment §4 with (i) an ablation that varies effective capacity by using models of different sizes and by systematically truncating context length, and (ii) a comparison against standard prompt-engineering baselines (e.g., chain-of-thought without capacity-aware decomposition) to show that the observed alignment with the theoretical curves is not explained by prompting alone. These additions will be presented with new figures and tables so that readers can directly assess whether the data support the predicted capacity ceiling.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper applies the standard Fano inequality from information theory to a modeled noisy-channel view of single-pass LLM reasoning in MHQA, yielding an accuracy upper bound that depends on task complexity exceeding finite capacity C. This step invokes an external theorem rather than defining the bound in terms of itself, fitting parameters to the target data and renaming them as predictions, or relying on load-bearing self-citations whose content reduces to the present claim. The subsequent InfoQA framework is constructed on top of the bound but does not retroactively alter the bound's derivation; experimental alignment with capacity curves is presented as validation, not as the source of the bound. The derivation therefore remains self-contained against the external benchmark of Fano's inequality.
Axiom & Free-Parameter Ledger
free parameters (1)
- effective model capacity
axioms (1)
- Fano's inequality (standard math)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/InformationTheory or Cost.FunctionalEquation: reality_from_one_distinction or washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 1 (A Fano-Style Accuracy Upper Bound... h(Acc) + (1−Acc) log(|A|−1) ≥ β − C
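To see how the Theorem 1 inequality turns into an accuracy ceiling, it can be inverted numerically. The sketch below makes several assumptions that this page does not confirm: logarithms are in base 2, |A| is a finite answer-set size, and β is read as the task's information demand in bits, so that β − C is the shortfall the left-hand side must cover. The function names are hypothetical. Because h(Acc) + (1−Acc) log(|A|−1) decreases from log|A| at Acc = 1/|A| to 0 at Acc = 1, the largest feasible accuracy can be found by bisection.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_lhs(acc: float, num_answers: int) -> float:
    """Left-hand side of the quoted bound: h(Acc) + (1 - Acc) * log2(|A| - 1)."""
    return binary_entropy(acc) + (1.0 - acc) * math.log2(num_answers - 1)


def accuracy_ceiling(beta: float, capacity: float, num_answers: int,
                     tol: float = 1e-9) -> float:
    """Largest Acc in [1/|A|, 1] with fano_lhs(Acc) >= beta - capacity."""
    if num_answers < 2:
        raise ValueError("need at least two candidate answers")
    deficit = beta - capacity
    if deficit <= 0.0:
        return 1.0  # demand within capacity: the bound is vacuous
    chance = 1.0 / num_answers
    if deficit >= fano_lhs(chance, num_answers):
        return chance  # assumption: clamp at chance level when saturated
    lo, hi = chance, 1.0  # fano_lhs is decreasing on this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fano_lhs(mid, num_answers) >= deficit:
            lo = mid
        else:
            hi = mid
    return lo
```

Under these assumptions the ceiling stays at 1 while β ≤ C and then falls toward chance level as β − C grows, which is the collapse behavior the core claim describes.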
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609.
- [3] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.
- [4] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- [5] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, 2023.
- [6] Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6252–6278, 2024.
- [7] Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, and Xiuying Chen. A cognitive writing perspective for constrained long-form text generation. arXiv preprint arXiv:2502.12568.
- [8] Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–21, 2023.
- [9] Zhuohan Xie, Rui Xing, Yuxia Wang, Jiahui Geng, Hasan Iqbal, Dhruv Sahnan, Iryna Gurevych, and Preslav Nakov. FIRE: Fact-checking with iterative retrieval and verification. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 2901–2914, 2025.
- [10] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [11] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018.
- [12] We provide complete proofs of the conditional Fano inequality and the output entropy bound, together with intuitive interpretations and implications for multi-hop reasoning. A.2.1 Proof of the Conditional Fano Inequality. Setup. Let A be the ground-truth answer, Â = g(Y, Q, C) the prediction derived from the model output Y (allowing the estimator to depend on (Q, C))...
- [13] Remarks. • The proof only uses that Â is a (deterministic) function of Y; if Â were randomized given Y, equation 20 would still hold by the data-processing inequality (conditioning on (Q, C, Y) is at least as informative as conditioning on (Q, C, Â)). • The capacity constant C is taken as the effective single-pass capacity H(Y|Q, C) realized by the decoding policy...