A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Honglin Mu; Kaiyang Wan; Lang Gao; Preslav Nakov; Xiuying Chen; Yuxia Wang

arxiv: 2509.21199 · v3 · submitted 2025-09-25 · 💻 cs.AI

Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

Pith reviewed 2026-05-18 13:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-hop QA · LLM reasoning · accuracy upper bound · Fano inequality · capacity limits · task decomposition · information integration

The pith

Single-pass LLMs face a hard accuracy upper bound in multi-hop QA that collapses once task complexity exceeds their finite capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that single-pass reasoning by large language models on multi-hop questions is constrained by a limited ability to integrate dispersed evidence in one generation. Treating the process as transmission over a noisy channel with finite capacity, the authors derive a Fano-style upper bound on achievable accuracy. This bound predicts that accuracy must fall sharply for tasks whose information requirements outstrip the model's per-pass limit. The analysis then guides a multi-call decomposition method that keeps each step within the safe capacity range.

Core claim

The central claim is that single-pass LLM reasoning in multi-hop QA obeys a Fano-style accuracy upper bound derived by modeling the generation as a noisy communication channel whose capacity is finite. Accuracy therefore inevitably declines once the amount of interdependent evidence that must be combined exceeds this capacity, regardless of prompt engineering.

What carries the argument

The Fano-style accuracy upper bound obtained by casting single-pass LLM output as transmission through a noisy channel with limited capacity for reliable evidence integration.

Load-bearing premise

Single-pass LLM reasoning can be modeled as a noisy communication channel with finite capacity to which Fano's inequality applies directly.

What would settle it

A controlled experiment in which single-pass accuracy stays high while the number of required evidence hops is increased past the estimated capacity point would falsify the predicted collapse.

Figures

Figures reproduced from arXiv: 2509.21199 by Honglin Mu, Kaiyang Wan, Lang Gao, Preslav Nakov, Xiuying Chen, Yuxia Wang.

**Figure 2.** The Accuracy Cliff. The theoretical upper bound on accuracy is plotted against information demand β, using C = 200 as an illustrative example.
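The curve in Figure 2 can be approximated numerically from the Theorem 1 inequality quoted elsewhere on this page, h(Acc) + (1 − Acc) log(|A| − 1) ≥ β − C. The sketch below is a minimal illustration, not the authors' plotting code. It assumes base-2 logarithms (bits), and the function names and the answer-space size `num_answers` are hypothetical choices made here. Because the left-hand side decreases in Acc above chance level 1/|A|, the largest admissible accuracy can be found by bisection.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_lhs(acc: float, num_answers: int) -> float:
    """Left-hand side h(Acc) + (1 - Acc) * log2(|A| - 1) of the quoted bound."""
    return binary_entropy(acc) + (1.0 - acc) * math.log2(num_answers - 1)


def accuracy_upper_bound(beta: float, capacity: float, num_answers: int) -> float:
    """Largest Acc with fano_lhs(Acc) >= beta - capacity (needs num_answers >= 2)."""
    deficit = beta - capacity
    if deficit <= 0.0:
        return 1.0  # demand fits within capacity, so the bound is vacuous
    lo = 1.0 / num_answers  # the left-hand side peaks at chance accuracy
    hi = 1.0
    if fano_lhs(lo, num_answers) <= deficit:
        return lo  # deficit at or beyond log2|A|: no better than chance
    for _ in range(200):  # bisection on the decreasing branch [1/|A|, 1]
        mid = 0.5 * (lo + hi)
        if fano_lhs(mid, num_answers) >= deficit:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Hypothetical setting: C = 200 bits as in Figure 2, |A| = 2**300 answers.
    for beta in (150, 200, 250, 300, 400):
        print(beta, round(accuracy_upper_bound(beta, 200.0, 2**300), 4))
```

Under these assumptions the bound stays at 1 while β ≤ C and falls steadily once β exceeds C, which is the qualitative shape the caption calls the accuracy cliff.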

**Figure 3.** Error Accumulation. Even a small per-step error rate (ε) causes a rapid decay in overall success probability as the number of hops (K) increases. By the chain rule, Pr(Succ) is the product of the conditional success probabilities $p_k$ at each step: $$\Pr(\mathrm{Succ}) = \prod_{k=1}^{K+1} \Pr(S_k \mid S_{<k}) = \prod_{k=1}^{K+1} p_k, \tag{8}$$ $$p_k = \Pr\big(\hat{Z}_k = Z_k \wedge \hat{Z}_k = \phi_k(\hat{Z}_{k-1}, Q, C)\big)$$
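The decay described in the Figure 3 caption is easy to reproduce. The sketch below multiplies per-step success probabilities as in equation (8). It assumes, for illustration only, a uniform per-step success rate p_k = 1 − ε across all K + 1 steps; the function names are hypothetical.

```python
import math
from typing import Sequence


def chain_success_probability(step_success: Sequence[float]) -> float:
    """Product of the conditional per-step success probabilities p_k."""
    return math.prod(step_success)


def uniform_chain_success(epsilon: float, num_hops: int) -> float:
    """Success probability when every one of the K + 1 steps succeeds w.p. 1 - epsilon."""
    return (1.0 - epsilon) ** (num_hops + 1)


if __name__ == "__main__":
    # Illustrative values: a 5% per-step error rate at increasing hop counts.
    for hops in (1, 2, 4, 8, 16):
        print(hops, round(uniform_chain_success(0.05, hops), 4))
```

With ε = 0.05 and K = 10, for example, the product is 0.95 raised to the 11th power, or roughly 0.57, even though each individual step is 95% reliable.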

**Figure 4.** The InfoQA framework integrates three key components: (1) …

**Figure 5.** Qwen3-14B F1 vs. theoretical curves across single-pass methods. The x-axis shows the …

**Figure 6.** Qwen3-8B’s Empirical F1 vs. theoretical curves across single-pass methods. The x-axis …

Original abstract

Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: InfoQA (https://github.com/KaiyangWan/InfoQA).
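To make the idea of multi-call decomposition with trace pruning concrete, here is a hypothetical sketch. It is not the InfoQA implementation and does not use its API: the function `answer_multi_hop`, the `{prev}` placeholder convention, and the `llm` callable are all assumptions introduced for illustration. Each call sees only the context, one sub-question, and the single most recent intermediate answer, so earlier reasoning traces never accumulate in the prompt.

```python
from typing import Callable, List


def answer_multi_hop(
    sub_questions: List[str],
    context: str,
    llm: Callable[[str], str],
    max_carry_chars: int = 200,
) -> str:
    """Answer a chain of sub-questions, one model call per hop.

    Each template may contain the placeholder "{prev}", which is filled with
    the previous hop's answer. Only that answer is carried forward (pruning),
    truncated to max_carry_chars to cap the per-call information load.
    """
    carried = ""
    for template in sub_questions:
        question = template.replace("{prev}", carried)
        prompt = f"Context: {context}\nQuestion: {question}"
        carried = llm(prompt).strip()[:max_carry_chars]
    return carried
```

Any callable that maps a prompt string to a response string can stand in for `llm`, which keeps the sketch independent of a particular model provider.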

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Fano-style bound is the central new claim but its direct mapping from LLM generation to a standard noisy channel looks under-justified.

The paper derives a Fano-style upper bound on single-pass accuracy for multi-hop QA and uses it to motivate InfoQA, a multi-call framework that decomposes tasks, prunes traces, and follows explicit dependencies to stay inside per-pass capacity limits. Experiments on a noise-rich benchmark show model behavior tracking the predicted capacity curves and some gains from the framework.

That combination of a theoretical ceiling plus a concrete implementation is what is actually new here. Prior work has discussed capacity issues in reasoning, but this specific adaptation of Fano's inequality to single-pass LLM multi-hop performance does not appear in the cited literature. The experiments provide at least qualitative support for the capacity-collapse idea, which is worth having on record. The bound itself is presented as general principle rather than a fitted curve, which is a positive step.

The main soft spot is the channel model. Treating autoregressive token generation as a memoryless channel with fixed mutual information capacity independent of attention patterns requires explicit justification that the paper's derivation steps would need to supply. Without that, the bound functions more as an inspirational analogy than a tight information-theoretic limit. The free parameter for effective capacity also needs careful definition so readers can see it is not post-hoc. The benchmark construction and controls look reasonable from the description, though full verification would require the details.

This paper is for groups working on multi-step reasoning architectures and theoretical limits rather than immediate production systems. A reader already thinking about capacity constraints would extract usable ideas from the framework and the capacity curves. It is coherent enough on its own terms to deserve a serious referee who can check the derivation and the experimental alignment.

Referee Report

2 major / 2 minor

Summary. The paper claims to derive a Fano-style accuracy upper bound for single-pass LLM reasoning in multi-hop QA, showing that accuracy collapses when task complexity exceeds finite model capacity. It introduces the InfoQA multi-call framework with capacity-aware decomposition and pruning, and validates both the bound and framework on a new noise-rich benchmark where model behavior aligns with predicted capacity curves.

Significance. If the bound is rigorously derived and the channel model holds, the work supplies a theoretical ceiling that explains single-pass failures in MHQA and motivates capacity-aware multi-step methods. The open-source InfoQA implementation and the new benchmark are concrete contributions that could be reused even if the bound requires refinement.

major comments (2)

§3 (Fano-style bound derivation): the mapping of autoregressive single-pass LLM generation to a memoryless noisy channel with finite capacity C is asserted without explicit justification that attention patterns or token-by-token conditioning preserve the standard Fano assumptions; the error-probability form P_e ≥ (H(X|Y)−1)/log|X| therefore rests on an unverified modeling step that is load-bearing for the central collapse claim.
§4 (experimental validation): the reported alignment between observed accuracy and the predicted capacity curves is presented without ablation of the effective-capacity parameter or controls that isolate the bound from prompt-engineering effects; this leaves open whether the curves confirm the theoretical ceiling or merely reflect empirical scaling.
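The weak Fano form quoted in the first major comment, P_e ≥ (H(X|Y) − 1)/log|X|, can be checked on a toy channel. The sketch below is illustrative only: the symmetric 4-symbol channel and all function names are assumptions made here, and entropies are in bits. It computes H(X|Y) from a joint table, evaluates the lower bound, and compares it with the error of the best possible (MAP) decoder.

```python
import math
from typing import List


def symmetric_joint(num_symbols: int, p_correct: float) -> List[List[float]]:
    """Joint P(X = x, Y = y): X uniform, Y = X w.p. p_correct, else uniform over the rest."""
    p_wrong = (1.0 - p_correct) / (num_symbols - 1)
    return [
        [(p_correct if x == y else p_wrong) / num_symbols for y in range(num_symbols)]
        for x in range(num_symbols)
    ]


def conditional_entropy(joint: List[List[float]]) -> float:
    """H(X|Y) in bits for a joint table joint[x][y]."""
    total = 0.0
    for y in range(len(joint[0])):
        p_y = sum(row[y] for row in joint)
        for row in joint:
            if row[y] > 0.0:
                total -= row[y] * math.log2(row[y] / p_y)
    return total


def weak_fano_lower_bound(joint: List[List[float]]) -> float:
    """Lower bound (H(X|Y) - 1) / log2|X| on the error probability, floored at 0."""
    return max(0.0, (conditional_entropy(joint) - 1.0) / math.log2(len(joint)))


def map_error(joint: List[List[float]]) -> float:
    """Error probability of the MAP decoder, the smallest achievable error."""
    correct = sum(max(row[y] for row in joint) for y in range(len(joint[0])))
    return 1.0 - correct


if __name__ == "__main__":
    table = symmetric_joint(4, 0.7)
    print(round(weak_fano_lower_bound(table), 4), round(map_error(table), 4))
```

For this toy channel the bound is loose: it guarantees an error of about 0.18 while the best decoder errs 30% of the time. The inequality says nothing about how a real model's channel should be defined, which is the referee's point.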

minor comments (2)

Notation for the effective capacity C is introduced without a clear operational definition or estimation procedure from model internals.
The benchmark construction paragraph should include explicit statistics on hop count, evidence dispersion, and injected noise levels to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These have highlighted important areas where additional clarification and controls can strengthen the presentation of both the theoretical bound and its empirical validation. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.

Referee: §3 (Fano-style bound derivation): the mapping of autoregressive single-pass LLM generation to a memoryless noisy channel with finite capacity C is asserted without explicit justification that attention patterns or token-by-token conditioning preserve the standard Fano assumptions; the error-probability form P_e ≥ (H(X|Y)−1)/log|X| therefore rests on an unverified modeling step that is load-bearing for the central collapse claim.

Authors: We agree that the modeling step merits more explicit justification. In the revised §3 we will expand the derivation to clarify that we treat the full single-pass input-to-output mapping (question plus evidence to generated answer) as a discrete memoryless channel whose capacity C is bounded by the model’s finite parameters and context length. Under this abstraction the standard Fano inequality applies directly to the overall channel, independent of the internal autoregressive conditioning or attention patterns, which are subsumed into the channel transition probabilities. We will add a short paragraph with supporting references from the information-theoretic literature on neural channels to make this modeling choice transparent while preserving the collapse claim.
Revision: yes

Referee: §4 (experimental validation): the reported alignment between observed accuracy and the predicted capacity curves is presented without ablation of the effective-capacity parameter or controls that isolate the bound from prompt-engineering effects; this leaves open whether the curves confirm the theoretical ceiling or merely reflect empirical scaling.

Authors: We acknowledge that stronger controls are needed to isolate the capacity effect. In the revised manuscript we will augment §4 with (i) an ablation that varies effective capacity by using models of different sizes and by systematically truncating context length, and (ii) a comparison against standard prompt-engineering baselines (e.g., chain-of-thought without capacity-aware decomposition) to show that the observed alignment with the theoretical curves is not explained by prompting alone. These additions will be presented with new figures and tables so that readers can directly assess whether the data support the predicted capacity ceiling.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper applies the standard Fano inequality from information theory to a modeled noisy-channel view of single-pass LLM reasoning in MHQA, yielding an accuracy upper bound that depends on task complexity exceeding finite capacity C. This step invokes an external theorem rather than defining the bound in terms of itself, fitting parameters to the target data and renaming them as predictions, or relying on load-bearing self-citations whose content reduces to the present claim. The subsequent InfoQA framework is constructed on top of the bound but does not retroactively alter the bound's derivation; experimental alignment with capacity curves is presented as validation, not as the source of the bound. The derivation therefore remains self-contained against the external benchmark of Fano's inequality.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on modeling LLM single-pass reasoning as a finite-capacity noisy channel and invoking Fano's inequality; no new physical entities are introduced.

free parameters (1)

effective model capacity
The bound requires a numerical capacity value for the LLM; this is likely estimated or chosen to fit observed behavior.

axioms (1)

Fano's inequality (standard math)
Standard information-theoretic result used to bound error probability given conditional entropy or mutual information.
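How Fano's inequality leads to the Theorem 1 form quoted on this page can be sketched in a few lines. This is a reconstruction under stated assumptions, not the paper's proof: β is read as the conditional answer entropy H(A | Q, 𝒞), the context is written 𝒞 to keep it distinct from the capacity C, and C is taken to upper-bound the output entropy H(Y | Q, 𝒞).

```latex
\begin{align}
\beta &:= H(A \mid Q, \mathcal{C})
       = I(A; Y \mid Q, \mathcal{C}) + H(A \mid Y, Q, \mathcal{C}) \\
I(A; Y \mid Q, \mathcal{C}) &\le H(Y \mid Q, \mathcal{C}) \le C \\
H(A \mid Y, Q, \mathcal{C}) &\le h(P_e) + P_e \log(|\mathcal{A}| - 1),
  \qquad P_e = 1 - \mathrm{Acc} \\
\Rightarrow\quad \beta - C &\le h(\mathrm{Acc}) + (1 - \mathrm{Acc}) \log(|\mathcal{A}| - 1)
\end{align}
```

The last line uses the symmetry h(P_e) = h(1 − P_e) = h(Acc). The step that carries the modeling weight is the second line, which is where the finite-capacity assumption enters.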

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/InformationTheory or Cost.FunctionalEquation reality_from_one_distinction or washburn_uniqueness_aczel unclear

Theorem 1 (A Fano-Style Accuracy Upper Bound... h(Acc) + (1 − Acc) log(|A| − 1) ≥ β − C

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 5 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609.

[3] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.

[4] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[5] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711.

[6] Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6252–6278.

[7] Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, and Xiuying Chen. A cognitive writing perspective for constrained long-form text generation. arXiv preprint arXiv:2502.12568.

[8] Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why Johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–21.

[9] Zhuohan Xie, Rui Xing, Yuxia Wang, Jiahui Geng, Hasan Iqbal, Dhruv Sahnan, Iryna Gurevych, and Preslav Nakov. FIRE: Fact-checking with iterative retrieval and verification. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 2901–2914.

[10] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[11] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.
[12] No matter how smart the model is, if the task demands more information than the output can encode, an error plateau is inevitable
We provide complete proofs of the conditional Fano inequality and the output entropy bound, together with intuitive interpretations and implications for multi-hop reasoning. A.2.1 Proof of the Conditional Fano Inequality. Setup. Let A be the ground-truth answer, Â = g(Y, Q, C) the prediction derived from the model output Y (allowing the estimator to depend on (Q, C))...

[13] accuracy cliff
Remarks. • The proof only uses that Â is a (deterministic) function of Y; if Â were randomized given Y, equation 20 would still hold by the data-processing inequality (conditioning on (Q, C, Y) is at least as informative as conditioning on (Q, C, Â)). • The capacity constant C is taken as the effective single-pass capacity H(Y|Q, C) realized by the decoding policy...
