
arxiv: 2509.21199 · v3 · submitted 2025-09-25 · 💻 cs.AI

A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Pith reviewed 2026-05-18 13:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-hop QA, LLM reasoning, accuracy upper bound, Fano inequality, capacity limits, task decomposition, information integration

The pith

Single-pass LLMs face a hard accuracy upper bound in multi-hop QA that collapses once task complexity exceeds their finite capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that single-pass reasoning by large language models on multi-hop questions is constrained by a limited ability to integrate dispersed evidence in one generation. Treating the process as transmission over a noisy channel with finite capacity, the authors derive a Fano-style upper bound on achievable accuracy. This bound predicts that accuracy must fall sharply for tasks whose information requirements outstrip the model's per-pass limit. The analysis then guides a multi-call decomposition method that keeps each step within the safe capacity range.

Core claim

The central claim is that single-pass LLM reasoning in multi-hop QA obeys a Fano-style accuracy upper bound derived by modeling the generation as a noisy communication channel whose capacity is finite. Accuracy therefore inevitably declines once the amount of interdependent evidence that must be combined exceeds this capacity, regardless of prompt engineering.
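To make the shape of such a ceiling concrete, here is a minimal sketch using the weak textbook form of Fano's inequality rather than the paper's exact expression. It assumes answers are uniform over 2^β options (so the information demand is β bits) and that the output carries at most C bits about the answer; under those assumptions accuracy is at most (C + 1)/β. The function name and parameters are illustrative, not taken from the paper.

```python
def accuracy_upper_bound(beta_bits: float, capacity_bits: float) -> float:
    """Weak Fano-style ceiling on accuracy.

    Assumptions (not from the paper): answers are uniform over 2**beta_bits
    options and the output conveys at most capacity_bits about the answer.
    Then P_e >= (beta - C - 1) / beta, so accuracy <= (C + 1) / beta.
    """
    if beta_bits <= 0:
        raise ValueError("beta_bits must be positive")
    # The ceiling is vacuous (equal to 1) until demand exceeds C + 1 bits.
    return min(1.0, (capacity_bits + 1.0) / beta_bits)


if __name__ == "__main__":
    C = 200.0  # illustrative capacity, matching the value used in Figure 2
    for beta in (100, 201, 402, 804):
        print(f"beta={beta:4d} bits -> accuracy <= {accuracy_upper_bound(beta, C):.3f}")
```

Under this simplified form the ceiling stays at 1 while β ≤ C + 1 and then decays as 1/β. The paper's own curve may be sharper, since it is derived from a tighter version of the inequality.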

What carries the argument

The Fano-style accuracy upper bound obtained by casting single-pass LLM output as transmission through a noisy channel with limited capacity for reliable evidence integration.

Load-bearing premise

Single-pass LLM reasoning can be modeled as a noisy communication channel with finite capacity to which Fano's inequality applies directly.

What would settle it

A controlled experiment in which single-pass accuracy stays high while the number of required evidence hops is increased past the estimated capacity point would falsify the predicted collapse.

Figures

Figures reproduced from arXiv: 2509.21199 by Honglin Mu, Kaiyang Wan, Lang Gao, Preslav Nakov, Xiuying Chen, Yuxia Wang.

Figure 1: Comparison of single-pass and multi-call reasoning paradigms. Single-pass reasoning is …
Figure 2: The Accuracy Cliff. The theoretical upper bound on accuracy is plotted against information demand β, using C = 200 as an illustrative example.
Figure 3: Error Accumulation. Even a small per-step error rate (ε) causes a rapid decay in overall success probability as the number of hops (K) increases. By the chain rule, Pr(Succ) is the product of the conditional success probabilities $p_k$ at each step: $\Pr(\mathrm{Succ}) = \prod_{k=1}^{K+1} \Pr(S_k \mid S_{<k}) = \prod_{k=1}^{K+1} p_k$ (8), where $p_k = \Pr(\hat{Z}_k = Z_k \wedge \hat{Z}_k = \phi_k(\hat{Z}_{k-1}, Q, C))$ …
Figure 4: The InfoQA framework integrates three key components: (1) …
Figure 5: Qwen3-14B F1 vs. theoretical curves across single-pass methods. The x-axis shows the …
Figure 6: Qwen3-8B's empirical F1 vs. theoretical curves across single-pass methods. The x-axis …
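The chain-rule product in Eq. (8) from the Figure 3 caption can be evaluated directly. The sketch below is illustrative only: `chain_success` multiplies arbitrary per-step success probabilities, and `uniform_chain_success` adds the simplifying assumption (mine, not the paper's) that every step shares the same error rate ε, giving (1 − ε)^(K+1) for K hops plus the final answer step.

```python
import math
from typing import Iterable


def chain_success(step_probs: Iterable[float]) -> float:
    """Overall success as the product of conditional per-step successes p_k."""
    return math.prod(step_probs)


def uniform_chain_success(eps: float, hops: int) -> float:
    """Special case p_k = 1 - eps for all K + 1 factors (an assumption)."""
    return (1.0 - eps) ** (hops + 1)


if __name__ == "__main__":
    for hops in (1, 4, 9, 19):
        print(f"K={hops:2d}, eps=0.05 -> Pr(Succ) = {uniform_chain_success(0.05, hops):.3f}")
```

Even with ε fixed at a few percent, the product shrinks geometrically in K, which is the decay the figure describes.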
Original abstract

Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: InfoQA (https://github.com/KaiyangWan/InfoQA).
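As a rough illustration of the multi-call idea the abstract describes (decomposition plus pruning of earlier traces), here is a toy loop. It is not the InfoQA implementation: the decomposition is written by hand, `toy_model` is a lookup table standing in for an LLM call, and all names (`multi_call_answer`, `ask`, `{prev}`) are hypothetical.

```python
from typing import Callable, Dict, List


def multi_call_answer(sub_questions: List[str], ask: Callable[[str], str]) -> str:
    """Answer chained sub-questions one call at a time.

    Each template may reference {prev}, the previous step's answer. Only that
    answer is carried forward (earlier prompts and traces are dropped), so
    every call sees a short, bounded prompt.
    """
    carried = ""  # pruned state: nothing but the latest intermediate answer
    for template in sub_questions:
        prompt = template.format(prev=carried)
        carried = ask(prompt).strip()
    return carried


# Toy stand-in for a model call (assumption: real use would query an LLM).
FACTS: Dict[str, str] = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}


def toy_model(prompt: str) -> str:
    return FACTS.get(prompt, "unknown")


if __name__ == "__main__":
    steps = ["Who directed Inception?", "Where was {prev} born?"]
    print(multi_call_answer(steps, toy_model))
```

The design choice worth noting is that the dependency between steps is explicit in the templates, so a wrong intermediate answer is visible at the step where it occurs instead of being buried in one long generation.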

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to derive a Fano-style accuracy upper bound for single-pass LLM reasoning in multi-hop QA, showing that accuracy collapses when task complexity exceeds finite model capacity. It introduces the InfoQA multi-call framework with capacity-aware decomposition and pruning, and validates both the bound and framework on a new noise-rich benchmark where model behavior aligns with predicted capacity curves.

Significance. If the bound is rigorously derived and the channel model holds, the work supplies a theoretical ceiling that explains single-pass failures in MHQA and motivates capacity-aware multi-step methods. The open-source InfoQA implementation and the new benchmark are concrete contributions that could be reused even if the bound requires refinement.

major comments (2)
  1. [§3] §3 (Fano-style bound derivation): the mapping of autoregressive single-pass LLM generation to a memoryless noisy channel with finite capacity C is asserted without explicit justification that attention patterns or token-by-token conditioning preserve the standard Fano assumptions; the error-probability form P_e ≥ (H(X|Y)−1)/log|X| therefore rests on an unverified modeling step that is load-bearing for the central collapse claim.
  2. [§4] §4 (experimental validation): the reported alignment between observed accuracy and the predicted capacity curves is presented without ablation of the effective-capacity parameter or controls that isolate the bound from prompt-engineering effects; this leaves open whether the curves confirm the theoretical ceiling or merely reflect empirical scaling.
minor comments (2)
  1. Notation for the effective capacity C is introduced without a clear operational definition or estimation procedure from model internals.
  2. The benchmark construction paragraph should include explicit statistics on hop count, evidence dispersion, and injected noise levels to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These have highlighted important areas where additional clarification and controls can strengthen the presentation of both the theoretical bound and its empirical validation. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.

Point-by-point responses
  1. Referee: [§3] §3 (Fano-style bound derivation): the mapping of autoregressive single-pass LLM generation to a memoryless noisy channel with finite capacity C is asserted without explicit justification that attention patterns or token-by-token conditioning preserve the standard Fano assumptions; the error-probability form P_e ≥ (H(X|Y)−1)/log|X| therefore rests on an unverified modeling step that is load-bearing for the central collapse claim.

    Authors: We agree that the modeling step merits more explicit justification. In the revised §3 we will expand the derivation to clarify that we treat the full single-pass input-to-output mapping (question plus evidence to generated answer) as a discrete memoryless channel whose capacity C is bounded by the model’s finite parameters and context length. Under this abstraction the standard Fano inequality applies directly to the overall channel, independent of the internal autoregressive conditioning or attention patterns, which are subsumed into the channel transition probabilities. We will add a short paragraph with supporting references from the information-theoretic literature on neural channels to make this modeling choice transparent while preserving the collapse claim. revision: yes

  2. Referee: [§4] §4 (experimental validation): the reported alignment between observed accuracy and the predicted capacity curves is presented without ablation of the effective-capacity parameter or controls that isolate the bound from prompt-engineering effects; this leaves open whether the curves confirm the theoretical ceiling or merely reflect empirical scaling.

    Authors: We acknowledge that stronger controls are needed to isolate the capacity effect. In the revised manuscript we will augment §4 with (i) an ablation that varies effective capacity by using models of different sizes and by systematically truncating context length, and (ii) a comparison against standard prompt-engineering baselines (e.g., chain-of-thought without capacity-aware decomposition) to show that the observed alignment with the theoretical curves is not explained by prompting alone. These additions will be presented with new figures and tables so that readers can directly assess whether the data support the predicted capacity ceiling. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper applies the standard Fano inequality from information theory to a modeled noisy-channel view of single-pass LLM reasoning in MHQA, yielding an accuracy upper bound that depends on task complexity exceeding finite capacity C. This step invokes an external theorem rather than defining the bound in terms of itself, fitting parameters to the target data and renaming them as predictions, or relying on load-bearing self-citations whose content reduces to the present claim. The subsequent InfoQA framework is constructed on top of the bound but does not retroactively alter the bound's derivation; experimental alignment with capacity curves is presented as validation, not as the source of the bound. The derivation therefore remains self-contained against the external benchmark of Fano's inequality.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on modeling LLM single-pass reasoning as a finite-capacity noisy channel and invoking Fano's inequality; no new physical entities are introduced.

free parameters (1)
  • effective model capacity
    The bound requires a numerical capacity value for the LLM; this is likely estimated or chosen to fit observed behavior.
axioms (1)
  • Fano's inequality (standard math)
    Standard information-theoretic result used to bound error probability given conditional entropy or mutual information.
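For readers who want the step from Fano's inequality to an accuracy ceiling spelled out, the following is a sketch under stated assumptions and not the paper's own derivation. It writes $A$ for the answer with alphabet $\mathcal{A}$, $Y$ for the single-pass output, $\beta = H(A)$ for the information demand in bits, and assumes $I(A;Y) \le C$; $h(\cdot)$ is the binary entropy function. The referee's form $P_e \ge (H(X|Y)-1)/\log|X|$ is the first line with $X$ in place of $A$.

```latex
\begin{align}
H(A \mid Y) &\le h(P_e) + P_e \log_2\bigl(|\mathcal{A}| - 1\bigr)
             \le 1 + P_e \log_2 |\mathcal{A}|, \\
H(A \mid Y) &= H(A) - I(A; Y) \ge \beta - C, \\
P_e &\ge \frac{\beta - C - 1}{\log_2 |\mathcal{A}|}
\quad\Longrightarrow\quad
1 - P_e \le \frac{C + 1}{\beta}
\quad \text{when } A \text{ is uniform, so } \beta = \log_2 |\mathcal{A}|.
\end{align}
```

The bound is informative only when $\beta > C + 1$, which is where the predicted collapse begins. How $C$ is estimated for a given model remains the free parameter listed above.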




Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609.

  3. [3]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.

  4. [4]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

  5. [5]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711.

  6. [6]

    Assisting in writing Wikipedia-like articles from scratch with large language models

    Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6252–6278.

  7. [7]

    A cognitive writing perspective for constrained long-form text generation

    Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, and Xiuying Chen. A cognitive writing perspective for constrained long-form text generation. arXiv preprint arXiv:2502.12568.

  8. [8]

    Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts

    Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI conference on human factors in computing systems, pp. 1–21.

  9. [9]

    Fire: Fact-checking with iterative retrieval and verification

    Zhuohan Xie, Rui Xing, Yuxia Wang, Jiahui Geng, Hasan Iqbal, Dhruv Sahnan, Iryna Gurevych, and Preslav Nakov. Fire: Fact-checking with iterative retrieval and verification. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 2901–2914.

  10. [10]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  11. [11]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.

  12. [12]

    No matter how smart the model is, if the task demands more information than the output can encode, an error plateau is inevitable

    We provide complete proofs of the conditional Fano inequality and the output entropy bound, together with intuitive interpretations and implications for multi-hop reasoning. A.2.1 Proof of the Conditional Fano Inequality. Setup. Let A be the ground-truth answer, Â = g(Y, Q, C) the prediction derived from the model output Y (allowing the estimator to depend on (Q, C))...

  13. [13]

    accuracy cliff

    Remarks. • The proof only uses that Â is a (deterministic) function of Y; if Â were randomized given Y, equation 20 would still hold by the data-processing inequality (conditioning on (Q, C, Y) is at least as informative as conditioning on (Q, C, Â)). • The capacity constant C is taken as the effective single-pass capacity H(Y|Q, C) realized by the decoding policy...