On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners

David Mguni; Julian Ma; Jun Wang

arxiv: 2606.23668 · v1 · pith:AMMQWDHQnew · submitted 2026-06-22 · 💻 cs.LG

Pith reviewed 2026-06-26 09:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords: large language models, prompting, expressivity floor, PAC-Bayes bounds, cheap-talk game, alignment constraints, generalization limits, task inference

The pith

Prompt-conditioned language models face an expressivity floor from language's limited capacity, making correct behavior unattainable for some task families even with infinite data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models user-LLM interactions as a bilevel cheap-talk game to show how tasks are encoded into prompts and reinterpreted under alignment constraints. It decomposes task inference from execution and applies PAC-Bayes bounds to separate removable estimation error from fixed structural limits. This produces two floors: an expressivity floor, where task complexity exceeds the capacity of the language channel and some tasks become indistinguishable, and an objective-misalignment floor, where alignment constraints restrict the output set. If these floors hold, prompt-only LLMs cannot serve as universal solvers because certain task families retain positive error no matter the data volume, optimization, or scale. The work indicates that richer interfaces beyond language prompts are needed to supply more task information.

Core claim

Prompt-conditioned LLMs are not universal problem solvers through prompting alone, as there exist task families for which correct behaviour is provably unattainable even in the infinite-data regime. This follows from language acting as a capacity-limited communication channel; when the informational complexity of a task family exceeds the capacity of that channel, distinct tasks become unavoidably indistinguishable to the solver, inducing a strictly positive error floor that cannot be eliminated by additional data, optimisation, or model scaling alone. Alignment constraints add a second irreducible distortion when the user-ideal distribution lies outside the admissible output set.
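The indistinguishability argument can be illustrated with a toy counting model. The sketch below is not the paper's construction: it assumes equally likely tasks, a noiseless channel that carries at most a fixed number of distinct prompts, 0-1 loss, and a solver that decodes each prompt to a single task. The names `best_case_error`, `num_tasks`, and `num_messages` are hypothetical.

```python
from fractions import Fraction


def best_case_error(num_tasks: int, num_messages: int) -> Fraction:
    """Lowest achievable 0-1 error when num_tasks equally likely tasks
    must be communicated through at most num_messages distinct prompts.

    Each prompt can be decoded to only one task, so at most
    min(num_tasks, num_messages) tasks are ever answered correctly.
    """
    if num_tasks <= 0 or num_messages <= 0:
        raise ValueError("num_tasks and num_messages must be positive")
    solvable = min(num_tasks, num_messages)
    return 1 - Fraction(solvable, num_tasks)


# Channel large enough: no floor.
print(best_case_error(num_tasks=8, num_messages=8))   # 0
# Task family exceeds the channel: a floor appears that depends only on
# the two counts, so additional training data cannot change it.
print(best_case_error(num_tasks=8, num_messages=2))   # 3/4
```

In this toy setting the floor is a property of the interface alone, which is the sense in which the core claim calls it irreducible.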

What carries the argument

A bilevel cheap-talk game models how latent tasks are encoded into prompts and reinterpreted under alignment constraints. PAC-Bayes bounds applied to this game yield the expressivity floor and separate estimation error from structural limitations.
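To make the split between estimation error and structural limits concrete, the sketch below evaluates the complexity term of a standard McAllester-style PAC-Bayes bound for losses in [0, 1] and adds a constant standing in for the structural term. The paper's own bound may take a different form; the function names, the KL value, and the floor value of 0.25 are illustrative assumptions.

```python
import math


def pac_bayes_gap(kl: float, n: int, delta: float) -> float:
    """Complexity term sqrt((KL + ln(2*sqrt(n)/delta)) / (2n)) of a
    McAllester-style PAC-Bayes bound for losses in [0, 1]."""
    if n <= 0 or not 0 < delta < 1 or kl < 0:
        raise ValueError("need n > 0, 0 < delta < 1, kl >= 0")
    return math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))


def risk_upper_bound(floor: float, kl: float, n: int, delta: float) -> float:
    """Hypothetical decomposition: structural floor plus estimation term.
    Assumes the empirical risk equals the floor (an idealisation)."""
    return floor + pac_bayes_gap(kl, n, delta)


for n in (10**2, 10**4, 10**6):
    # The estimation term shrinks with n; the floor (0.25 here, an
    # arbitrary illustrative value) does not.
    print(n, round(risk_upper_bound(0.25, kl=5.0, n=n, delta=0.05), 4))
```

The printed bound approaches the floor from above as the sample size grows, which is the qualitative behaviour the summary describes.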

Load-bearing premise

Language functions as a capacity-limited communication channel whose informational capacity can be strictly exceeded by the complexity of certain task families.

What would settle it

A demonstration that error on every task family can be driven to zero by increasing data volume and model size alone would falsify the existence of an expressivity floor.

Figures

Figures reproduced from arXiv: 2606.23668 by David Mguni, Julian Ma, Jun Wang.

**Figure 1.** Conceptual diagram of the informational and objective bottlenecks that arise in prompt-based interaction, without making claims about internal and physical modularity within the model.

Original abstract

Large Language Models (LLMs) are frequently portrayed as general-purpose solvers capable of solving arbitrary tasks. We argue that this view overlooks a fundamental constraint: language is a compressed and capacity-limited interface for conveying task information. Modelling User-System interaction as a bilevel *cheap-talk* game, we analyse how latent tasks are encoded into prompts and reinterpreted under alignment and safety constraints. We introduce a conceptual decomposition separating task inference from execution and derive PAC-Bayes bounds that distinguish finite-sample estimation error from irreducible structural limitations. Our first main result establishes an *expressivity floor*: language acts as a capacity-limited communication channel, and whenever the informational complexity of a task family exceeds the capacity of that channel, distinct tasks become unavoidably indistinguishable to the Solver, inducing a strictly positive error floor that cannot be eliminated by additional data, optimisation, or model scaling alone. We then establish an *objective-misalignment floor*: when alignment constraints restrict the admissible output set, the User-ideal distribution may lie outside the feasible class, inducing an irreducible distortion. Together, these results yield a formal negative conclusion: prompt-conditioned LLMs are not universal problem solvers through prompting alone, as there exist task families for which correct behaviour is provably unattainable even in the infinite-data regime. More broadly, our analysis shows that the limits of prompt-based generalisation arise from information-constrained communication and alignment-constrained objectives. This suggests that interfaces beyond natural language, including multimodal observations and external memory, may reduce the inherent LLM limitations by increasing the task-relevant information available to the System.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives error floors for prompt LLMs via cheap-talk games and PAC-Bayes, but the fixed-capacity assumption on language channels is the part that needs checking.

The main thing to know is that this work models user-LLM interaction as a bilevel cheap-talk game, then applies PAC-Bayes bounds to split finite-sample error from two irreducible floors: one from task complexity exceeding language channel capacity (expressivity floor), and one from alignment constraints pushing the feasible outputs away from the ideal (misalignment floor). The claim is that some task families cannot be solved correctly by prompting alone, even with infinite data.

What is new is the explicit formulation of those two floors as consequences of the game plus the bounds, plus the separation of inference from execution. The abstract lays this out directly as a negative result on prompt-based generality.

The paper does a solid job using established tools to formalize a limit that many have suspected informally. It also correctly notes that the argument points toward needing richer interfaces like multimodal input or external memory.

The soft spot is the modeling choice that language capacity is strictly bounded and independent of prompt length or structure. The stress-test concern lands here: if longer or more structured prompts can increase effective capacity without bound, the claimed separation between estimation error and structural error does not follow. The abstract does not show why alignment or safety constraints would keep capacity fixed even as prompts grow, so the positive floor may not be as general as stated. Without the full derivations it is hard to tell how tightly the bounds are tied to that assumption.
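The prompt-length concern can be made concrete with a simple count. The sketch below is a toy model, not the paper's channel definition: it assumes prompts are token sequences over a finite vocabulary and that every distinct sequence is distinguishable. The function names and parameters are hypothetical.

```python
import math


def max_distinct_prompts(vocab_size: int, max_len: int) -> int:
    """Number of token sequences of length 0..max_len over a vocabulary
    of vocab_size symbols (an upper bound on distinguishable prompts)."""
    if vocab_size < 1 or max_len < 0:
        raise ValueError("need vocab_size >= 1 and max_len >= 0")
    return sum(vocab_size ** length for length in range(max_len + 1))


def capacity_bits(vocab_size: int, max_len: int) -> float:
    """log2 of the prompt count: an upper bound on the bits of task
    information a single prompt can carry in this toy model."""
    return math.log2(max_distinct_prompts(vocab_size, max_len))


def min_length_to_separate(num_tasks: int, vocab_size: int) -> int:
    """Smallest max_len at which every task can get its own prompt."""
    if vocab_size < 1:
        raise ValueError("need vocab_size >= 1")
    max_len = 0
    while max_distinct_prompts(vocab_size, max_len) < num_tasks:
        max_len += 1
    return max_len


# Capacity grows roughly linearly in prompt length ...
print([round(capacity_bits(2, n), 2) for n in (1, 4, 16)])
# ... so any finite task family is eventually separable.
print(min_length_to_separate(num_tasks=1000, vocab_size=2))  # 9
```

Under these assumptions a floor survives at every finite prompt length only if the task family is unbounded or its complexity grows at least as fast as the prompt budget, which is why the capacity premise needs an explicit statement.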

This is for people working on theoretical limits of LLMs and generalization. It shows clear engagement with the modeling problem and is worth a serious referee to verify the bounds and the capacity premise.

Referee Report

2 major / 2 minor

Summary. The paper models User-System interactions as a bilevel cheap-talk game under alignment constraints and applies PAC-Bayes analysis to derive two floors: an expressivity floor (when task-family complexity exceeds the fixed capacity of the language channel, tasks become indistinguishable and a positive error floor persists at infinite data) and an objective-misalignment floor (when alignment restricts the output set away from the User-ideal distribution). The central conclusion is that prompt-conditioned LLMs cannot be universal solvers, as certain task families are provably unsolvable through prompting alone.

Significance. If the bounds are valid, the work supplies a formal information-theoretic account of why prompting cannot overcome all generalization limits, distinguishing estimation error from irreducible structural error and motivating non-linguistic interfaces. The use of cheap-talk games and PAC-Bayes tools to obtain explicit negative results on universality is a constructive contribution to the theory of LLM capabilities.

major comments (2)

[bilevel cheap-talk game and expressivity floor derivation] The derivation of the expressivity floor (abstract and the bilevel cheap-talk game section) treats the language channel capacity as strictly finite and independent of prompt length. This modeling choice is load-bearing: if effective capacity can grow without bound via longer or structured prompts, the claimed separation between finite-sample error and an irreducible positive floor does not follow from the stated premises.
[PAC-Bayes bounds] The PAC-Bayes bounds that separate estimation error from structural limitations (the section introducing the bounds) rely on the capacity bound remaining fixed once alignment/safety constraints are imposed. The manuscript should supply the explicit dependence (or independence) of the capacity term on prompt length and show that the floor remains positive when prompt length is allowed to vary.

minor comments (2)

Notation for the task family complexity measure and the channel capacity should be introduced with a single consistent symbol and cross-referenced to the game definition.
The discussion of related work on communication complexity and alignment constraints is brief; adding two or three key citations would clarify the novelty of the bilevel formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the modeling assumptions around channel capacity that underpin our negative results. We address each major comment below.

Referee: [bilevel cheap-talk game and expressivity floor derivation] The derivation of the expressivity floor (abstract and the bilevel cheap-talk game section) treats the language channel capacity as strictly finite and independent of prompt length. This modeling choice is load-bearing: if effective capacity can grow without bound via longer or structured prompts, the claimed separation between finite-sample error and an irreducible positive floor does not follow from the stated premises.

Authors: The manuscript models the language channel capacity as finite for any given prompt to isolate the structural information bottleneck. We agree that the current text does not supply an explicit functional dependence on prompt length. In the revision we will add a paragraph in the bilevel cheap-talk game section stating the capacity term explicitly as a function of prompt length (via the mutual-information bound) and showing that the expressivity floor remains strictly positive for every finite length. revision: yes
Referee: [PAC-Bayes bounds] The PAC-Bayes bounds that separate estimation error from structural limitations (the section introducing the bounds) rely on the capacity bound remaining fixed once alignment/safety constraints are imposed. The manuscript should supply the explicit dependence (or independence) of the capacity term on prompt length and show that the floor remains positive when prompt length is allowed to vary.

Authors: The PAC-Bayes derivation conditions on a fixed capacity once the language interface and alignment constraints are chosen. We accept that the dependence on prompt length must be stated explicitly. The revised manuscript will include this dependence in the bounds section and verify that the objective-misalignment floor (and the expressivity floor) stay positive for any finite prompt length, while noting that unbounded lengths fall outside the prompt-conditioned setting analyzed in the paper. revision: yes
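One way the promised length dependence could be written is through a textbook Fano-type bound. The block below is a sketch under stated assumptions and is not taken from the manuscript: the task T is uniform over K tasks, the prompt P has length at most L over a vocabulary V, and the estimate of T is computed from P alone. The symbols N_L, K, L, and V are introduced here for illustration.

```latex
% Standard Fano-type bound (textbook result, not the paper's theorem).
% Assumptions: T uniform over K tasks; prompt P of length at most L
% over vocabulary V; \hat{T} is any estimate of T computed from P.
\begin{align*}
  N_L &= \sum_{\ell=0}^{L} |V|^{\ell}, \\
  I(T;P) &\le H(P) \le \log N_L, \\
  \Pr[\hat{T} \ne T] &\ge 1 - \frac{I(T;P) + \log 2}{\log K}
                      \ge 1 - \frac{\log N_L + \log 2}{\log K}.
\end{align*}
```

This particular lower bound is strictly positive whenever K > 2 N_L. For a fixed finite family it therefore becomes vacuous once L is large enough, so a floor at every finite length requires a family whose size is unbounded or grows with L.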

Circularity Check

0 steps flagged

No circularity; bounds derived from explicit modeling assumptions and PAC-Bayes analysis

The paper models User-System interaction as a bilevel cheap-talk game, introduces a decomposition of task inference from execution, and applies PAC-Bayes bounds to separate estimation error from structural limits induced by a capacity-limited language channel. The expressivity floor and objective-misalignment floor are presented as direct consequences of these modeling choices and the assumption that task complexity can exceed channel capacity. No equations reduce to self-definition, no parameters are fitted and relabeled as predictions, and no self-citations or imported uniqueness theorems are invoked as load-bearing steps. The negative result holds conditionally under the stated premises rather than by tautology. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that language is a capacity-limited channel and that task families possess measurable informational complexity that can exceed it; no free parameters or invented entities are stated in the abstract.

axioms (2)

domain assumption: language acts as a capacity-limited communication channel
Invoked to establish the expressivity floor when task complexity exceeds channel capacity.
domain assumption: alignment constraints restrict the admissible output set
Used to derive the objective-misalignment floor when the user-ideal distribution lies outside the feasible class.
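The second axiom can be illustrated numerically. The sketch below is a minimal example, assuming a finite output set, a hard constraint that forbids some outputs outright, and total variation as the distortion measure; the paper may use a different divergence. The distributions and names are hypothetical.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two distributions on outputs."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def project_onto_allowed(p: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Drop inadmissible outputs and renormalise. In total variation this
    is a closest admissible distribution to p."""
    mass = sum(prob for x, prob in p.items() if x in allowed)
    if mass <= 0:
        raise ValueError("ideal distribution has no admissible mass")
    return {x: prob / mass for x, prob in p.items() if x in allowed}


# Hypothetical user-ideal distribution over three outputs; "c" is
# ruled out by an (equally hypothetical) alignment constraint.
ideal = {"a": 0.5, "b": 0.2, "c": 0.3}
feasible = project_onto_allowed(ideal, allowed={"a", "b"})
floor = total_variation(ideal, feasible)
print(round(floor, 6))  # 0.3, the ideal mass on inadmissible outputs
```

Any admissible distribution puts zero mass on the forbidden output, so its total variation distance to the ideal is at least the forbidden mass, and no amount of data moves it below that value.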

pith-pipeline@v0.9.1-grok · 5819 in / 1383 out tokens · 25058 ms · 2026-06-26T09:11:48.666236+00:00 · methodology
