pith. machine review for the scientific record.

arxiv: 2605.09365 · v1 · submitted 2026-05-10 · 💻 cs.AI · cs.CL

Recognition: no theorem link

Position: Avoid Overstretching LLMs for every Enterprise Task

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords enterprise AI · LLM limitations · modular architectures · knowledge bases · symbolic procedures · structured extraction · deterministic workflows · monolithic vs modular

The pith

Language models should serve only as extraction interfaces in enterprise workflows, with knowledge and computation handled by dedicated knowledge bases and symbolic systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise workloads consist mainly of deterministic, structured tasks that demand high reliability, low latency, and broad knowledge under tight constraints. Relying on LLMs as complete monolithic solutions is, the paper argues, inefficient and unreliable because models have finite capacity and cannot encompass all required knowledge. The paper positions LLMs as tools for structured data extraction only, delegating storage and processing to external symbolic components. This modular design is argued to offer better reliability, scalability, and transparency, and theoretical arguments are offered for inherent limits on what finite models can achieve in such settings.

Core claim

Finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks, creating inherent limits to efficiency and interpretability. Therefore, language models should primarily be used for structured extraction in deterministic enterprise workflows, while computation and storage are delegated to knowledge bases and symbolic procedures, resulting in modular architectures that are more reliable and maintainable than monolithic frameworks.

What carries the argument

The modular architecture that treats language models as interfaces for structured extraction, externalizing knowledge to dedicated bases and computation to symbolic procedures.
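The division of labor described above can be sketched as a minimal pipeline. Everything here is a hypothetical illustration, not the paper's implementation: the schema, the rate table, and `extract_fields` (which stands in for an LLM call whose only job is to fill the schema) are invented for the sketch.

```python
from dataclasses import dataclass

# Hypothetical structured schema the language model is asked to fill.
@dataclass
class InvoiceFields:
    customer_id: str
    product_code: str
    quantity: int

def extract_fields(text: str) -> InvoiceFields:
    # Stand-in for the LLM call: map free text to the schema and nothing else.
    # A real system would call a model API here with a structured-output prompt.
    pairs = dict(p.strip().split("=") for p in text.split(";"))
    return InvoiceFields(
        customer_id=pairs["customer"],
        product_code=pairs["product"],
        quantity=int(pairs["qty"]),
    )

# Knowledge lives outside the model, in a dedicated store (here, a dict;
# in practice, a database or knowledge base).
PRICE_TABLE = {"WIDGET-9": 4.50, "GEAR-2": 12.00}

def price_invoice(fields: InvoiceFields) -> float:
    # Computation is a deterministic symbolic procedure, not a model call.
    return PRICE_TABLE[fields.product_code] * fields.quantity

fields = extract_fields("customer=ACME; product=WIDGET-9; qty=10")
total = price_invoice(fields)
print(total)  # 45.0
```

The point of the sketch is the boundary: if the rate table changes, only `PRICE_TABLE` is updated; the model is never retrained, and pricing errors are auditable against the table rather than buried in model weights.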

Load-bearing premise

Enterprise workloads are dominated by deterministic, structured, knowledge-dependent tasks under strict cost, latency, and reliability constraints that finite models cannot handle.

What would settle it

Demonstrating a real enterprise workflow where a single fine-tuned LLM matches or exceeds the reliability, cost, and latency of a modular extraction-plus-knowledge-base system while handling equivalent knowledge breadth.
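A head-to-head test of that kind could be scored with a simple harness. The two systems below (`monolithic_answer`, `modular_answer`) are placeholder callables; a real comparison would wire in the fine-tuned LLM and the extraction-plus-knowledge-base pipeline respectively, and would add cost accounting alongside accuracy and latency.

```python
import time

def evaluate(system, cases):
    """Score a system on accuracy and wall-clock latency over labeled cases."""
    correct, latencies = 0, []
    for query, expected in cases:
        start = time.perf_counter()
        answer = system(query)
        latencies.append(time.perf_counter() - start)
        correct += (answer == expected)
    return {
        "accuracy": correct / len(cases),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
    }

# Placeholder systems standing in for the two architectures under test.
monolithic_answer = lambda q: q.upper()
modular_answer = lambda q: q.upper()

cases = [("alpha", "ALPHA"), ("beta", "BETA")]
print(evaluate(monolithic_answer, cases)["accuracy"])  # 1.0
```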

Figures

Figures reproduced from arXiv: 2605.09365 by Anson Bastos, Isaiah Onando Mulang', Kuldeep Singh.

Figure 1: The Proposed Specialised-SLM Enterprise Agentic Workflow.
read the original abstract

Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks operating under strict cost, latency, and reliability constraints. While these are often addressed through large language model (LLM) deployment or distillation into smaller models, we argue this is inefficient, unreliable, and misaligned with enterprise task structures. Instead, AI systems should treat language models as interfaces rather than monolithic engines, externalizing knowledge and computation into dedicated components for greater reliability, scalability, and transparency. Our theoretical evidences show that finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks, creating inherent limits to efficiency and interpretability. Building on this, we take the position that language models should primarily be used for structured extraction in deterministic enterprise workflows, while computation and storage are delegated to knowledge bases and symbolic procedures. We formally demonstrate that such modular architectures are more reliable and maintainable than monolithic frameworks, offering a sustainable foundation for enterprise tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that enterprise workloads consist primarily of deterministic, structured, and knowledge-dependent tasks subject to strict cost, latency, and reliability constraints. It argues that deploying LLMs (or distilling them) for these tasks is inefficient and unreliable, and instead advocates treating LLMs solely as interfaces for structured extraction while delegating computation and storage to knowledge bases and symbolic procedures. The authors assert that finite-capacity models cannot capture enterprise knowledge breadth and claim to provide theoretical evidence and a formal demonstration that modular architectures are more reliable and maintainable than monolithic LLM frameworks.

Significance. If the position is substantiated with the promised evidence, it could meaningfully shape enterprise AI deployment practices by encouraging hybrid modular designs that prioritize reliability, transparency, and scalability over end-to-end LLM usage. The argument addresses a timely practical concern in applied AI and could stimulate discussion on architectural choices in constrained environments. However, the current manuscript supplies no supporting formal content, limiting its immediate contribution to the literature.

major comments (2)
  1. [Abstract] The manuscript asserts 'theoretical evidences' and a 'formal demonstration' that finite-capacity models cannot capture enterprise knowledge breadth and that modular architectures are more reliable, yet the text contains no equations, proofs, theorems, empirical data, or derivations to support these central claims. This is load-bearing because the position rests entirely on the unshown arguments rather than on general premises alone.
  2. [Abstract] The foundational premise that 'Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks' is stated without references, statistics, or case studies; it directly underpins the recommendation to restrict LLMs to extraction roles and is therefore load-bearing for the architectural position.
minor comments (1)
  1. [Abstract] The phrasing 'theoretical evidences' is grammatically nonstandard and should be revised to 'theoretical evidence' or 'theoretical arguments'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical relevance of the position. We agree that the abstract's phrasing overpromises on formality and that the core premise requires better grounding. We will revise the manuscript to address these issues while preserving its nature as a position paper.

read point-by-point responses
  1. Referee: [Abstract] The manuscript asserts 'theoretical evidences' and a 'formal demonstration' that finite-capacity models cannot capture enterprise knowledge breadth and that modular architectures are more reliable, yet the text contains no equations, proofs, theorems, empirical data, or derivations to support these central claims. This is load-bearing because the position rests entirely on the unshown arguments rather than on general premises alone.

    Authors: We accept this criticism. The manuscript is a position paper whose arguments rest on conceptual reasoning about model capacity limits and the mismatch between monolithic LLMs and structured enterprise tasks, rather than on new theorems or experiments. We will revise the abstract to replace 'theoretical evidences' and 'formal demonstration' with 'conceptual arguments' and 'reasoned analysis'. The main text will be expanded with additional elaboration on these points and citations to existing literature on neural network capacity and hybrid symbolic-neural systems. We do not believe formal proofs are necessary or appropriate for this format, but we will make the supporting logic more explicit. revision: partial

  2. Referee: [Abstract] The foundational premise that 'Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks' is stated without references, statistics, or case studies; it directly underpins the recommendation to restrict LLMs to extraction roles and is therefore load-bearing for the architectural position.

    Authors: This observation is correct. The premise is based on patterns from enterprise deployments and industry practice, but the draft provides no supporting citations. In revision we will add references to relevant surveys, reports on robotic process automation adoption, and studies of knowledge-intensive workflows to substantiate the claim. We will also qualify the language if needed to reflect that the dominance holds for many, though not all, enterprise tasks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; position paper rests on explicit premises

full rationale

The manuscript is a position paper whose central recommendation (LLMs as extraction interfaces with externalized knowledge and symbolic components) is advanced from stated premises about deterministic enterprise tasks, strict constraints, and finite model capacity. The abstract's references to 'theoretical evidences' and 'formal demonstration' are argumentative summaries of the position rather than mathematical derivations, equations, or fitted quantities. No self-citations, ansatzes, uniqueness theorems, or renamings appear in the provided text that reduce any claim to its own inputs by construction. The argument is self-contained against external benchmarks of task structure and model limits, with no load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract only. The position rests on the premise that model capacity is fundamentally insufficient for enterprise knowledge breadth and that modular delegation is superior; neither premise is given further justification there.

axioms (1)
  • domain assumption Finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks
    Invoked to explain inherent limits to efficiency and interpretability.
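The finite-capacity axiom can be made concrete with a back-of-the-envelope storage bound. All numbers below are illustrative assumptions, not figures from the paper: a model with P parameters stored at an effective b bits each can encode at most P·b bits, which can be compared against the raw size of an enterprise knowledge base.

```python
def model_capacity_bits(n_params: float, bits_per_param: float = 2.0) -> float:
    # bits_per_param is an assumed effective storage rate, not a measured one.
    return n_params * bits_per_param

def kb_size_bits(n_facts: float, bits_per_fact: float = 512.0) -> float:
    # e.g. product records or policy clauses; sizes are illustrative.
    return n_facts * bits_per_fact

slm = model_capacity_bits(3e9)  # a 3B-parameter "small" model
kb = kb_size_bits(1e8)          # 100 million enterprise records
print(kb > slm)  # True: the KB alone exceeds the model's storage bound
```

Under these assumed numbers the knowledge base (~5.1e10 bits) exceeds the model bound (~6e9 bits) by roughly an order of magnitude, which is the shape of the argument the axiom gestures at: parametric storage cannot absorb the KB, so the KB must stay external.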

pith-pipeline@v0.9.0 · 5456 in / 1243 out tokens · 84502 ms · 2026-05-12T03:24:55.426912+00:00 · methodology

discussion (0)

