Agentic AI Systems Should Be Designed as Marginal Token Allocators
Pith reviewed 2026-05-09 15:08 UTC · model grok-4.3
The pith
Agentic AI systems should be designed and evaluated as marginal token allocation economies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generators priced by the unit. All four layers solve the same first-order condition—marginal benefit equals marginal cost plus latency cost plus risk cost—with different index sets and different prices. Adopting marginal token allocation as the shared accounting object explains recurring misallocations and defines a research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.
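A minimal toy, our construction and not the paper's, makes the "locally minimize, globally misallocate" point concrete: given a fixed token budget and two layers with diminishing returns, a rigid per-layer quota earns less total benefit than spending each marginal token wherever its benefit is currently highest.

```python
# Toy illustration (ours, not the paper's) of local-versus-global token
# allocation. A rigid even split ignores marginal benefit; a global
# allocator spends each token where its marginal benefit is highest.

def spend(budget, curves, policy):
    counts = [0] * len(curves)
    total = 0.0
    for _ in range(budget):
        if policy == "global":   # equalize marginal benefit across layers
            i = max(range(len(curves)), key=lambda j: curves[j](counts[j]))
        else:                    # rigid even split, blind to marginal benefit
            i = min(range(len(curves)), key=lambda j: counts[j])
        total += curves[i](counts[i])
        counts[i] += 1
    return total

curves = [
    lambda n: 10.0 / (n + 1),  # high-value layer, diminishing returns
    lambda n: 2.0 / (n + 1),   # low-value layer
]

global_total = spend(10, curves, "global")
quota_total = spend(10, curves, "quota")
```

With diminishing returns, the global greedy rule is optimal for the separable sum, so it weakly dominates any fixed quota; here it strictly beats the even split.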
What carries the argument
The first-order condition for marginal token allocation, in which benefit is balanced against cost, latency, and risk across layers that face different prices and index sets.
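That condition can be read as a stopping rule: generate the marginal token only while its benefit covers price plus latency plus risk. The sketch below is a rough numerical rendering; the curves and constants are invented for illustration, since the paper itself supplies no equations.

```python
# Toy stopping rule for the shared first-order condition. All curves and
# constants below are invented for illustration, not taken from the paper.

def allocate_tokens(marginal_benefit, token_price, latency_cost, risk_cost,
                    max_tokens=10_000):
    """Return the number of tokens whose marginal benefit still covers
    the full marginal price (token price + latency + risk)."""
    n = 0
    while n < max_tokens:
        if marginal_benefit(n) < token_price + latency_cost(n) + risk_cost(n):
            break
        n += 1
    return n

# Diminishing returns against flat per-token prices.
tokens = allocate_tokens(
    marginal_benefit=lambda n: 100.0 / (n + 1),
    token_price=0.02,
    latency_cost=lambda n: 0.08,
    risk_cost=lambda n: 0.03,
)
```

In the paper's framing, each layer (router, agent, serving stack, training pipeline) would instantiate this rule with its own benefit curve, index set, and prices.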
Load-bearing premise
The economic marginal-allocation analogy accurately captures the decision problems in each layer and adopting it as the shared accounting object will produce better designs and fewer misallocations without introducing new unmodeled complexities.
What would settle it
An experiment that redesigns the four layers around a shared marginal token allocation objective and measures whether the predicted failure modes decrease compared with current isolated designs.
Original abstract
This position paper argues that agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generators priced by the unit. We follow a single request -- a developer asking a coding agent to fix a failing test -- through four economic layers that today are designed in isolation: a router that decides which model answers, an agent that decides whether to plan, act, verify, or defer, a serving stack that decides how to produce each token, and a training pipeline that decides whether the trace is worth learning from. We show that all four layers are solving the same first-order condition -- marginal benefit equals marginal cost plus latency cost plus risk cost -- with different index sets and different prices. The framing is deliberately minimal: we do not propose a complete theory of AI economics. But adopting marginal token allocation as the shared accounting object explains why systems that locally minimize tokens globally misallocate them, predicts a small set of recurring failure modes (over-routing, over-delegation, under-verification, serving congestion, stale rollouts, cache misuse), and points to a concrete research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper proposes that agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generators priced by the unit. It traces a single developer request (fixing a failing test) through four layers designed in isolation: a router selecting models, an agent choosing actions (plan/act/verify/defer), a serving stack producing tokens, and a training pipeline deciding on trace learning. The central claim is that all four layers solve the same first-order condition—marginal benefit equals marginal cost plus latency cost plus risk cost—with different index sets and prices. The framing is minimal, avoids a complete theory, and uses the lens to explain misallocations and outline a research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.
Significance. If the analogy holds, the paper provides a coherent interpretive lens that unifies design decisions across layers and explains why local token minimization can produce global misallocations, while predicting specific recurring failure modes. It merits explicit credit for its deliberately minimal scope, the forward-looking identification of failure modes (over-routing, over-delegation, under-verification, serving congestion, stale rollouts, cache misuse), and the concrete research agenda without overclaiming derivations or data. As a position paper, its value is prospective and conceptual rather than demonstrated through equations or experiments.
major comments (2)
- Abstract: The assertion that 'we show that all four layers are solving the same first-order condition' is presented without explicit equations, index sets, or derivations for the router, agent, serving stack, or training pipeline. This equivalence is load-bearing for the unification claim, the explanation of misallocations, and the predicted failure modes.
- Abstract and implied layer sections: The paper states that the layers solve the marginal-benefit-equals-marginal-cost-plus-latency-plus-risk condition but supplies no schematic formalization or mapping of decision variables for any layer, leaving the shared accounting object as an asserted analogy rather than a demonstrated equivalence.
minor comments (1)
- Abstract: The term 'priced by the unit' is used without clarifying what the unit refers to in the token-allocation context; a brief parenthetical would improve precision.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the prospective value of the position paper's unifying lens. We address the two major comments below by proposing targeted revisions that clarify the core claim without altering the paper's deliberately minimal scope.
Point-by-point responses
-
Referee: Abstract: The assertion that 'we show that all four layers are solving the same first-order condition' is presented without explicit equations, index sets, or derivations for the router, agent, serving stack, or training pipeline. This equivalence is load-bearing for the unification claim, the explanation of misallocations, and the predicted failure modes.
Authors: We agree that the abstract's use of 'we show' overstates the current presentation, as the manuscript supplies no explicit equations or derivations. As a position paper, the intent is to offer a conceptual lens rather than a complete economic model. We will revise the abstract to replace 'we show' with 'we illustrate that' the layers align on the same first-order condition, and we will add a concise schematic table early in the main text. The table will map, for each layer, the decision variables, index sets, and relevant marginal prices/costs (benefit, latency, risk) without providing full derivations. This makes the shared structure explicit while preserving the paper's minimal framing. revision: yes
-
Referee: Abstract and implied layer sections: The paper states that the layers solve the marginal-benefit-equals-marginal-cost-plus-latency-plus-risk condition but supplies no schematic formalization or mapping of decision variables for any layer, leaving the shared accounting object as an asserted analogy rather than a demonstrated equivalence.
Authors: The lack of a schematic mapping is a fair critique that leaves the unification as an asserted parallel. We will introduce a short formalization subsection (or figure) that supplies a high-level mapping of decision variables for each layer to the marginal condition. For instance, the router's index set is over candidate models and token budgets; the agent's is over action types (plan/act/verify/defer) with associated latency and risk penalties; similar mappings will be sketched for the serving stack and training pipeline. This converts the analogy into an explicit, if schematic, equivalence without expanding into a full theory. revision: yes
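The schematic mapping promised here could look roughly like the following sketch. Every field name and number is our hypothetical rendering of what such a table might contain, not content from the paper.

```python
# Hypothetical rendering of the proposed layer-to-condition mapping.
# All names and numbers are illustrative, not taken from the paper.

LAYERS = {
    "router":   {"decides": "which model answers",          "index_set": "candidate models"},
    "agent":    {"decides": "plan / act / verify / defer",  "index_set": "action types"},
    "serving":  {"decides": "how to produce each token",    "index_set": "batching and caching choices"},
    "training": {"decides": "learn from this trace or not", "index_set": "candidate traces"},
}

def worth_marginal_token(benefit, cost, latency, risk):
    """Shared first-order condition: spend the token iff marginal benefit
    covers the full marginal price (cost + latency + risk)."""
    return benefit >= cost + latency + risk

# Same rule, layer-specific prices: a router under a loose latency budget
# answers with the expensive model; an agent under a tight one defers.
route_big_model = worth_marginal_token(benefit=0.9, cost=0.2, latency=0.3, risk=0.3)
defer_instead = worth_marginal_token(benefit=0.9, cost=0.2, latency=0.6, risk=0.3)
```

The point of such a table is exactly the rebuttal's: one condition, four price vectors, four index sets.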
Circularity Check
No significant circularity; position paper proposes interpretive analogy without formal derivation or fitted predictions
Full rationale
The paper is explicitly a position paper offering marginal token allocation as a shared accounting lens rather than a derived mathematical identity. It asserts that the four layers solve the same first-order condition (marginal benefit equals marginal cost plus latency plus risk) but supplies no equations, index sets, derivations, or data. No predictions are generated from fitted parameters, no self-definitional loops exist, and no load-bearing self-citations or uniqueness theorems are invoked. The central claim functions as a proposed framing that explains misallocations and suggests research directions, remaining self-contained against external benchmarks with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The four layers of agentic AI systems solve equivalent marginal benefit-equals-cost conditions
invented entities (1)
-
marginal token allocation economy (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Taming throughput-latency tradeoff in llm inference with sarathi-serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve, 2024. URL https://arxiv.org/abs/2403.02310
-
[2]
Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. Annual Meeting of the Association for Computational Linguistics, 2024
2024
-
[3]
The market for “lemons”: Quality uncertainty and the market mechanism
George A Akerlof. The market for “lemons”: Quality uncertainty and the market mechanism. Quarterly Journal of Economics, 84(3):488–500, 1970
1970
-
[4]
Production, information costs, and economic organization
Armen A Alchian and Harold Demsetz. Production, information costs, and economic organization. The American Economic Review, 62(5):777–795, 1972
1972
-
[5]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
-
[6]
Language models are few-shot learners
Tom B Brown et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020
2020
-
[7]
Srt: Accelerating reinforcement learning via speculative rollout with tree-structured cache
Chi-Chih Chang, Siqi Zhu, Zhichen Zeng, Haibin Lin, Jiaxuan You, Mohamed S. Abdelfattah, Ziheng Jiang, and Xuehai Qian. Srt: Accelerating reinforcement learning via speculative rollout with tree-structured cache, 2026. URL https://arxiv.org/abs/2601.09083
-
[8]
Frugalgpt: How to use large language models while reducing cost and improving performance, 2023
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance, 2023. URL https://arxiv.org/abs/2305.05176
2023
-
[9]
The nature of the firm
Ronald H Coase. The nature of the firm. Economica, 4(16):386–405, 1937
1937
-
[10]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
-
[11]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
-
[12]
Investment Under Uncertainty
Avinash K Dixit and Robert S Pindyck. Investment Under Uncertainty. Princeton University Press, 1994
1994
-
[13]
Efficient llm scheduling by learning to rank
Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, and Hao Zhang. Efficient llm scheduling by learning to rank, 2024. URL https://arxiv.org/abs/2408.15792
-
[14]
Efficiently scaling llm reasoning with certaindex
Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, and Hao Zhang. Efficiently scaling llm reasoning with certaindex, 2025. URL https://arxiv.org/abs/2412.20993
-
[15]
Training compute-optimal large language models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. Advances in Neural Information Processing Systems, 2022
2022
-
[16]
Moral hazard and observability
Bengt Holmström. Moral hazard and observability. The Bell Journal of Economics, pages 74–91, 1979
1979
-
[17]
Routerbench: A benchmark for multi-llm routing system
Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system. arXiv preprint arXiv:2403.12031, 2024
-
[18]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
-
[19]
Risk, Uncertainty, and Profit
Frank H Knight. Risk, Uncertainty, and Profit. Houghton Mifflin, 1921
1921
-
[20]
Efficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023
2023
-
[21]
The theory of incentives: The principal-agent model
Jean-Jacques Laffont and David Martimort. The theory of incentives: The principal-agent model. Princeton University Press, 2002
2002
-
[22]
Fast inference from transformers via speculative decoding
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, 2023
2023
-
[23]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 2020
2020
-
[24]
Let’s verify step by step
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. International Conference on Learning Representations, 2024
2024
-
[25]
Agentbench: Evaluating llms as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, et al. Agentbench: Evaluating llms as agents. In International Conference on Learning Representations, 2024
2024
-
[26]
Self-refine: Iterative refinement with self-feedback
Aman Madaan et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 2023
2023
-
[27]
Portfolio selection
Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952
1952
-
[28]
Microeconomic Theory
Andreu Mas-Colell, Michael D Whinston, and Jerry R Green. Microeconomic Theory. Oxford University Press, 1995
1995
-
[29]
The optimal structure of incentives and authority within an organization
James A Mirrlees. The optimal structure of incentives and authority within an organization. The Bell Journal of Economics, pages 105–131, 1976
1976
-
[30]
RouteLLM: Learning to Route LLMs with Preference Data
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024. URL https://arxiv.org/abs/2406.18665
-
[31]
Openai o1 system card
OpenAI. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
-
[32]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022
2022
-
[33]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023
2023
-
[34]
Splitwise: Efficient generative llm inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024
2024
-
[35]
The Economics of Welfare
Arthur Cecil Pigou. The Economics of Welfare. Macmillan, 1920
1920
-
[36]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023
2023
-
[37]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Janvi Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023
2023
-
[38]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
-
[39]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023
2023
-
[40]
A contribution to the theory of economic growth
Robert M Solow. A contribution to the theory of economic growth. The Quarterly Journal of Economics, 70(1):65–94, 1956
1956
-
[41]
Job market signaling
Michael Spence. Job market signaling. Quarterly Journal of Economics, 87(3):355–374, 1973
1973
-
[42]
The Theory of Industrial Organization
Jean Tirole. The Theory of Industrial Organization. MIT Press, 1988
1988
-
[43]
Congestion theory and transport investment
William S Vickrey. Congestion theory and transport investment. The American Economic Review, 59(2):251–260, 1969
1969
-
[44]
Voyager: An open-ended embodied agent with large language models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. In Transactions on Machine Learning Research, 2024
2024
-
[45]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023
2023
-
[46]
Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024
2024
-
[47]
Opentinker: Separating concerns in agentic reinforcement learning
Siqi Zhu and Jiaxuan You. Opentinker: Separating concerns in agentic reinforcement learning
-
[48]
URL https://arxiv.org/abs/2601.07376

Appendix A: Open Problems
The framework leaves a focused set of open problems. (1) Estimation of ∆Qi from logs via causal inference / off-policy evaluation [32], with calibrated variance. (2) Risk pricing: an empirical proxy for ρ∆Ri that incorporates the Knightian component of Section 2. (3) Mechanism-design routing: do incentiv...