
arxiv: 2605.17410 · v1 · pith:3RH46JCInew · submitted 2026-05-17 · 💻 cs.AI

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

Pith reviewed 2026-05-20 12:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords token economics · computational challenges · AI system design · LLM inference · resource allocation · economic theory · trilemma · real-time systems

The pith

Computational feasibility is the governing constraint in applying token economics to AI inference systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that token economics in large language model systems is limited primarily by computational challenges rather than pure economic theory. It identifies fundamental tensions between achieving fine-grained valuation of tokens, maintaining low-latency execution, and ensuring optimal resource allocation under uncertainty. To address this, the authors introduce the concept of Computational Token Economics and the Token Economics Trilemma as a way to frame these inherent trade-offs. A sympathetic reader would care because these issues determine whether economic principles can practically guide the design of scalable AI infrastructure. The paper categorizes challenges into real-time value accounting, constrained resource allocation, and economic-aware system architecture, setting out an agenda for future research at this intersection.

Core claim

We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. We introduce the Token Economics Trilemma as a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality.

What carries the argument

The Token Economics Trilemma, which structures the problem space by capturing trade-offs among granularity in valuation, real-time performance in execution, and optimality in allocation under uncertainty.
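One side of this trade-off, granularity against real-time performance, can be illustrated with a toy model. The sketch below is not the paper's formalization: it assumes each valuation costs a fixed amount of time and that valuations run serially, and the names `max_granularity` and `tokens_per_group` and all numbers are hypothetical.

```python
import math


def max_granularity(num_tokens: int, cost_per_valuation_us: float,
                    latency_budget_us: float) -> int:
    """Largest number of separately valued token groups that fits the budget.

    Toy assumption: every valuation takes the same fixed time and
    valuations are performed one after another.
    """
    affordable = int(latency_budget_us // cost_per_valuation_us)
    # At least one group (the whole sequence), at most one group per token.
    return max(1, min(num_tokens, affordable))


def tokens_per_group(num_tokens: int, granularity: int) -> int:
    """Size of each valued group when tokens are split as evenly as possible."""
    return math.ceil(num_tokens / granularity)


if __name__ == "__main__":
    # Tighter latency budgets force coarser valuation of a 4096-token context.
    for budget_us in (100.0, 1_000.0, 10_000.0):
        g = max_granularity(4096, 5.0, budget_us)
        print(budget_us, g, tokens_per_group(4096, g))
```

Under these assumptions, per-token valuation is only reachable once the latency budget covers one valuation per token; anything tighter forces tokens to be valued in groups.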

If this is right

  • Real-time value accounting systems must be developed to track token values at fine granularity without excessive overhead.
  • Constrained resource allocation algorithms are needed that optimize under uncertainty while respecting latency requirements.
  • AI system architectures should incorporate economic awareness to better manage token-based resource decisions.
  • The trilemma suggests that improving one aspect of the system will likely require compromises in the others.
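The constrained-allocation item can be made concrete with a minimal sketch. It assumes each request has a known, constant value per token, which is exactly the certainty the paper says real systems lack, so it shows only the easy baseline. `Request`, `value_per_token`, `max_tokens`, and `allocate_tokens` are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    value_per_token: float  # hypothetical estimate of marginal value per token
    max_tokens: int         # most tokens this request can usefully consume


def allocate_tokens(requests: List[Request], budget: int) -> Dict[str, int]:
    """Greedily grant tokens to the highest value-per-token requests first.

    With constant, known per-token values this greedy rule maximizes total
    value; with uncertain or changing values it is only a heuristic.
    """
    allocation = {r.name: 0 for r in requests}
    remaining = budget
    for r in sorted(requests, key=lambda r: r.value_per_token, reverse=True):
        if remaining <= 0:
            break
        grant = min(r.max_tokens, remaining)
        allocation[r.name] = grant
        remaining -= grant
    return allocation


if __name__ == "__main__":
    demo = [
        Request("chat", 2.0, 100),
        Request("agent", 5.0, 50),
        Request("batch", 1.0, 200),
    ]
    print(allocate_tokens(demo, 120))
```

The sort makes each decision cost O(n log n) in the number of requests, so even this baseline has a latency price that grows with how finely requests are distinguished.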

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future AI designs might need to prioritize certain trade-offs based on application needs, such as favoring speed in interactive systems.
  • This framework could extend to other AI components beyond tokens, like attention mechanisms or model parameters.
  • Empirical studies simulating the trilemma in actual inference setups could quantify the trade-off curves.
  • Integration with existing economic models from distributed computing might yield hybrid solutions.

Load-bearing premise

The tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty are fundamental and irreducible rather than solvable through future engineering advances.

What would settle it

A demonstration of an AI inference system that simultaneously achieves high-granularity token valuation, sub-millisecond latency for decisions, and provably optimal allocation despite uncertainty would falsify the trilemma.

Figures

Figures reproduced from arXiv: 2605.17410 by Ou Wu, Yingjun Deng.

Figure 1: The Token Economics Impossibility Triangle. Token-economic systems face a three-way …
Figure 2: Interface between the three crucial challenges. Challenge I (Sensing) produces a value …
Figure 3: Systematic directional biases of common value proxies relative to true economic value …
Figure 4: Regimes where fine-grained token economics is advantageous (green) versus regimes where …
Figure 5: Contrast between standard PagedAttention (left) and a Value-Aware extension (right). …
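The Figure 5 caption is truncated here, so the paper's actual Value-Aware design is not recoverable from this page. As a purely illustrative guess at what value-aware cache management might involve, the sketch below evicts the lowest-valued cache blocks once a capacity limit is exceeded. The function name, the block ids, and the idea of a single scalar value per block are all assumptions.

```python
import heapq
from typing import Dict, List


def evict_lowest_value(block_values: Dict[str, float], capacity: int) -> List[str]:
    """Return ids of the blocks to evict so that at most `capacity` remain.

    Hypothetical policy: each cache block carries one scalar value estimate,
    and the lowest-valued blocks are evicted first.
    """
    excess = len(block_values) - capacity
    if excess <= 0:
        return []
    # nsmallest returns the `excess` lowest-valued (id, value) pairs in ascending order.
    lowest = heapq.nsmallest(excess, block_values.items(), key=lambda kv: kv[1])
    return [block_id for block_id, _ in lowest]


if __name__ == "__main__":
    values = {"b0": 0.9, "b1": 0.1, "b2": 0.5, "b3": 0.3}
    print(evict_lowest_value(values, capacity=2))
```

A recency-based policy would need no value estimates at all; the cost of producing and refreshing `block_values` at decode time is where the real-time value accounting challenge would enter.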
Original abstract

Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of Computational Token Economics and propose the Token Economics Trilemma, a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that computational feasibility is the governing constraint in token economics for AI systems, driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. It introduces 'Computational Token Economics' and the 'Token Economics Trilemma' as a conditional no-free-lunch principle to structure the problem space, categorizes challenges into real-time value accounting, constrained resource allocation, and economic-aware system architecture, and positions the work as defining a research agenda rather than providing complete solutions or empirical demonstrations.

Significance. If the trilemma framing and challenge categorization prove useful in guiding future interdisciplinary research, this could be a significant contribution by highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure. The paper correctly identifies an emerging area and avoids overclaiming by explicitly stating it presents no complete solution. However, the absence of formal derivations, proofs, data, or demonstrations that the identified trade-offs are irreducible rather than engineering-contingent limits its immediate technical impact; its primary value is agenda-setting.

major comments (2)
  1. [Abstract] The central claim that computational feasibility is the 'governing constraint' because the challenges are 'driven by fundamental tensions' and the Token Economics Trilemma captures 'inherent trade-offs' among granularity, real-time performance, and optimality rests on an unformalized premise. The manuscript introduces the trilemma by definition as a 'conditional no-free-lunch principle' without a mathematical statement, proof of conditional necessity, or concrete counter-examples showing why advances in algorithms or system architectures cannot relax one or more constraints simultaneously. This is load-bearing for the governing-constraint argument.
  2. [Challenge categorization sections] The categorization of challenges (real-time value accounting, constrained resource allocation, economic-aware architecture) is presented as structuring the problem space, but without reference to specific existing LLM inference implementations, performance measurements, or attempted mitigations, it is difficult to assess whether the tensions are fundamental or contingent on current design choices.
minor comments (1)
  1. [Abstract] The abstract and introduction could more explicitly reference motivating examples from current LLM serving systems (e.g., specific token pricing or scheduling mechanisms) to ground the trilemma in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We appreciate the recognition that the manuscript is positioned as an agenda-setting contribution rather than a complete technical solution. Below we respond to each major comment and indicate planned revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The central claim that computational feasibility is the 'governing constraint' because the challenges are 'driven by fundamental tensions' and the Token Economics Trilemma captures 'inherent trade-offs' among granularity, real-time performance, and optimality rests on an unformalized premise. The manuscript introduces the trilemma by definition as a 'conditional no-free-lunch principle' without a mathematical statement, proof of conditional necessity, or concrete counter-examples showing why advances in algorithms or system architectures cannot relax one or more constraints simultaneously. This is load-bearing for the governing-constraint argument.

    Authors: We agree that the Token Economics Trilemma is introduced as a conceptual organizing principle rather than a formally derived theorem with proofs or exhaustive counter-examples. The manuscript explicitly states its goal is to define a research agenda and highlight open problems at the intersection of fields, not to deliver complete formal resolutions. In revision we will expand the abstract and introduction to more explicitly discuss the conditional assumptions behind the trilemma, clarify that it is offered as a conditional no-free-lunch framing analogous to other conceptual trilemmas in systems research, and add brief illustrative examples from existing token-based inference pipelines to show how the three dimensions interact in practice. We will not add a full mathematical proof, as that would exceed the paper's stated scope, but the added discussion will better motivate why formalization is a valuable direction for future work. revision: partial

  2. Referee: [Challenge categorization sections] The categorization of challenges (real-time value accounting, constrained resource allocation, economic-aware architecture) is presented as structuring the problem space, but without reference to specific existing LLM inference implementations, performance measurements, or attempted mitigations, it is difficult to assess whether the tensions are fundamental or contingent on current design choices.

    Authors: We accept that referencing concrete LLM inference systems would help readers distinguish fundamental tensions from those tied to particular engineering choices. In the revised manuscript we will incorporate targeted references to representative implementations (for example, continuous batching techniques, speculative decoding, and memory-management strategies in popular serving frameworks) and briefly describe how their observed performance characteristics map onto the three challenge areas. These additions will be kept concise and will not shift the paper away from its agenda-setting purpose; they will serve only to ground the categorization in current practice. revision: yes

Circularity Check

0 steps flagged

No circularity: position paper introduces Token Economics Trilemma by definition as research agenda

Full rationale

The manuscript is explicitly a position paper that defines Computational Token Economics and proposes the Token Economics Trilemma to structure open challenges. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The central framing is introduced directly as a conditional no-free-lunch principle capturing tensions among granularity, latency, and optimality; this is definitional rather than reduced from prior inputs, self-citations, or ansatzes. No self-citation load-bearing steps, uniqueness theorems imported from the authors, or renamings of known results are present. The paper positions itself as identifying a research agenda rather than deriving results that collapse to its own assumptions by construction, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper rests on the domain assumption that token usage in LLMs can be usefully modeled with economic primitives and that the listed tensions are structural rather than temporary engineering limits.

axioms (1)
  • domain assumption: Token economics principles apply directly to resource allocation in real-time LLM inference systems
    Invoked when stating that token economics has emerged as a lens and when defining the trilemma as governing constraint.
invented entities (2)
  • Token Economics Trilemma (no independent evidence)
    purpose: To capture inherent trade-offs among granularity, real-time performance, and optimality
    Introduced as a conditional no-free-lunch principle without independent evidence or derivation.
  • Computational Token Economics (no independent evidence)
    purpose: To structure the problem space of implementing token-economic principles in AI infrastructure
    New label for the intersection of computational economics and AI systems.

pith-pipeline@v0.9.0 · 5739 in / 1392 out tokens · 41006 ms · 2026-05-20T12:56:52.836128+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. ... the Token Economics Trilemma — a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    Proposition 3.1 (Token Economics Trilemma, Informal). ... no online policy can simultaneously achieve Granularity: G=N ... Real-time: R≤1 ... Optimality: O=o(1)

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 16 internal anchors

  1. [1]

    Token Economics for LLM Agents: A Dual-View Study from Computing and Economics

    Y. Chen, J. Chen, C. He, Y. Li, Y. Ji, Y. Wu, D. Yang, L. Diao, L. Shou, H. Zhang, H. Li, and G. Chen, “Token economics for LLM agents: A dual-view study from computing and economics,” arXiv preprint arXiv:2605.09104, 2026

  2. [2]

    Agentic AI Systems Should Be Designed as Marginal Token Allocators

    S. Zhu, “Agentic AI systems should be designed as marginal token allocators,” arXiv preprint arXiv:2605.01214, 2026

  3. [3]

    Token Is All You Price

    W. Zhong, “Token is all you price,” arXiv preprint arXiv:2510.09859, 2025

  4. [4]

    The economics of large language models: Token allocation, fine-tuning, and optimal pricing

    D. Bergemann, A. Bonatti, and A. Smolin, “The economics of large language models: Token allocation, fine-tuning, and optimal pricing,” in Proceedings of the 26th ACM Conference on Economics and Computation, 2025

  5. [5]

    AI token futures market: Commoditization of compute and derivatives contract design

    Y. Xing, “AI token futures market: Commoditization of compute and derivatives contract design,” arXiv preprint arXiv:2603.21690, 2026

  6. [6]

    Shapley-Coop: Credit assignment for emergent cooperation in self-interested LLM agents

    Y. Hua, H. Chen, S. Wang, W. Li, X. Wang, and J. Luo, “Shapley-Coop: Credit assignment for emergent cooperation in self-interested LLM agents,” in NeurIPS, 2025

  7. [7]

    TokenButler: Token Importance is Predictable

    Y. Akhauri, A. F. AbouElhamayed, Y. Gao, C.-C. Chang, N. Jain, and M. S. Abdelfattah, “TokenButler: Token importance is predictable,” arXiv preprint arXiv:2503.07518, 2025

  8. [8]

    TokenShapley: Token level context attribution with Shapley value

    Y. Xiao, Y. Zhu, S. Samyoun, W. Zhang, J. T. Wang, and J. Du, “TokenShapley: Token level context attribution with Shapley value,” in Findings of ACL, 2025, pp. 3882–3894

  9. [9]

    Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

    M. Xu, Q. Luo, and K. Li, “Utility-aware data pricing: Token-level quality and empirical training gain for LLMs,” arXiv preprint arXiv:2604.22893, 2026

  10. [10]

    Is your LLM overcharging you? Tokenization, transparency, and incentives

    A. A. Velasco, S. Tsirtsis, N. Okati, and M. Gomez-Rodriguez, “Is your LLM overcharging you? Tokenization, transparency, and incentives,” in ICML 2025 Workshop on Tokenization (TokShop), 2025

  11. [11]

    CoIn: Counting the invisible reasoning tokens in commercial opaque LLM APIs

    G. Sun, Z. Wang, B. Tian, M. Liu, Z. Shen, S. He, Y. He, W. Ye, Y. Wang, and A. Li, “CoIn: Counting the invisible reasoning tokens in commercial opaque LLM APIs,” arXiv preprint arXiv:2505.13778, 2025

  12. [12]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    M. Reid et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

  13. [13]

    The Llama 3 Herd of Models

    A. Dubey et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024. [14] RULER: What’s the Real Context Size of Your Long-Context Language Models?, 2024

  14. [14]

    Leave no context behind: Efficient infinite context transformers with Infini-attention

    T. Munkhdalai, M. Faruqui, and S. Gopal, “Leave no context behind: Efficient infinite context transformers with Infini-attention,” arXiv preprint arXiv:2404.07143, 2024

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

  16. [16]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv:2501.12599, 2025

  17. [17]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling LLM test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024

  18. [18]

    SARATHI-SERVE: Efficient LLM inference by piggybacking decodes with chunked prefills

    A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “SARATHI-SERVE: Efficient LLM inference by piggybacking decodes with chunked prefills,” arXiv preprint arXiv:2403.02310, 2024

  19. [19]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” arXiv preprint arXiv:2401.09670, 2024

  20. [20]

    Kivi: A tuning-free asymmetric 2bit quantization for kv cache

    Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for kv cache,” in ICML, 2024

  21. [21]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen, “SnapKV: LLM knows what you are looking for before generation,” arXiv preprint arXiv:2404.14469, 2024

  22. [22]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI, “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” arXiv preprint arXiv:2405.04434, 2024

  23. [23]

    Arbitrage: Efficient reasoning via advantage-aware speculation

    M. Maheswaran, R. Tiwari, Y. Hu, K. Dilmen, C. Hooper, H. Xi, N. Lee, M. Farajtabar, M. W. Mahoney, K. Keutzer, and A. Gholami, “Arbitrage: Efficient reasoning via advantage-aware speculation,” arXiv preprint arXiv:2512.05033, 2025

  24. [24]

    TopLoc: A locality sensitive hashing scheme for trustless verifiable inference

    J. M. Ong, M. D. Ferrante, A. Pazdera, R. Garner, S. Jaghouar, M. Basra, M. Ryabinin, and J. Hagemann, “TopLoc: A locality sensitive hashing scheme for trustless verifiable inference,” arXiv preprint arXiv:2501.16007, 2025

  25. [25]

    Inference economics: A new paradigm for the economics of artificial intelligence

    BRASS DIGITAL LAB, “Inference economics: A new paradigm for the economics of artificial intelligence,” BRASS DIGITAL LAB, Tech. Rep., 2026

  26. [26]

    Efficient memory management for large language model serving with PagedAttention

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023

  27. [27]

    SGLang: Efficient execution of structured language model programs

    L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng, “SGLang: Efficient execution of structured language model programs,” in NeurIPS, 2024, pp. 62557–62583

  28. [28]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” in ICLR, 2024

  29. [29]

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao, “FlashAttention-3: Fast and accurate attention with asynchrony and low-precision,” arXiv preprint arXiv:2407.08608, 2024

  30. [30]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen, “H2O: Heavy-hitter oracle for efficient generative inference of large language models,” in NeurIPS, 2023

  31. [31]

    Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

    Z. Liu, A. Desai, F. Lian, H. Wang, H. Xie, Y. Zhang, T. Chen, and Z. Wang, “Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time,” in NeurIPS, 2023

  32. [32]

    Efficient streaming language models with attention sinks

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in ICLR, 2024

  33. [33]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Z. Cai, Y. Zhang, B. Gao, Y. Liu, T. Li, K. Liu, H. Lin, X. Lu, and S. Han, “PyramidKV: Dynamic KV cache compression based on pyramidal information funneling,” arXiv preprint arXiv:2406.02069, 2024

  34. [34]

    Fast inference from transformers via speculative decoding

    Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in ICML, 2023

  35. [35]

    SpecInfer: Accelerating generative large language model serving with tree-based speculative inference and verification

    X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia, “SpecInfer: Accelerating generative large language model serving with tree-based speculative inference and verification,” in ASPLOS, 2024, pp. 932–949

  36. [36]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, “Medusa: Simple LLM inference acceleration framework with multiple decoding heads,” in ICML, 2024

  37. [37]

    Eagle: Speculative sampling requires rethinking feature uncertainty

    Y. Li, F. Wei, C. Zhang, and H. Zhang, “Eagle: Speculative sampling requires rethinking feature uncertainty,” in ICML, 2024

  38. [38]

    Lookahead decoding: Lossless generation acceleration for large language models

    Y. Fu, P. Bailis, I. Stoica, and H. Zhang, “Lookahead decoding: Lossless generation acceleration for large language models,” in ICML, 2024

  39. [39]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023

  40. [40]

    RouteLLM: Learning to Route LLMs with Preference Data

    I. Ong, A. Almahairi, V. Wu, W.-L. Chiang, T. Wu, J. E. Gonzalez, and I. Stoica, “RouteLLM: Learning to route LLMs with preference data,” arXiv preprint arXiv:2406.18665, 2024

  41. [41]

    Lost in the middle: How language models use long contexts

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” TACL, vol. 12, pp. 157–173, 2024

  42. [42]

    LongBench: A bilingual, multitask benchmark for long context understanding

    Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li, “LongBench: A bilingual, multitask benchmark for long context understanding,” in ACL, 2024

  43. [43]

    Self-consistency improves chain of thought reasoning in language models

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in ICLR, 2023

  44. [44]

    Tree of thoughts: Deliberate problem solving with large language models

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in NeurIPS, 2023

  45. [45]

    Toolformer: Language models can teach themselves to use tools

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in NeurIPS, 2023

  46. [46]

    Voyager: An open-ended embodied agent with large language models

    G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” TMLR, 2024

  47. [47]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” in NeurIPS, 2024

  48. [48]

    Orca: A distributed serving system for transformer-based generative models

    G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for transformer-based generative models,” in OSDI, 2022, pp. 521–538

  49. [49]

    Online Computation and Competitive Analysis

    A. Borodin and R. El-Yaniv, Online Computation and Competitive Analysis. Cambridge, UK: Cambridge University Press, 1998

  50. [50]

    Introduction to online convex optimization

    E. Hazan, “Introduction to online convex optimization,” Foundations and Trends in Optimization, vol. 2, no. 3–4, pp. 157–325, 2016

  51. [51]

    Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services

    S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, pp. 51–59, 2002

  52. [52]

    Efficient mechanisms for bilateral trading

    R. B. Myerson and M. A. Satterthwaite, “Efficient mechanisms for bilateral trading,” Journal of Economic Theory, vol. 29, no. 2, pp. 265–281, 1983

  53. [53]

    A value for n-person games

    L. S. Shapley, “A value for n-person games,” in Contributions to the Theory of Games II, H. W. Kuhn and A. W. Tucker, Eds. Princeton, NJ, USA: Princeton University Press, 1953

  54. [54]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in NeurIPS, 2022, pp. 16344–16359

  55. [55]

    Shadow in the cache: Unveiling and mitigating privacy risks of KV-cache in LLM inference

    Z. Luo, S. Shao, S. Zhang, L. Zhou, Y. Hu, C. Zhao, Z. Liu, and Z. Qin, “Shadow in the cache: Unveiling and mitigating privacy risks of KV-cache in LLM inference,” in Proceedings of the 33rd Network and Distributed System Security Symposium (NDSS), 2026