A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

Kun Yang; Lipeng Dai; Luping Xiang

arXiv: 2605.25422 · v1 · pith:MPTOQRLYnew · submitted 2026-05-25 · eess.SP · cs.AI · cs.IT · math.IT

Pith reviewed 2026-06-29 20:57 UTC · model grok-4.3

classification: eess.SP · cs.AI · cs.IT · math.IT

keywords: multi-agent collaboration · LLM · KV cache · token transmission · resource allocation · end-to-end latency · wireless networks · media selection

The pith

Joint media selection and resource allocation minimizes end-to-end latency for multi-agent LLM collaboration over wireless links by adapting between token and KV-cache transmission.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that neither token-based nor KV-cache-based transmission is uniformly optimal for multi-agent LLM collaboration in wireless settings. The better choice depends on available computational resources and channel conditions, which create an inherent end-to-end latency trade-off between inference and transmission costs. The authors formulate a joint optimization problem to minimize this latency and develop a low-complexity JMSRA algorithm that selects the media type and allocates bandwidth across heterogeneous links. A sympathetic reader would care because embodied agents in future networks require low-latency coordination to operate autonomously without being limited by fixed communication strategies. Numerical results show the adaptive scheme reduces latency relative to baselines that use only one media type.

Core claim

Neither token-based transmission nor key-value (KV) cache-based transmission is uniformly optimal across operating regimes, because performance depends critically on system parameters such as available computational resources and channel conditions. A joint optimization problem is formulated to minimize the end-to-end (E2E) latency of multi-agent collaboration, and a low-complexity joint media selection and resource allocation (JMSRA) algorithm is developed that adaptively coordinates the interaction media and bandwidth allocation over heterogeneous links, achieving markedly reduced E2E latency relative to conventional natural-language-only (NL-only) and KV-cache-only baselines.
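The trade-off can be made concrete with a toy calculation. The sketch below is not the paper's latency model, which is not visible in the abstract. It assumes a Shannon-rate link, a receiver that must spend prefill compute on received tokens, and a received KV cache that needs no further compute. Every function name and constant is a hypothetical placeholder chosen for illustration.

```python
import math

# Illustrative constants (assumptions, not values from the paper).
TOKEN_BITS = 16                      # one token id sent as a 16-bit integer
KV_BITS_PER_TOKEN = 2 * 32 * 4096 * 16  # K and V, 32 layers, 4096 hidden, 16-bit values
PREFILL_FLOPS_PER_TOKEN = 1.4e10     # rough cost for a model with about 7e9 parameters


def shannon_rate_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Idealised link rate: B * log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def token_latency_s(n_tokens: int, rate_bps: float, receiver_flops: float) -> float:
    """Small payload, but the receiver re-runs prefill to rebuild its KV cache."""
    transmit = n_tokens * TOKEN_BITS / rate_bps
    compute = n_tokens * PREFILL_FLOPS_PER_TOKEN / receiver_flops
    return transmit + compute


def kv_latency_s(n_tokens: int, rate_bps: float) -> float:
    """Large payload, with receiver-side prefill assumed to be skipped."""
    return n_tokens * KV_BITS_PER_TOKEN / rate_bps


def best_medium(n_tokens: int, rate_bps: float, receiver_flops: float) -> str:
    """Return whichever medium has the lower toy end-to-end latency."""
    t_token = token_latency_s(n_tokens, rate_bps, receiver_flops)
    t_kv = kv_latency_s(n_tokens, rate_bps)
    return "token" if t_token <= t_kv else "kv"


if __name__ == "__main__":
    # Fast link, weak receiver versus slow link, strong receiver.
    for bw_hz, snr, flops in ((100e6, 1023.0, 1e12), (10e6, 3.0, 1e14)):
        rate = shannon_rate_bps(bw_hz, snr)
        print(bw_hz, snr, flops, best_medium(512, rate, flops))
```

Under these assumed numbers the preferred medium flips between the two operating points, which is the qualitative behaviour the core claim describes. The crossover location depends entirely on the placeholder constants.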

What carries the argument

The joint media selection and resource allocation (JMSRA) algorithm that selects between token and KV-cache media while allocating bandwidth to minimize end-to-end latency under varying compute and channel conditions.
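To show what a joint choice of medium and bandwidth can look like, here is a minimal sketch. It is not the JMSRA algorithm, whose details are not given in the abstract. It assumes that links share a fixed total bandwidth, that the goal is to minimize the latency of the slowest link, and that each link's rate is its bandwidth times log2(1 + SNR). The `Link` fields, `min_bandwidth_hz`, and `allocate` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Link:
    snr_linear: float          # receive SNR (linear scale, assumed > 0)
    token_bits: float          # payload if tokens are sent
    kv_bits: float             # payload if the KV cache is sent
    token_compute_s: float     # receiver prefill time after a token transfer
    kv_compute_s: float = 0.0  # assumed negligible after a KV-cache transfer


def min_bandwidth_hz(link: Link, deadline_s: float) -> Tuple[float, Optional[str]]:
    """Least bandwidth that meets deadline_s on this link, and the medium that needs it."""
    spectral_eff = math.log2(1.0 + link.snr_linear)  # bits/s/Hz
    best: Tuple[float, Optional[str]] = (math.inf, None)
    for medium, bits, compute_s in (
        ("token", link.token_bits, link.token_compute_s),
        ("kv", link.kv_bits, link.kv_compute_s),
    ):
        slack_s = deadline_s - compute_s  # time left for transmission
        if slack_s > 0:
            need_hz = bits / (slack_s * spectral_eff)
            if need_hz < best[0]:
                best = (need_hz, medium)
    return best


def allocate(links: List[Link], total_bw_hz: float, iters: int = 60):
    """Bisect on a common deadline; each link picks its cheapest medium for that deadline."""
    def feasible(deadline_s: float) -> bool:
        return sum(min_bandwidth_hz(l, deadline_s)[0] for l in links) <= total_bw_hz

    lo, hi = 0.0, 1.0
    while not feasible(hi):  # grow the upper bound until the deadline is achievable
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi, [min_bandwidth_hz(l, hi) for l in links]
```

The bisection works because the bandwidth a link needs can only shrink as the deadline grows, so feasibility is monotone in the deadline. A different objective, such as total latency, would need a different search.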

If this is right

The optimal interaction medium varies with available computational resources.
The optimal interaction medium varies with channel conditions.
Adaptive media selection and bandwidth allocation over heterogeneous links reduces end-to-end latency compared with fixed-media baselines.
The approach enables efficient multi-agent collaboration in future wireless networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The adaptive strategy could extend to other latency-sensitive embodied agent tasks that mix symbolic and latent-space exchanges.
Designs for heterogeneous device networks might incorporate similar joint selection to handle varying compute capabilities.
Dynamic media switching during a single collaboration session could further reduce latency if channel or load conditions change rapidly.

Load-bearing premise

The analytical characterization of end-to-end latency accurately captures the different inference and transmission costs of token versus KV-cache media under practical wireless constraints.

What would settle it

A simulation or measurement across multiple computational resource levels and channel conditions in which the JMSRA algorithm produces equal or higher end-to-end latency than the fixed NL-only or KV-cache-only baselines.

Original abstract

The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in turn is expected to substantially increase east-west traffic. Although latent-space interaction mechanisms can enable more efficient collaboration than symbolic natural-language (NL) exchanges, prior work often abstracts away the associated communication overhead under practical wireless constraints. In embodied multi-agent settings, heterogeneous interaction media incur disparate inference and transmission costs, thereby inducing an inherent end-to-end (E2E) latency trade-off. To address this, we propose a joint design that integrates communication-media selection with wireless resource allocation. Through analytical characterization and simulation-based evaluation, we show that neither token-based transmission nor key-value (KV) cache-based transmission is uniformly optimal across operating regimes, as performance depends critically on system parameters such as available computational resources and channel conditions. Accordingly, we formulate a joint optimization problem aimed at minimizing the E2E latency of multi-agent collaboration and develop a low-complexity joint media selection and resource allocation (JMSRA) algorithm. Numerical results further confirm that, by adaptively coordinating the interaction media and bandwidth allocation over heterogeneous links, the proposed scheme achieves markedly reduced E2E latency relative to conventional NL-only and KV-cache-only baselines, enabling efficient and robust multi-agent collaboration in future wireless networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that neither token nor KV-cache transmission is always the lower-latency choice for multi-agent LLM collaboration over wireless links, and it gives a joint selection-plus-allocation algorithm that beats fixed baselines in simulations.

The central point is that token transmission and KV-cache transmission each have different inference and transmission costs, so neither wins in every regime; the right choice depends on compute availability and channel conditions. The authors formulate an end-to-end latency model that captures both, turn it into a joint optimization over media selection and bandwidth allocation, and supply the JMSRA algorithm to solve it at low complexity.

What the paper does cleanly is close the gap left by earlier work that treated communication overhead as negligible. The simulations confirm the adaptive scheme cuts latency relative to the two pure baselines, and the non-uniform optimality result follows directly from the parameter dependence they highlight.

The soft spots are the usual ones for this style of work. Everything rests on analytical latency expressions and Monte-Carlo runs; without the full derivations it is difficult to judge how faithfully the model reflects real LLM serving times or realistic wireless fading. The reported gains are described only qualitatively, so the practical size of the improvement is still unclear. The low-complexity claim for JMSRA also needs explicit runtime or iteration counts to be convincing.

This is aimed at researchers working at the wireless-LLM boundary for 6G multi-agent systems. A reader who cares about system-level latency trade-offs will find a usable technique here. The contribution is concrete enough and the modeling is honest enough that the paper should go to referees rather than be desk-rejected.

Referee Report

1 major / 0 minor

Summary. The paper claims that in embodied multi-agent LLM collaboration over wireless links, neither token-based nor KV-cache-based transmission is uniformly optimal, as the E2E latency trade-off depends on computational resources and channel conditions. It formulates a joint optimization problem to minimize E2E latency, develops a low-complexity JMSRA algorithm for media selection and resource allocation, and reports via analytical characterization and simulations that the adaptive scheme yields markedly lower latency than NL-only and KV-cache-only baselines.

Significance. If the latency model and simulation results hold under realistic wireless constraints, the work would be significant for 6G multi-agent systems by providing an adaptive strategy that exploits the disparate inference/transmission costs of the two media. The emphasis on a low-complexity algorithm and explicit comparison to fixed baselines is a practical strength.

major comments (1)

[Abstract] The manuscript as provided contains only the abstract; no equations, latency models, optimization formulation, or simulation parameters/figures are supplied. This prevents verification of the central analytical characterization of the E2E latency trade-off and of the numerical claim of 'markedly reduced' latency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the comments. We address the major comment below.

Referee: [Abstract] The manuscript as provided contains only the abstract; no equations, latency models, optimization formulation, or simulation parameters/figures are supplied. This prevents verification of the central analytical characterization of the E2E latency trade-off and of the numerical claim of 'markedly reduced' latency.

Authors: The full manuscript posted on arXiv:2605.25422 contains the complete E2E latency models for token-based and KV-cache-based transmission, the joint optimization formulation, the JMSRA algorithm derivation, analytical characterizations of the latency trade-offs under varying compute and channel conditions, and all simulation parameters with figures. It appears the review was performed on an abstract-only excerpt; we will ensure the complete document is supplied in the next submission round.

Revision: no

Circularity Check

0 steps flagged

No significant circularity identified

The provided source contains only the abstract and a note that full text is available elsewhere, with no equations, derivations, or self-citations visible. The central claim is an optimization formulation (JMSRA algorithm) whose latency reduction is evaluated numerically against explicit baselines; this structure does not reduce any prediction to its inputs by construction. Absent load-bearing equations or self-citation chains in the examined text, the derivation chain cannot be shown to collapse internally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the central claim rests on standard assumptions from wireless resource allocation and optimization theory (e.g., existence of heterogeneous media costs and tractable latency models), with no free parameters, axioms, or invented entities explicitly identified in the provided text.

pith-pipeline@v0.9.1-grok · 5773 in / 1159 out tokens · 30928 ms · 2026-06-29T20:57:56.835936+00:00

