PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

Haodong Wang; Jian Lin; Kaibin Guo; Qianli Liu; Song Guo; Zicong Hong; Zile Wang

arxiv: 2605.18067 · v1 · pith:NLUSXB66new · submitted 2026-05-18 · 💻 cs.CL

Zile Wang, Qianli Liu, Kaibin Guo, Haodong Wang, Jian Lin, Zicong Hong, Song Guo

Pith reviewed 2026-05-20 11:15 UTC · model grok-4.3

classification 💻 cs.CL

keywords personalized LLM agents · edge intelligence · peer-to-peer collaboration · agent interoperability · query-agent matching · Bayesian game · load balancing · P2P network

The pith

PPAI enables personalized LLM agents on edge devices to collaborate peer-to-peer by routing tasks to specialized remote agents, improving average accuracy by up to 7.96% and reducing latency by 16.34% relative to the baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PPAI as the first system allowing users with personalized LLM agents on edge devices to collaborate in a peer-to-peer network. Each user can delegate tasks to remote agents better suited for the query based on specialization rather than handling everything locally. It solves matching in a changing agent pool with a prototype-based scoring mechanism and handles rapid load shifts with a Bayesian game for local-global balance. A sympathetic reader would care because this expands the effective capabilities of limited edge hardware by sharing agent strengths across users without central servers. If correct, individual devices could complete a wider set of accurate tasks with lower delays by tapping into the diversity of nearby agents.

Core claim

PPAI is the first personalized LLM agent interoperability system which enables users to collaborate with each other based on agent specialization. It proposes a scalable prototype-based query-agent pair scoring mechanism to identify suitable agents within a P2P network with churn and a multi-agent interoperability Bayesian game to balance local demand and global efficiency when changes in remote agent load occur too quickly to be observed. A prototype implementation demonstrates that the system substantially broadens the range of tasks that could be carried out while maintaining load balance, achieving an average accuracy improvement of up to 7.96% across multiple tasks while reducing latency by 16.34% compared to the baseline.

What carries the argument

The prototype-based query-agent pair scoring mechanism for matching in dynamic P2P networks combined with the multi-agent interoperability Bayesian game for load balancing under rapid unobserved changes.
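To make the scoring idea concrete, here is a minimal sketch of one way prototype-anchored scoring could work. It is not the paper's implementation: the softmax weighting, the temperature, the two-dimensional embeddings, and every name (`PROTOTYPES`, `AGENT_PROFILES`, `score_agents`) are assumptions chosen for illustration. Each agent is described only by its accuracy on a fixed set of prototypes, so an agent joining or leaving adds or removes one profile row and nothing else.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prototype_weights(query_vec, prototypes, temperature=0.1):
    """Softmax over query-to-prototype similarities (assumed weighting scheme)."""
    sims = [cosine(query_vec, p) for p in prototypes]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def score_agents(query_vec, prototypes, agent_profiles):
    """Score each agent as the weighted average of its per-prototype accuracy."""
    weights = prototype_weights(query_vec, prototypes)
    return {
        agent: sum(w * acc for w, acc in zip(weights, accs))
        for agent, accs in agent_profiles.items()
    }

# Hypothetical data: two prototypes (say "math" and "medical") in a toy 2-D space.
PROTOTYPES = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical per-prototype accuracy advertised by each agent.
AGENT_PROFILES = {
    "local": [0.60, 0.55],
    "math_peer": [0.85, 0.40],
    "med_peer": [0.45, 0.80],
}

scores = score_agents([0.9, 0.1], PROTOTYPES, AGENT_PROFILES)
best_agent = max(scores, key=scores.get)
print(best_agent, scores)
```

With these made-up numbers the math-like query is routed to `math_peer`. The point of the design is that the cost of scoring grows with the number of prototypes and agents, not with the number of past queries.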

If this is right

Tasks exceeding local agent expertise can be delegated to remote agents with better specialization for that query.
The matching process continues to function as agents join or leave the network.
Local device demand stays balanced against overall network efficiency even when full remote load data is unavailable.
A wider range of tasks becomes feasible on edge hardware while preserving system stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The prototype scoring idea could extend to matching problems in other volatile distributed systems beyond LLM agents.
Game-theoretic balancing may prove useful in additional P2P settings where observation lags behind change rates.
Widespread use might create selective sharing networks among personal AI agents without requiring shared training data.

Load-bearing premise

A prototype-based scoring method can reliably match queries to agents in a network where agents frequently appear and disappear, and the Bayesian game can keep local and global loads balanced when remote conditions change faster than direct observation allows.
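The second half of this premise can be illustrated with a small sketch of a single user's best response under uncertainty about remote load. This is only one player's decision given a belief, not the equilibrium computation of the paper's Bayesian game, and the linear utility, the latency weight, and all numbers are hypothetical.

```python
def expected_latency(belief, latency_by_state):
    """Average remote latency under a belief over unobserved load states."""
    return sum(belief[state] * latency_by_state[state] for state in belief)

def utility(accuracy, latency, latency_weight):
    """Assumed linear trade-off between answer quality and delay."""
    return accuracy - latency_weight * latency

def choose_action(local, remote, belief, latency_weight=0.1):
    """Delegate only if the expected remote utility beats local execution."""
    u_local = utility(local["accuracy"], local["latency"], latency_weight)
    u_remote = utility(
        remote["accuracy"],
        expected_latency(belief, remote["latency_by_state"]),
        latency_weight,
    )
    return ("delegate" if u_remote > u_local else "local"), u_local, u_remote

# Hypothetical agents: the remote one is more accurate but slow when busy.
LOCAL = {"accuracy": 0.60, "latency": 1.0}
REMOTE = {"accuracy": 0.85, "latency_by_state": {"idle": 1.2, "busy": 6.0}}

mostly_idle = {"idle": 0.8, "busy": 0.2}
mostly_busy = {"idle": 0.2, "busy": 0.8}

action_idle, _, _ = choose_action(LOCAL, REMOTE, mostly_idle)
action_busy, _, _ = choose_action(LOCAL, REMOTE, mostly_busy)
print(action_idle, action_busy)
```

Under the first belief the accuracy advantage outweighs the expected delay and the query is delegated; under the second the user keeps it local. The premise is that decisions of this kind, taken by many users at once, still settle into a balanced load.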

What would settle it

Deploy the prototype in a simulated P2P network with high agent churn and load fluctuations that occur faster than measurement intervals, then measure whether accuracy gains stay above 5% or latency reductions hold relative to a non-collaborative baseline.
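Before running such an experiment, a back-of-the-envelope model can indicate which churn rates are worth testing. The sketch below assumes exponentially distributed agent session lifetimes, a fallback to the local agent whenever the matched agent has left, and a reading of the 5% threshold as absolute accuracy points. None of these assumptions come from the paper, latency is ignored, and all numbers are hypothetical.

```python
import math

def survival_probability(churn_rate, staleness):
    """P(matched agent is still online), assuming exponential session lifetimes."""
    return math.exp(-churn_rate * staleness)

def accuracy_gain(acc_local, acc_remote, churn_rate, staleness):
    """Expected gain over local-only execution when departed agents force a fallback."""
    p_alive = survival_probability(churn_rate, staleness)
    expected = p_alive * acc_remote + (1.0 - p_alive) * acc_local
    return expected - acc_local

def max_tolerable_churn(acc_local, acc_remote, staleness, min_gain=0.05):
    """Largest churn rate that keeps the expected gain at or above min_gain."""
    headroom = acc_remote - acc_local
    if headroom < min_gain:
        return None  # the threshold is unreachable even with no churn
    return -math.log(min_gain / headroom) / staleness

# Hypothetical setting: 10 points of headroom, directory entries 10 s stale.
ACC_LOCAL, ACC_REMOTE, STALENESS_S = 0.60, 0.70, 10.0
limit = max_tolerable_churn(ACC_LOCAL, ACC_REMOTE, STALENESS_S)

for rate in (0.0, limit, 2 * limit):
    gain = accuracy_gain(ACC_LOCAL, ACC_REMOTE, rate, STALENESS_S)
    print(f"churn={rate:.4f}/s  expected gain={gain:.4f}")
```

In this toy model the gain halves each time the product of churn rate and staleness grows by ln 2, so a simulated deployment should sweep churn on both sides of the computed limit.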

Figures

Figures reproduced from arXiv: 2605.18067 by Haodong Wang, Jian Lin, Kaibin Guo, Qianli Liu, Song Guo, Zicong Hong, Zile Wang.

**Figure 4.** Accuracy degradation across some tasks when selecting …

**Figure 3.** Task counts where each agent ranks as the top-1, top-2, …

**Figure 5.** … each agent consists of a system prompt, tool interfaces, and a specialized database to achieve personalized capability. When a user issues a query, it can be served either by the user’s local agent or by another agent in the network that is better suited for the task. To support such flexible and effective collaboration, our system routes each query to the most suitable agent across the network. Building on t…

**Figure 6.** Overview of our prototype-anchored framework for scalable query–agent scoring and matching. (a) Queries and agents …

**Figure 7.** Comparison of candidate models and our method’s …

Original abstract

Deploying large language model (LLM) on edge device enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents more suited for the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, the ever-changing pool of agents and their interchangeable capacity introduce new challenges when it comes to matching queries to agents and balancing loads, compared with existing P2P systems. Therefore, we propose a scalable query-agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency, when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that could be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

PPAI applies prototype scoring and a Bayesian game to P2P collaboration among personalized edge LLM agents, but the reported gains rest on unverified assumptions about churn and fast load shifts.

PPAI introduces a system for letting personalized LLM agents on edge devices delegate tasks to each other in a peer-to-peer network. It uses a prototype-based scoring method to match queries to suitable remote agents despite churn, plus a multi-agent Bayesian game to balance local demand against global efficiency when remote loads change faster than they can be observed. The prototype implementation is said to expand the range of doable tasks while keeping load balanced, with average gains of 7.96% accuracy and 16.34% lower latency versus baseline.

Referee Report

3 major / 2 minor

Summary. The paper introduces PPAI, presented as the first personalized LLM agent interoperability system for P2P collaboration on edge devices. It proposes a scalable prototype-based query-agent pair scoring mechanism to handle agent churn and a multi-agent interoperability Bayesian game to balance local demand against global efficiency when remote loads change faster than they can be observed. The authors report that a prototype implementation broadens task coverage while maintaining load balance, achieving up to 7.96% average accuracy improvement and 16.34% latency reduction versus baseline across multiple tasks.

Significance. If the scoring and game mechanisms prove stable under realistic churn and sub-observation load shifts, the work could meaningfully advance collaborative edge intelligence by allowing users to delegate tasks to specialized remote agents. The approach addresses a timely gap between personalized edge LLMs and dynamic P2P networks, but its significance is currently limited by the absence of detailed validation for the two load-bearing mechanisms.

major comments (3)

[Abstract] Abstract: The headline claims of +7.96% accuracy and -16.34% latency rest on the prototype-based scoring and Bayesian game, yet the abstract supplies no experimental setup, baselines, datasets, error bars, or implementation details. Without these, it is impossible to determine whether the reported gains survive the churn and rapid-load regimes identified as the core challenge.
[Bayesian game section] Section describing the Bayesian game formulation: The claim that the game equilibrates local demand and global efficiency when remote loads change too quickly to observe requires a concrete reduction to observable quantities or a stability argument; the current description does not show how the equilibrium remains well-defined or incentive-compatible under the exact conditions the paper flags as problematic.
[Prototype scoring section] Section on prototype-based scoring: The assertion that the mechanism reliably ranks remote agents despite churn lacks any analysis or experiment quantifying ranking accuracy as a function of churn rate or prototype update frequency; if ranking fails at realistic churn levels, the interoperability benefit reduces to local execution and the central performance claims do not hold.
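The second major comment asks for a reduction to observable quantities. One generic form such a reduction could take is a Bayes update of a belief about remote load from locally observed response times. The sketch below is not taken from the paper: the two-state load model, the "slow response" signal, the likelihoods, and the relaxation toward a base rate are all assumptions.

```python
import math

def update_belief(prior_busy, observed_slow,
                  p_slow_given_busy=0.7, p_slow_given_idle=0.1):
    """Posterior P(remote agent is busy) after one locally observed response."""
    like_busy = p_slow_given_busy if observed_slow else 1.0 - p_slow_given_busy
    like_idle = p_slow_given_idle if observed_slow else 1.0 - p_slow_given_idle
    evidence = like_busy * prior_busy + like_idle * (1.0 - prior_busy)
    return like_busy * prior_busy / evidence

def relax_belief(belief_busy, base_rate, elapsed, relaxation_time):
    """Drift back toward the long-run busy rate as the last observation ages."""
    decay = math.exp(-elapsed / relaxation_time)
    return base_rate + (belief_busy - base_rate) * decay

BASE_RATE = 0.3  # hypothetical long-run fraction of time the agent is busy

after_slow = update_belief(BASE_RATE, observed_slow=True)
after_fast = update_belief(after_slow, observed_slow=False)
aged = relax_belief(after_slow, BASE_RATE, elapsed=50.0, relaxation_time=5.0)
print(after_slow, after_fast, aged)
```

A slow response raises the belief from 0.3 to 0.75, a following fast response brings it back to 0.5, and a belief left unrefreshed for ten relaxation times is practically back at the base rate. A stability argument of the kind the referee requests would have to show that strategies built on such beliefs remain well-defined when the true load moves faster than this update loop.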

minor comments (2)

[Related work] The manuscript should include a dedicated related-work subsection that explicitly positions PPAI against prior P2P agent or edge-LLM systems rather than asserting novelty in the abstract alone.
[Notation] Notation for the scoring function and game payoffs should be introduced once and used consistently; several terms appear to be defined only locally.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional clarity and validation would strengthen the manuscript. We address each major comment point by point below and commit to revisions that directly respond to the concerns about experimental context and mechanism robustness under churn and rapid load shifts.

Referee: [Abstract] Abstract: The headline claims of +7.96% accuracy and -16.34% latency rest on the prototype-based scoring and Bayesian game, yet the abstract supplies no experimental setup, baselines, datasets, error bars, or implementation details. Without these, it is impossible to determine whether the reported gains survive the churn and rapid-load regimes identified as the core challenge.

Authors: We agree that the abstract would be more informative with a concise description of the experimental context. In the revised version we will expand the abstract to note the prototype implementation on edge devices, the multi-task evaluation (including specific datasets and task types), the baselines used for comparison, and that reported figures are averages with observed variance across runs. This will allow readers to better assess the relevance of the gains to the churn and load-shift scenarios emphasized in the paper. revision: yes
Referee: [Bayesian game section] Section describing the Bayesian game formulation: The claim that the game equilibrates local demand and global efficiency when remote loads change too quickly to observe requires a concrete reduction to observable quantities or a stability argument; the current description does not show how the equilibrium remains well-defined or incentive-compatible under the exact conditions the paper flags as problematic.

Authors: We accept that an explicit stability argument is needed. We will revise the Bayesian game section to include a formal reduction showing how the equilibrium is computed from local observations and a prior over unobserved load states, together with a proof sketch that the resulting strategy profile remains incentive-compatible and well-defined even when remote loads vary faster than direct observation. This addition will directly address the conditions highlighted as central to the problem. revision: yes
Referee: [Prototype scoring section] Section on prototype-based scoring: The assertion that the mechanism reliably ranks remote agents despite churn lacks any analysis or experiment quantifying ranking accuracy as a function of churn rate or prototype update frequency; if ranking fails at realistic churn levels, the interoperability benefit reduces to local execution and the central performance claims do not hold.

Authors: The referee correctly notes the absence of a dedicated churn-sensitivity analysis. We will add a new subsection (and supporting appendix) that reports ranking accuracy of the prototype-based scorer as a function of churn rate and prototype refresh interval, using both simulation and prototype measurements. The added results will quantify the operating regime in which ranking remains reliable and confirm that the reported accuracy and latency gains are achieved within realistic churn levels. revision: yes
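The churn-sensitivity analysis promised in the last response needs a ranking-quality metric. The following sketch shows two common choices, Kendall's tau and a top-1 hit test restricted to agents that are still online. The function names, agent names, and example rankings are hypothetical, and the paper may use different metrics.

```python
from itertools import combinations

def kendall_tau(reference, candidate):
    """Kendall rank correlation between two orderings of the same agents."""
    position = {agent: i for i, agent in enumerate(candidate)}
    shared = [agent for agent in reference if agent in position]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    # A pair is concordant if both rankings order it the same way.
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    return (2 * concordant - len(pairs)) / len(pairs)

def top1_hit(reference, candidate, online):
    """True if both rankings pick the same best agent among those online."""
    live_ref = [a for a in reference if a in online]
    live_cand = [a for a in candidate if a in online]
    return bool(live_ref) and bool(live_cand) and live_ref[0] == live_cand[0]

# Hypothetical rankings: an oracle order and one computed from stale prototypes.
oracle = ["math_peer", "local", "med_peer"]
stale = ["math_peer", "med_peer", "local"]

tau = kendall_tau(oracle, stale)
hit_all_online = top1_hit(oracle, stale, online={"math_peer", "local", "med_peer"})
hit_after_churn = top1_hit(oracle, stale, online={"local", "med_peer"})
print(tau, hit_all_online, hit_after_churn)
```

Here the stale ranking still picks the right agent while `math_peer` is online, but misroutes once it leaves. Plotting either metric against churn rate and prototype refresh interval would give the curve the referee asks for.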

Circularity Check

0 steps flagged

No significant circularity in derivation or results

The paper proposes two new mechanisms (prototype-based query-agent scoring for churny P2P networks and a multi-agent Bayesian game for unobservable load changes) and then reports empirical gains from a prototype implementation. No equations, fitted parameters, or self-citations are shown reducing the accuracy or latency claims to the inputs by construction. The derivation chain consists of algorithmic proposals followed by external validation on tasks, which remains independent of the target performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone does not supply enough technical detail to enumerate specific free parameters, axioms, or invented entities. The work introduces new mechanisms for scoring and game-based balancing but their internal structure and assumptions remain unspecified.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


prototype-anchored query-agent pair scoring … KL divergence … cosine similarity … Bayesian game … Cost of Delegation
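For readers unfamiliar with the two measures named in this passage, their standard textbook definitions are given here. The symbols $q$ (a query embedding), $p_k$ (the embedding of prototype $k$), and $P$, $Q$ (discrete distributions over the same index set) are notation assumed for this note; how the paper combines the two measures is not shown in this excerpt.

```latex
\cos(q, p_k) = \frac{q^{\top} p_k}{\lVert q \rVert_2 \, \lVert p_k \rVert_2},
\qquad
D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
```

Cosine similarity is symmetric and bounded in $[-1, 1]$, whereas the KL divergence is non-negative, asymmetric, and zero only when $P = Q$.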

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Demystifying small language models for edge deployment,

Z. Lu, X. Li, D. Cai, R. Yi, F. Liu, W. Liu, J. Luan, X. Zhang, N. D. Lane, and M. Xu, “Demystifying small language models for edge deployment,” inACL, 2025

work page 2025

[2] [2]

Small Language Models are the Future of Agentic AI

P. Belcak, G. Heinrich, S. Diao, Y . Fu, X. Dong, S. Muralidharan, Y . C. Lin, and P. Molchanov, “Small language models are the future of agentic ai,”arXiv preprint arXiv:2506.02153, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

D 2moe: Dual routing and dynamic scheduling for efficient on-device moe-based llm serving,

H. Wang, Q. Zhou, Z. Hong, and S. Guo, “D 2moe: Dual routing and dynamic scheduling for efficient on-device moe-based llm serving,” MobiCom, 2025

work page 2025

[4] [4]

Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,

Z. Fang, Y . Huang, Z. Hong, Y . Lyu, W. Chen, Y . Yu, F. Yu, and Z. Zheng, “Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,” inASPLOS, 2025

work page 2025

[5] [5]

The power of scale for parameter-efficient prompt tuning,

B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” inEMNLP, 2021

work page 2021

[6] [6]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” inICLR, 2022

work page 2022

[7] [7]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inNeurIPS, 2022

work page 2022

[8] [8]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” inNeurIPS, 2020

work page 2020

[9] [9]

Self-rag: Learning to retrieve, generate, and critique through self-reflection,

A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” inICLR, 2023

work page 2023

[10] [10]

Incentives build robustness in bittorrent,

B. Cohen, “Incentives build robustness in bittorrent,” inP2P Econ, vol. 6, 2003, pp. 68–72

work page 2003

[11] [11]

Kademlia: A peer-to-peer informa- tion system based on the xor metric,

P. Maymounkov and D. Mazieres, “Kademlia: A peer-to-peer informa- tion system based on the xor metric,” inIPTPS, 2002, pp. 53–65

work page 2002

[12] [12]

Gossip-based aggregation in large dynamic networks,

M. Jelasity, A. Montresor, and O. Babaoglu, “Gossip-based aggregation in large dynamic networks,” inTOCS, vol. 23, no. 3, 2005, pp. 219–252

work page 2005

[13] [13]

Routerdc: Query- based router by dual contrastive learning for assembling large language models,

S. Chen, W. Jiang, B. Lin, J. Kwok, and Y . Zhang, “Routerdc: Query- based router by dual contrastive learning for assembling large language models,” inNeurIPS, 2024

[14]

Kabb: Knowledge-aware bayesian bandits for dynamic expert coordination in multi-agent systems

J. Zhang, Z. Huang, Y. Fan, N. Liu, M. Li, Z. Yang, J. Yao, J. Wang, and K. Wang, “Kabb: Knowledge-aware bayesian bandits for dynamic expert coordination in multi-agent systems,” in ICML, 2025

[15]

Mind2web: Towards a generalist agent for the web

X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2web: Towards a generalist agent for the web,” in NeurIPS, 2023

[16]

Swe-agent: Agent-computer interfaces enable automated software engineering

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” in NeurIPS, 2024

[17]

Metagpt: Meta programming for a multi-agent collaborative framework

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin et al., “Metagpt: Meta programming for a multi-agent collaborative framework,” in ICLR, 2023

[18]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration

J. Wang, H. Xu, H. Jia, X. Zhang, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang, “Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration,” in NeurIPS, 2024

[19]

Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions

Y. Liu, H. Sun, W. Liu, J. Luan, B. Du, and R. Yan, “Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions,” in KDD, 2025

[20]

Minirag: Towards extremely simple retrieval-augmented generation

T. Fan, J. Wang, X. Ren, and C. Huang, “Minirag: Towards extremely simple retrieval-augmented generation,” arXiv preprint arXiv:2501.06713, 2025

[21]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” in NeurIPS, 2019

[22]

TVM: An automated end-to-end optimizing compiler for deep learning

T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in OSDI 18, 2018, pp. 578–594

[23]

Mell: Memory-efficient large language model serving via multi-gpu kv cache management

Q. Liu, Z. Hong, P. Li, F. Chen, and S. Guo, “Mell: Memory-efficient large language model serving via multi-gpu kv cache management,” in INFOCOM, 2025

[24]

Optq: Accurate post-training quantization for generative pre-trained transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D.-A. Alistarh, “Optq: Accurate post-training quantization for generative pre-trained transformers,” in ICLR, 2023

[25]

Minillm: Knowledge distillation of large language models

Y. Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” in ICLR, 2024

[26]

Routing to the expert: Efficient reward-guided ensemble of large language models

K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou, “Routing to the expert: Efficient reward-guided ensemble of large language models,” in NAACL, 2024

[27]

Routellm: Learning to route llms from preference data

I. Ong, A. Almahairi, V. Wu, W.-L. Chiang, T. Wu, J. E. Gonzalez, M. W. Kadous, and I. Stoica, “Routellm: Learning to route llms from preference data,” in ICLR, 2025

[28]

Llm-blender: Ensembling large language models with pairwise ranking and generative fusion

D. Jiang, X. Ren, and B. Y. Lin, “Llm-blender: Ensembling large language models with pairwise ranking and generative fusion,” in ACL, 2023

[29]

Fusing models with complementary expertise

H. Wang, F. M. Polo, Y. Sun, S. Kundu, E. Xing, and M. Yurochkin, “Fusing models with complementary expertise,” in ICLR, 2024

[30]

Ties-merging: Resolving interference when merging models

P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” in NeurIPS, 2023

[31]

Qwen2.5: A party of foundation models!

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen2.5: A party of foundation models!”

[32]

[Online]. Available: https://qwenlm.github.io/blog/qwen2.5/

[33]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

[34]

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

W. U. Ahmad, S. Narenthiran, S. Majumdar, A. Ficek, S. Jain, J. Huang, V. Noroozi, and B. Ginsburg, “Opencodereasoning: Advancing data distillation for competitive coding,” arXiv preprint arXiv:2504.01943, 2025

[35]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, J. Hou, and B. Wang, “Huatuogpt-o1, towards medical complex reasoning with llms,” arXiv preprint arXiv:2412.18925, 2024

[36]

The llama 3 herd of models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” 2024. [Online]. Available: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

[37]

Sharegpt

OpenAI, “Sharegpt,” 2024. [Online]. Available: https://huggingface.co/datasets/RyokoAI/ShareGPT52K

[38]

Prototypical networks for few-shot learning

J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in NeurIPS, 2017

[39]

Learning to compare: Relation network for few-shot learning

F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in CVPR, 2018

[40]

Tapnet: Neural network augmented with task-adaptive projection for few-shot learning

S. W. Yoon, J. Seo, and J. Moon, “Tapnet: Neural network augmented with task-adaptive projection for few-shot learning,” in ICML, 2019

[41]

Gossip-based computation of aggregate information

D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,” in FOCS, 2003

[42]

How bad is selfish routing?

T. Roughgarden and É. Tardos, “How bad is selfish routing?” Journal of the ACM, vol. 49, no. 2, pp. 236–259, 2002

[43]

Probability: Theory and Examples

R. Durrett, Probability: Theory and Examples, 5th ed., 2019

[44]

Sentence-bert: Sentence embeddings using siamese bert-networks

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in EMNLP, 2019

[45]

Agglomerativeclustering

Scikit-learn, “Agglomerativeclustering,” 2024. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

[46]

Measuring massive multitask language understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in ICLR, 2021

[47]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

[48]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,” Applied Sciences, vol. 11, no. 14, p. 6421, 2021

[49]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018

[50]

Agieval: A human-centric benchmark for evaluating foundation models

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan, “Agieval: A human-centric benchmark for evaluating foundation models,” in NAACL, 2024

[51]

More agents is all you need

J. Li, Q. Zhang, Y. Yu, Q. Fu, and D. Ye, “More agents is all you need,” in TMLR, 2024
