pith. sign in

arxiv: 2605.18067 · v1 · pith:NLUSXB66new · submitted 2026-05-18 · 💻 cs.CL

PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

Pith reviewed 2026-05-20 11:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords personalized LLM agentsedge intelligencepeer-to-peer collaborationagent interoperabilityquery-agent matchingBayesian gameload balancingP2P network
0
0 comments X

The pith

PPAI enables personalized LLM agents on edge devices to collaborate peer-to-peer by routing tasks to specialized remote agents, improving accuracy up to 7.96% and reducing latency by 16.34%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PPAI as the first system allowing users with personalized LLM agents on edge devices to collaborate in a peer-to-peer network. Each user can delegate tasks to remote agents better suited for the query based on specialization rather than handling everything locally. It solves matching in a changing agent pool with a prototype-based scoring mechanism and handles rapid load shifts with a Bayesian game for local-global balance. A sympathetic reader would care because this expands the effective capabilities of limited edge hardware by sharing agent strengths across users without central servers. If correct, individual devices could complete a wider set of accurate tasks with lower delays by tapping into the diversity of nearby agents.

Core claim

PPAI is the first personalized LLM agent interoperability system which enables users to collaborate with each other based on agent specialization. It proposes a scalable prototype-based query-agent pair scoring mechanism to identify suitable agents within a P2P network with churn and a multi-agent interoperability Bayesian game to balance local demand and global efficiency when changes in remote agent load occur too quickly to be observed. A prototype implementation demonstrates that the system substantially broadens the range of tasks that could be carried out while maintaining load balance, achieving an average accuracy improvement of up to 7.96% across multiple tasks while reducinglatency

What carries the argument

The prototype-based query-agent pair scoring mechanism for matching in dynamic P2P networks combined with the multi-agent interoperability Bayesian game for load balancing under rapid unobserved changes.

If this is right

  • Tasks exceeding local agent expertise can be delegated to remote agents with better specialization for that query.
  • The matching process continues to function as agents join or leave the network.
  • Local device demand stays balanced against overall network efficiency even when full remote load data is unavailable.
  • A wider range of tasks becomes feasible on edge hardware while preserving system stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The prototype scoring idea could extend to matching problems in other volatile distributed systems beyond LLM agents.
  • Game-theoretic balancing may prove useful in additional P2P settings where observation lags behind change rates.
  • Widespread use might create selective sharing networks among personal AI agents without requiring shared training data.

Load-bearing premise

A prototype-based scoring method can reliably match queries to agents in a network where agents frequently appear and disappear, and the Bayesian game can keep local and global loads balanced when remote conditions change faster than direct observation allows.

What would settle it

Deploy the prototype in a simulated P2P network with high agent churn and load fluctuations that occur faster than measurement intervals, then measure whether accuracy gains stay above 5% or latency reductions hold relative to a non-collaborative baseline.

Figures

Figures reproduced from arXiv: 2605.18067 by Haodong Wang, Jian Lin, Kaibin Guo, Qianli Liu, Song Guo, Zicong Hong, Zile Wang.

Figure 1
Figure 1. Figure 1: Our vision for PPAI: Two decades ago, P2P networks [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy degradation across some tasks when selecting [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Task counts where each agent ranks as the top-1, top-2, [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: , each agent consists of system prompt, tool interfaces and specialized database to achieve personalized capability. When a user issues a query, it can be served either by the user’s local agent or by another agent in the network that is better suited for the task. To support such flexible and effective collaboration, our system routes each query to the most suitable agent across the network. Building on t… view at source ↗
Figure 6
Figure 6. Figure 6: Overview of our prototype-anchored framework for scalable query–agent scoring and matching. (a) Queries and agents [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of candidate models and our method’s [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
read the original abstract

Deploying large language model (LLM) on edge device enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents more suited for the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, the ever-changing pool of agents and their interchangeable capacity introduce new challenges when it comes to matching queries to agents and balancing loads, compared with existing P2P systems. Therefore, we propose a scalable query-agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency, when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that could be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PPAI, presented as the first personalized LLM agent interoperability system for P2P collaboration on edge devices. It proposes a scalable prototype-based query-agent pair scoring mechanism to handle agent churn and a multi-agent interoperability Bayesian game to balance local demand against global efficiency when remote loads change faster than they can be observed. The authors report that a prototype implementation broadens task coverage while maintaining load balance, achieving up to 7.96% average accuracy improvement and 16.34% latency reduction versus baseline across multiple tasks.

Significance. If the scoring and game mechanisms prove stable under realistic churn and sub-observation load shifts, the work could meaningfully advance collaborative edge intelligence by allowing users to delegate tasks to specialized remote agents. The approach addresses a timely gap between personalized edge LLMs and dynamic P2P networks, but its significance is currently limited by the absence of detailed validation for the two load-bearing mechanisms.

major comments (3)
  1. [Abstract] Abstract: The headline claims of +7.96% accuracy and -16.34% latency rest on the prototype-based scoring and Bayesian game, yet the abstract supplies no experimental setup, baselines, datasets, error bars, or implementation details. Without these, it is impossible to determine whether the reported gains survive the churn and rapid-load regimes identified as the core challenge.
  2. [Bayesian game section] Section describing the Bayesian game formulation: The claim that the game equilibrates local demand and global efficiency when remote loads change too quickly to observe requires a concrete reduction to observable quantities or a stability argument; the current description does not show how the equilibrium remains well-defined or incentive-compatible under the exact conditions the paper flags as problematic.
  3. [Prototype scoring section] Section on prototype-based scoring: The assertion that the mechanism reliably ranks remote agents despite churn lacks any analysis or experiment quantifying ranking accuracy as a function of churn rate or prototype update frequency; if ranking fails at realistic churn levels, the interoperability benefit reduces to local execution and the central performance claims do not hold.
minor comments (2)
  1. [Related work] The manuscript should include a dedicated related-work subsection that explicitly positions PPAI against prior P2P agent or edge-LLM systems rather than asserting novelty in the abstract alone.
  2. [Notation] Notation for the scoring function and game payoffs should be introduced once and used consistently; several terms appear to be defined only locally.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional clarity and validation would strengthen the manuscript. We address each major comment point by point below and commit to revisions that directly respond to the concerns about experimental context and mechanism robustness under churn and rapid load shifts.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of +7.96% accuracy and -16.34% latency rest on the prototype-based scoring and Bayesian game, yet the abstract supplies no experimental setup, baselines, datasets, error bars, or implementation details. Without these, it is impossible to determine whether the reported gains survive the churn and rapid-load regimes identified as the core challenge.

    Authors: We agree that the abstract would be more informative with a concise description of the experimental context. In the revised version we will expand the abstract to note the prototype implementation on edge devices, the multi-task evaluation (including specific datasets and task types), the baselines used for comparison, and that reported figures are averages with observed variance across runs. This will allow readers to better assess the relevance of the gains to the churn and load-shift scenarios emphasized in the paper. revision: yes

  2. Referee: [Bayesian game section] Section describing the Bayesian game formulation: The claim that the game equilibrates local demand and global efficiency when remote loads change too quickly to observe requires a concrete reduction to observable quantities or a stability argument; the current description does not show how the equilibrium remains well-defined or incentive-compatible under the exact conditions the paper flags as problematic.

    Authors: We accept that an explicit stability argument is needed. We will revise the Bayesian game section to include a formal reduction showing how the equilibrium is computed from local observations and a prior over unobserved load states, together with a proof sketch that the resulting strategy profile remains incentive-compatible and well-defined even when remote loads vary faster than direct observation. This addition will directly address the conditions highlighted as central to the problem. revision: yes

  3. Referee: [Prototype scoring section] Section on prototype-based scoring: The assertion that the mechanism reliably ranks remote agents despite churn lacks any analysis or experiment quantifying ranking accuracy as a function of churn rate or prototype update frequency; if ranking fails at realistic churn levels, the interoperability benefit reduces to local execution and the central performance claims do not hold.

    Authors: The referee correctly notes the absence of a dedicated churn-sensitivity analysis. We will add a new subsection (and supporting appendix) that reports ranking accuracy of the prototype-based scorer as a function of churn rate and prototype refresh interval, using both simulation and prototype measurements. The added results will quantify the operating regime in which ranking remains reliable and confirm that the reported accuracy and latency gains are achieved within realistic churn levels. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or results

full rationale

The paper proposes two new mechanisms (prototype-based query-agent scoring for churny P2P networks and a multi-agent Bayesian game for unobservable load changes) and then reports empirical gains from a prototype implementation. No equations, fitted parameters, or self-citations are shown reducing the accuracy or latency claims to the inputs by construction. The derivation chain consists of algorithmic proposals followed by external validation on tasks, which remains independent of the target performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone does not supply enough technical detail to enumerate specific free parameters, axioms, or invented entities. The work introduces new mechanisms for scoring and game-based balancing but their internal structure and assumptions remain unspecified.

pith-pipeline@v0.9.0 · 5767 in / 1233 out tokens · 55567 ms · 2026-05-20T11:15:08.482394+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 6 internal anchors

  1. [1]

    Demystifying small language models for edge deployment,

    Z. Lu, X. Li, D. Cai, R. Yi, F. Liu, W. Liu, J. Luan, X. Zhang, N. D. Lane, and M. Xu, “Demystifying small language models for edge deployment,” inACL, 2025

  2. [2]

    Small Language Models are the Future of Agentic AI

    P. Belcak, G. Heinrich, S. Diao, Y . Fu, X. Dong, S. Muralidharan, Y . C. Lin, and P. Molchanov, “Small language models are the future of agentic ai,”arXiv preprint arXiv:2506.02153, 2025

  3. [3]

    D 2moe: Dual routing and dynamic scheduling for efficient on-device moe-based llm serving,

    H. Wang, Q. Zhou, Z. Hong, and S. Guo, “D 2moe: Dual routing and dynamic scheduling for efficient on-device moe-based llm serving,” MobiCom, 2025

  4. [4]

    Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,

    Z. Fang, Y . Huang, Z. Hong, Y . Lyu, W. Chen, Y . Yu, F. Yu, and Z. Zheng, “Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,” inASPLOS, 2025

  5. [5]

    The power of scale for parameter-efficient prompt tuning,

    B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” inEMNLP, 2021

  6. [6]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” inICLR, 2022

  7. [7]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inNeurIPS, 2022

  8. [8]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” inNeurIPS, 2020

  9. [9]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection,

    A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” inICLR, 2023

  10. [10]

    Incentives build robustness in bittorrent,

    B. Cohen, “Incentives build robustness in bittorrent,” inP2P Econ, vol. 6, 2003, pp. 68–72

  11. [11]

    Kademlia: A peer-to-peer informa- tion system based on the xor metric,

    P. Maymounkov and D. Mazieres, “Kademlia: A peer-to-peer informa- tion system based on the xor metric,” inIPTPS, 2002, pp. 53–65

  12. [12]

    Gossip-based aggregation in large dynamic networks,

    M. Jelasity, A. Montresor, and O. Babaoglu, “Gossip-based aggregation in large dynamic networks,” inTOCS, vol. 23, no. 3, 2005, pp. 219–252

  13. [13]

    Routerdc: Query- based router by dual contrastive learning for assembling large language models,

    S. Chen, W. Jiang, B. Lin, J. Kwok, and Y . Zhang, “Routerdc: Query- based router by dual contrastive learning for assembling large language models,” inNeurIPS, 2024

  14. [14]

    Kabb: Knowledge-aware bayesian bandits for dynamic expert coordination in multi-agent systems,

    J. Zhang, Z. Huang, Y . Fan, N. Liu, M. Li, Z. Yang, J. Yao, J. Wang, and K. Wang, “Kabb: Knowledge-aware bayesian bandits for dynamic expert coordination in multi-agent systems,” inICML, 2025

  15. [15]

    Mind2web: Towards a generalist agent for the web,

    X. Deng, Y . Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y . Su, “Mind2web: Towards a generalist agent for the web,” inNeurIPS, 2023

  16. [16]

    Swe-agent: Agent-computer interfaces enable automated software engineering,

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” inNeurIPS, 2024

  17. [17]

    Metagpt: Meta programming for a multi-agent collaborative framework,

    S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Linet al., “Metagpt: Meta programming for a multi-agent collaborative framework,” inICLR, 2023

  18. [18]

    Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration,

    J. Wang, H. Xu, H. Jia, X. Zhang, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang, “Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration,” inNeurIPS, 2024

  19. [19]

    Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions,

    Y . Liu, H. Sun, W. Liu, J. Luan, B. Du, and R. Yan, “Mobilesteward: Integrating multiple app-oriented agents with self-evolution to automate cross-app instructions,” inKDD, 2025

  20. [20]

    Minirag: Towards extremely simple retrieval-augmented generation,

    T. Fan, J. Wang, X. Ren, and C. Huang, “Minirag: Towards extremely simple retrieval-augmented generation,”arXiv preprint arXiv:2501.06713, 2025

  21. [21]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism,

    Y . Huang, Y . Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V . Le, Y . Wuet al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” inNeurIPS, 2019

  22. [22]

    {TVM}: An automated{End-to-End} optimizing compiler for deep learning,

    T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y . Hu, L. Cezeet al., “{TVM}: An automated{End-to-End} optimizing compiler for deep learning,” inOSDI 18, 2018, pp. 578–594

  23. [23]

    Mell: Memory-efficient large language model serving via multi-gpu kv cache management,

    Q. Liu, Z. Hong, P. Li, F. Chen, and S. Guo, “Mell: Memory-efficient large language model serving via multi-gpu kv cache management,” in INFOCOM, 2025

  24. [24]

    Optq: Accurate post-training quantization for generative pre-trained transformers,

    E. Frantar, S. Ashkboos, T. Hoefler, and D.-A. Alistarh, “Optq: Accurate post-training quantization for generative pre-trained transformers,” in ICLR, 2023

  25. [25]

    Minillm: Knowledge distillation of large language models,

    Y . Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” inICLR, 2024

  26. [26]

    Routing to the expert: Efficient reward-guided ensemble of large language models,

    K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou, “Routing to the expert: Efficient reward-guided ensemble of large language models,” inNAACL, 2024

  27. [27]

    Routellm: Learning to route llms from preference data,

    I. Ong, A. Almahairi, V . Wu, W.-L. Chiang, T. Wu, J. E. Gonzalez, M. W. Kadous, and I. Stoica, “Routellm: Learning to route llms from preference data,” inICLR, 2025

  28. [28]

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion,

    D. Jiang, X. Ren, and B. Y . Lin, “Llm-blender: Ensembling large language models with pairwise ranking and generative fusion,” inACL, 2023

  29. [29]

    Fusing models with complementary expertise,

    H. Wang, F. M. Polo, Y . Sun, S. Kundu, E. Xing, and M. Yurochkin, “Fusing models with complementary expertise,” inICLR, 2024

  30. [30]

    Ties- merging: Resolving interference when merging models,

    P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties- merging: Resolving interference when merging models,” inNeurIPS, 2023

  31. [31]

    Qwen2.5: A party of foundation models!

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen2.5: A party of foundation models!”

  32. [32]

    Available: https://qwenlm.github.io/blog/qwen2.5/

    [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/

  33. [33]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  34. [34]

    OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

    W. U. Ahmad, S. Narenthiran, S. Majumdar, A. Ficek, S. Jain, J. Huang, V . Noroozi, and B. Ginsburg, “Opencodereasoning: Advancing data distillation for competitive coding,”arXiv preprint arXiv:2504.01943, 2025

  35. [35]

    HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, J. Hou, and B. Wang, “Huatuogpt-o1, towards medical complex reasoning with llms,”arXiv preprint arXiv:2412.18925, 2024

  36. [36]

    The llama 3 herd of models,

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,” 2024. [Online]. Available: https://ai.meta.com/research/ publications/the-llama-3-herd-of-models/

  37. [37]

    Sharegpt,

    OpenAI, “Sharegpt,” 2024. [Online]. Available: https://huggingface.co/ datasets/RyokoAI/ShareGPT52K

  38. [38]

    Prototypical networks for few-shot learning,

    J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” inNeurIPS, 2017

  39. [39]

    Learning to compare: Relation network for few-shot learning,

    F. Sung, Y . Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in CVPR, 2018

  40. [40]

    Tapnet: Neural network augmented with task-adaptive projection for few-shot learning,

    S. W. Yoon, J. Seo, and J. Moon, “Tapnet: Neural network augmented with task-adaptive projection for few-shot learning,” inICML, 2019

  41. [41]

    Gossip-based computation of aggregate information,

    D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,” inFOCS, 2003

  42. [42]

    How bad is selfish routing?

    T. Roughgarden and ´E. Tardos, “How bad is selfish routing?”Journal of the ACM, vol. 49, no. 2, pp. 236–259, 2002

  43. [43]

    Durrett,Probability: Theory and Examples, 5th ed., 2019

    R. Durrett,Probability: Theory and Examples, 5th ed., 2019

  44. [44]

    Sentence-bert: Sentence embeddings using siamese bert-networks,

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inEMNLP, 2019

  45. [45]

    Agglomerativeclustering,

    Scikit-learn, “Agglomerativeclustering,” 2024. [Online]. Avail- able: https://scikit-learn.org/stable/modules/generated/sklearn.cluster. AgglomerativeClustering.html

  46. [46]

    Measuring massive multitask language understanding,

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in ICLR, 2021

  47. [47]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakanoet al., “Training verifiers to solve math word problems,”arXiv preprint arXiv:2110.14168, 2021

  48. [48]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

    D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,”Applied Sciences, vol. 11, no. 14, p. 6421, 2021

  49. [49]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,”arXiv preprint arXiv:1803.05457, 2018

  50. [50]

    Agieval: A human-centric benchmark for evaluating foundation models,

    W. Zhong, R. Cui, Y . Guo, Y . Liang, S. Lu, Y . Wang, A. Saied, W. Chen, and N. Duan, “Agieval: A human-centric benchmark for evaluating foundation models,” inNAACL, 2024

  51. [51]

    More agents is all you need,

    J. Li, Q. Zhang, Y . Yu, Q. Fu, and D. Ye, “More agents is all you need,” inTMLR, 2024