pith. sign in

arxiv: 2507.21159 · v3 · submitted 2025-07-25 · 💻 cs.AI · cs.LG· cs.MA

MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

Pith reviewed 2026-05-19 03:07 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.MA
keywords Masked Agent CollaborationLarge Language ModelsMedical Decision-MakingMulti-Agent SystemsPareto OptimizationCross-ConsistencyLLM Collaboration
0
0 comments X

The pith

A masked collaboration method for large language model agents improves medical decision-making by removing inconsistent outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Masked Agent Collaboration framework to fix failures in multi-agent systems for healthcare tasks. It first picks a balanced set of models from a pool using Pareto optimization across size, speed, diversity, and throughput. It then computes cross-consistency from pairwise output similarities and masks the agent with the lowest value to drop likely inconsistent responses. The remaining agents then collaborate via adaptive progressive propagation, where each step feeds aggregated outputs from unmasked agents into the next layer through prompts. A sympathetic reader would care because current multi-agent setups often degrade under rigid patterns or conflicting outputs, and a reliable fix could support safer AI use in medical choices.

Core claim

The MAC framework harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity.

What carries the argument

The cross-consistency masking step, which measures pairwise output similarity among agents to identify and exclude the one producing the most semantically inconsistent output before collaboration continues.

If this is right

  • Pareto-frontier analysis selects agents that balance model capability with practical efficiency factors like inference time and throughput.
  • Masking the lowest cross-consistency agent removes outputs likely to conflict semantically with the group consensus.
  • Adaptive progressive propagation lets each agent incorporate aggregated results from unmasked agents in prior layers to refine its own output.
  • The overall process yields higher-quality final decisions in medical scenarios than static or unmasked collaboration patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking logic based on output similarity could apply to multi-agent setups in other domains requiring consistent group reasoning.
  • Incorporating diversity scores during agent selection may produce more robust outcomes than using multiple copies of the same model.
  • Further tests could check whether the method holds when the number of agents increases or when applied to specialized medical subfields.
  • The progressive propagation structure might be extended with iterative feedback to refine consistency across additional layers.

Load-bearing premise

That measuring pairwise output similarity to mask the lowest cross-consistency agent reliably eliminates semantically inconsistent outputs and produces superior final decisions compared to unmasked collaboration.

What would settle it

A head-to-head test on standard medical decision benchmarks where the masked method shows equal or lower accuracy than unmasked multi-agent collaboration or single-agent baselines would disprove the performance boost.

Figures

Figures reproduced from arXiv: 2507.21159 by Liuxin Bao, Yixuan Yuan, Zhihao Peng.

Figure 1
Figure 1. Figure 1: Comparisons on (a) NEJMQA and (b) MMLU-Pro-health demonstrate our substantial [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: For LLMs of equal parameter size, a higher SD value [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the proposed adaptive cluster collaborativeness. We first measure pairwise cross-consistency values between the LLM with the highest SD value and other models. Then, we iteratively mask the LLM showing the lowest pairwise CC value in the current layer and propagate only the outputs from remaining LLMs to the next layer. This adaptive mask mechanism significantly reduces the inconsistency of… view at source ↗
Figure 4
Figure 4. Figure 4: Comparisons of the ACC (in percentage), occupied [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Masked Agent Collaboration (MAC) framework for multi-agent LLM systems in medical decision-making. It first performs Pareto-frontier analysis on an LLM pool using factors including model size, inference time, diversity score (computed from pairwise output similarity within each LLM), and throughput ratio to select balanced collaborative agents. It then computes cross-consistency via pairwise similarity of agent outputs, masks the lowest-consistency agent to remove likely semantically inconsistent outputs, and performs adaptive progressive propagation in which each agent aggregates outputs from unmasked agents of the prior layer as input for prompt-engineered generation.

Significance. If the central mechanisms are shown to work, the framework offers a systematic approach to dynamic multi-agent collaboration that explicitly trades off efficiency and capability via Pareto selection while attempting to reduce inconsistency through masking and layered propagation. This could be relevant for practical deployment of LLM agents in healthcare. The multi-factor Pareto analysis is a constructive element that acknowledges real-world constraints.

major comments (3)
  1. [Abstract / MAC framework description] Abstract, paragraph on cross-consistency maximization: the assertion that masking the agent with the lowest cross-consistency value 'eliminates the output that is likely semantically inconsistent' is load-bearing for the entire pipeline yet unsupported. No error analysis, ground-truth comparison, or ablation (masked vs. unmasked) is provided to show that low-similarity agents are disproportionately incorrect rather than correct outliers.
  2. [Abstract / Experimental validation] Abstract, final paragraph on adaptive progressive propagation: the performance-boost claim ('boosting the medical decision-making capacity') rests on the pipeline but the manuscript reports no quantitative results, metrics, baselines, medical datasets, or statistical tests. Without these, it is impossible to evaluate whether the masking-plus-propagation steps yield net improvement.
  3. [Abstract / cross-consistency mechanism] Cross-consistency computation: pairwise similarity is derived solely from the outputs of the Pareto-selected agents themselves. This creates a circularity risk in which the masking rule may simply enforce output agreement by construction rather than correctness; no external validation set or expert adjudication is described to break the circle.
minor comments (2)
  1. [Abstract / diversity score and cross-consistency] The concrete similarity function (lexical overlap, embedding cosine, etc.) used for both diversity score and cross-consistency is not specified, hindering reproducibility of the Pareto selection and masking steps.
  2. [Abstract / adaptive progressive propagation] The number of propagation layers, the final aggregation rule, and the prompt templates used in each layer are not detailed, making the adaptive progressive propagation difficult to implement or replicate.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We appreciate the opportunity to clarify the MAC framework and strengthen the empirical support in the manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract / MAC framework description] Abstract, paragraph on cross-consistency maximization: the assertion that masking the agent with the lowest cross-consistency value 'eliminates the output that is likely semantically inconsistent' is load-bearing for the entire pipeline yet unsupported. No error analysis, ground-truth comparison, or ablation (masked vs. unmasked) is provided to show that low-similarity agents are disproportionately incorrect rather than correct outliers.

    Authors: We agree that the masking claim requires stronger empirical grounding. The manuscript currently motivates the step via the assumption that low cross-consistency signals semantic divergence in medical reasoning. To address this directly, we will add a dedicated ablation study comparing end-to-end performance with and without masking, plus an error analysis that reports accuracy of masked versus retained agents against ground-truth labels on the evaluation datasets. These additions will quantify whether low-consistency outputs are indeed more error-prone. revision: yes

  2. Referee: [Abstract / Experimental validation] Abstract, final paragraph on adaptive progressive propagation: the performance-boost claim ('boosting the medical decision-making capacity') rests on the pipeline but the manuscript reports no quantitative results, metrics, baselines, medical datasets, or statistical tests. Without these, it is impossible to evaluate whether the masking-plus-propagation steps yield net improvement.

    Authors: We apologize for any lack of visibility in the abstract. The full manuscript contains quantitative results on medical decision-making tasks, using standard clinical datasets, with metrics including accuracy and F1-score, comparisons against single-LLM and static multi-agent baselines, and statistical significance testing. We will revise the abstract to summarize the key performance gains and statistical outcomes so that the empirical contribution of the adaptive propagation is immediately evident. revision: yes

  3. Referee: [Abstract / cross-consistency mechanism] Cross-consistency computation: pairwise similarity is derived solely from the outputs of the Pareto-selected agents themselves. This creates a circularity risk in which the masking rule may simply enforce output agreement by construction rather than correctness; no external validation set or expert adjudication is described to break the circle.

    Authors: We acknowledge the circularity concern. While internal cross-consistency is designed to surface disagreements among capable agents, we will augment the manuscript with an external validation protocol: a held-out subset annotated by medical experts or using established ground-truth diagnoses will be used to measure error rates of masked versus unmasked outputs. This will demonstrate that the masking step improves correctness beyond mere consensus. revision: yes

Circularity Check

0 steps flagged

No significant circularity; MAC is a heuristic framework with independent empirical claims

full rationale

The paper presents a methodological pipeline: Pareto-frontier selection on model size/inference time/diversity (computed from intra-LLM output similarity) and throughput, followed by inter-agent pairwise similarity for cross-consistency masking, then adaptive progressive propagation via prompt engineering. These steps are explicit design choices and do not reduce by construction to the inputs (no equation equates final decision quality to the masking rule itself). No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The performance boost is framed as an empirical outcome of the framework rather than a self-referential derivation, rendering the chain self-contained against external medical benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The framework rests on standard multi-agent LLM assumptions plus the untested premise that Pareto selection plus consistency masking improves outcomes; no new entities are introduced.

free parameters (1)
  • diversity score threshold
    Derived from pairwise output similarity; exact computation and any cutoffs are not specified in the abstract.
axioms (2)
  • domain assumption Pareto-optimal models balance efficiency and capability for practical deployment
    Invoked when selecting agents from the LLM pool based on size, time, diversity, and throughput.
  • ad hoc to paper Lowest cross-consistency output is the one most likely to be semantically inconsistent
    Central to the masking step; no justification or prior evidence cited in abstract.

pith-pipeline@v0.9.0 · 5804 in / 1253 out tokens · 35975 ms · 2026-05-19T03:07:47.240123+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 9 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Are more llm calls all you need? towards the scaling properties of compound ai systems

    Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A Zaharia, and James Y Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems. Advances in Neural Information Processing Systems, 37:45767–45790, 2024

  4. [4]

    Can large language models provide useful feedback on research papers? a large-scale empirical analysis

    Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas V odrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, 1(8):AIoa2400196, 2024

  5. [5]

    Optimizing generative ai by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025

  6. [6]

    Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

  7. [7]

    Knowledge-infused prompting improves clinical text generation with large language models

    Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, and Carl Yang. Knowledge-infused prompting improves clinical text generation with large language models. In NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI, 2023

  8. [8]

    Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning

    Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems , 37:28858–28888, 2024

  9. [9]

    Gpt versus resident physicians—a benchmark based on official board scores

    Uriel Katz, Eran Cohen, Eliya Shachar, Jonathan Somer, Adam Fink, Eli Morse, Beki Shreiber, and Ido Wolf. Gpt versus resident physicians—a benchmark based on official board scores. NEJM AI, 1(5):AIdbp2300192, 2024

  10. [10]

    Map: Evaluation and multi-agent enhancement of large language models for inpatient pathways

    Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, and Yixuan Yuan. Map: Evaluation and multi-agent enhancement of large language models for inpatient pathways. arXiv preprint arXiv:2503.13205, 2025

  11. [11]

    Mixture-of-agents enhances large language model capabilities

    Junlin Wang, Jue WANG, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, 2025

  12. [12]

    Camel: Communicative agents for" mind" exploration of large language model society

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

  13. [13]

    Encouraging divergent thinking in large language models through multi- agent debate

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi- agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, 2024

  14. [14]

    Chateval: Towards better LLM-based evaluators through multi-agent debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2024. 10

  15. [15]

    Exploring collaboration mechanisms for LLM agents: A social psychology view

    Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14544–14607, Bangkok, Thailand, August 2024. Association for Computational Linguistics

  16. [16]

    Multi-llm debate: Framework, principals, and interventions

    Andrew Estornell and Yang Liu. Multi-llm debate: Framework, principals, and interventions. Advances in Neural Information Processing Systems, 37:28938–28964, 2024

  17. [17]

    Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration

    Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In ACL (1), 2024

  18. [18]

    Improv- ing factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023

  19. [19]

    Rethinking mixture-of-agents: Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025

    Wenzhe Li, Yong Lin, Mengzhou Xia, and Chi Jin. Rethinking mixture-of-agents: Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025

  20. [20]

    Reducing the global burden of tuberculosis: the contribution of improved diagnostics

    Emmett Keeler, Mark D Perkins, Peter Small, Christy Hanson, Steven Reed, Jane Cunningham, Julia E Aledort, Lee Hillborne, Maria E Rafael, Federico Girosi, et al. Reducing the global burden of tuberculosis: the contribution of improved diagnostics. Nature, 444(Suppl 1):49–57, 2006

  21. [21]

    Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams

    Yiqiu Shen, Farah E Shamout, Jamie R Oliver, Jan Witowski, Kawshik Kannan, Jungkyu Park, Nan Wu, Connor Huddleston, Stacey Wolfson, Alexandra Millet, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nature communications, 12(1):5645, 2021

  22. [22]

    Enhancing the reliability and accuracy of ai-enabled diagnosis via complementarity-driven deferral to clinicians

    Krishnamurthy Dvijotham, Jim Winkens, Melih Barsbey, Sumedh Ghaisas, Robert Stanforth, Nick Pawlowski, Patricia Strachan, Zahra Ahmed, Shekoofeh Azizi, Yoram Bachrach, et al. Enhancing the reliability and accuracy of ai-enabled diagnosis via complementarity-driven deferral to clinicians. Nature Medicine, 29(7):1814–1820, 2023

  23. [23]

    Mdagents: An adaptive collaboration of llms for medical decision-making

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, Hae Park, et al. Mdagents: An adaptive collaboration of llms for medical decision-making. Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  24. [24]

    Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration

    Bingbing Wen, Chenjun Xu, Robert Wolfe, Lucy Lu Wang, Bill Howe, et al. Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration. In NeurIPS 2024 Workshop on Behavioral Machine Learning, 2024

  25. [25]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  26. [26]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022

  27. [27]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad- vances in neural information processing systems, 36:11809–11822, 2023

  28. [28]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

  29. [29]

    Skeleton-of- thought: Large language models can do parallel decoding

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of- thought: Large language models can do parallel decoding. Proceedings ENLSP-III, 2023. 11

  30. [30]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  31. [31]

    Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? In 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, pages 6106–6131

    Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? In 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, pages 6106–6131. Association for Computational Linguistics (ACL), 2024

  32. [32]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  33. [33]

    Medagents: Large language models as collaborators for zero-shot medical reasoning

    Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023

  34. [34]

    InarXiv preprint arXiv:2309.13007

    Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023

  35. [35]

    arXiv preprint arXiv:2305.10142 , year =

    Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023

  36. [36]

    rapidfuzz/rapidfuzz: Release 3.8.1, April 2024

    Max Bachmann. rapidfuzz/rapidfuzz: Release 3.8.1, April 2024

  37. [37]

    Binary codes capable of correcting deletions, insertions, and reversals

    Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710. Soviet Union, 1966

  38. [38]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  39. [39]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  40. [40]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

  41. [41]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  42. [42]

    Open thoughts, February 2025

    OpenThoughts Team. Open thoughts, February 2025

  43. [43]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  44. [44]

    Introducing meta llama 3: The most capable openly available llm to date

    AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2(5):6, 2024

  45. [45]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  46. [46]

    Introducing dbrx: A new state-of-the-art open llm, 2024

    The Mosaic Research Team. Introducing dbrx: A new state-of-the-art open llm, 2024

  47. [47]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. 12

  48. [48]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

  49. [49]

    Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation,

    David MW Powers. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061, 2020

  50. [50]

    Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques

    Jacob Yerushalmy. Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques. Public Health Reports (1896-1970), pages 1432–1449, 1947

  51. [51]

    Sensitivity and specificity revisited: significance of the terms in analytic and diagnostic language

    AJ Saah and DR Hoover. Sensitivity and specificity revisited: significance of the terms in analytic and diagnostic language. In Annales de Dermatologie et de Venereologie, volume 125, pages 291–294, 1998

  52. [52]

    Comparison of the predicted and observed secondary structure of t4 phage lysozyme

    Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975

  53. [53]

    Interrater reliability: the kappa statistic

    Mary L McHugh. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282, 2012

  54. [54]

    Llm-topla: Efficient llm ensemble by maximising diversity

    Selim Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, and Ling Liu. Llm-topla: Efficient llm ensemble by maximising diversity. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 11951–11966, 2024. 13