MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making
Pith reviewed 2026-05-19 03:07 UTC · model grok-4.3
The pith
A masked collaboration method for large language model agents improves medical decision-making by masking, before collaboration continues, the agent whose output is least consistent with its peers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MAC framework harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity.
What carries the argument
The cross-consistency masking step, which measures pairwise output similarity among agents to identify and exclude the one producing the most semantically inconsistent output before collaboration continues.
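A minimal sketch of this masking step, under stated assumptions: `difflib.SequenceMatcher` stands in for the similarity measure, and an agent's cross-consistency value is taken to be its mean similarity to every other agent's output. The function names and the example outputs are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def cross_consistency(outputs: list[str]) -> list[float]:
    """Mean similarity of each agent's output to every other agent's output."""
    if len(outputs) < 2:
        raise ValueError("need at least two agent outputs")
    scores = []
    for i, out in enumerate(outputs):
        others = [similarity(out, o) for j, o in enumerate(outputs) if j != i]
        scores.append(sum(others) / len(others))
    return scores


def mask_lowest(outputs: list[str]) -> tuple[int, list[str]]:
    """Return the index of the masked agent and the outputs that remain."""
    scores = cross_consistency(outputs)
    masked = min(range(len(outputs)), key=scores.__getitem__)
    kept = [o for i, o in enumerate(outputs) if i != masked]
    return masked, kept


# Hypothetical agent outputs, invented for illustration only.
outputs = [
    "Likely diagnosis: community-acquired pneumonia; start antibiotics.",
    "Most likely community-acquired pneumonia; begin antibiotics.",
    "Probable diagnosis: community-acquired pneumonia, treat with antibiotics.",
    "Findings suggest pulmonary embolism; order CT angiography.",
]
masked, kept = mask_lowest(outputs)
print(f"masked agent index: {masked}, agents kept: {len(kept)}")
```

Note that the rule only measures agreement: the masked output is the one least like its peers, which is not the same as the one most likely to be wrong.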
If this is right
- Pareto-frontier analysis selects agents that balance model capability with practical efficiency factors like inference time and throughput.
- Masking the lowest cross-consistency agent removes outputs likely to conflict semantically with the group consensus.
- Adaptive progressive propagation lets each agent incorporate aggregated results from unmasked agents in prior layers to refine its own output.
- The overall process yields higher-quality final decisions in medical scenarios than static or unmasked collaboration patterns.
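The Pareto-frontier selection in the first bullet can be sketched as a plain dominance filter. The direction of each factor is an assumption made for this sketch (smaller model size and inference time are treated as better, larger diversity score and throughput ratio as better), and the candidate names and numbers are invented.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    size_b: float      # parameters in billions (assumed: smaller is cheaper)
    latency_s: float   # inference time per query (assumed: smaller is better)
    diversity: float   # diversity score (assumed: larger is better)
    throughput: float  # throughput ratio (assumed: larger is better)


def as_costs(c: Candidate) -> tuple[float, ...]:
    """Map every factor to a 'smaller is better' cost."""
    return (c.size_b, c.latency_s, -c.diversity, -c.throughput)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every factor and strictly better on one."""
    ca, cb = as_costs(a), as_costs(b)
    no_worse = all(x <= y for x, y in zip(ca, cb))
    strictly_better = any(x < y for x, y in zip(ca, cb))
    return no_worse and strictly_better


def pareto_front(pool: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in pool if not any(dominates(o, c) for o in pool if o is not c)]


# Invented pool, for illustration only.
pool = [
    Candidate("model-a", 7, 1.0, 0.40, 1.0),
    Candidate("model-b", 32, 3.0, 0.55, 0.5),
    Candidate("model-c", 32, 3.5, 0.50, 0.4),  # dominated by model-b
    Candidate("model-d", 70, 6.0, 0.60, 0.2),
]
front = [c.name for c in pareto_front(pool)]
print(front)
```

If model size were instead read as a capability proxy (larger is better), only the sign in `as_costs` would change; the dominance filter itself stays the same.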
Where Pith is reading between the lines
- The same masking logic based on output similarity could apply to multi-agent setups in other domains requiring consistent group reasoning.
- Incorporating diversity scores during agent selection may produce more robust outcomes than using multiple copies of the same model.
- Further tests could check whether the method holds when the number of agents increases or when applied to specialized medical subfields.
- The progressive propagation structure might be extended with iterative feedback to refine consistency across additional layers.
Load-bearing premise
That measuring pairwise output similarity to mask the lowest cross-consistency agent reliably eliminates semantically inconsistent outputs and produces superior final decisions compared to unmasked collaboration.
What would settle it
A head-to-head test on standard medical decision benchmarks where the masked method shows equal or lower accuracy than unmasked multi-agent collaboration or single-agent baselines would disprove the performance boost.
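One way such a head-to-head comparison could be scored is a paired test on per-question correctness. The sketch below uses an exact two-sided McNemar test; that choice is an assumption of this example, not a procedure described in the paper, and the correctness vectors are synthetic.

```python
from math import comb


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-question correctness.

    Returns (questions only A got right, questions only B got right, p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    only_a = sum(x and not y for x, y in zip(correct_a, correct_b))
    only_b = sum(y and not x for x, y in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Synthetic correctness flags for 16 questions (not results from the paper).
masked_run = [True] * 8 + [False] * 1 + [True] * 5 + [False] * 2
unmasked_run = [False] * 8 + [True] * 1 + [True] * 5 + [False] * 2
only_masked, only_unmasked, p_value = mcnemar_exact(masked_run, unmasked_run)
print(only_masked, only_unmasked, round(p_value, 4))
```

A paired test is used because both systems answer the same questions, so only the questions on which they disagree carry information about which one is better.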
Original abstract
Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.
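The final step of the abstract, layered propagation over unmasked outputs, can be sketched with stub agents in place of LLM calls. Everything named here is hypothetical: the `Agent` interface, the prompt template, the use of `difflib.SequenceMatcher` as the similarity measure, and the assumption that every agent generates at every layer while only outputs are masked. The abstract does not state a final aggregation rule, so the sketch simply returns the last layer's unmasked outputs.

```python
from difflib import SequenceMatcher
from typing import Callable

Agent = Callable[[str], str]  # hypothetical interface: prompt in, answer out


def unmasked(outputs: list[str]) -> list[str]:
    """Drop the output with the lowest mean similarity to the others."""
    if len(outputs) < 3:  # assumption: no meaningful outlier below three outputs
        return list(outputs)

    def score(i: int) -> float:
        sims = [SequenceMatcher(None, outputs[i], o).ratio()
                for j, o in enumerate(outputs) if j != i]
        return sum(sims) / len(sims)

    worst = min(range(len(outputs)), key=score)
    return [o for i, o in enumerate(outputs) if i != worst]


def build_prompt(question: str, previous: list[str]) -> str:
    """Hypothetical template; the paper's prompts are not given here."""
    if not previous:
        return question
    notes = "\n".join(f"- {p}" for p in previous)
    return f"{question}\nResponses from the previous layer:\n{notes}"


def propagate(question: str, agents: list[Agent], num_layers: int) -> list[str]:
    """Run the layers; each layer sees only the previous layer's unmasked outputs."""
    previous: list[str] = []
    for _ in range(num_layers):
        prompt = build_prompt(question, previous)
        outputs = [agent(prompt) for agent in agents]
        previous = unmasked(outputs)
    return previous


def fixed(answer: str) -> Agent:
    """Stub agent that always returns the same answer."""
    return lambda prompt: answer


def follower(answer: str) -> Agent:
    """Stub agent that adopts the first propagated response once it sees one."""
    def run(prompt: str) -> str:
        seen = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
        return seen[0] if seen else answer
    return run


agents = [
    fixed("Diagnosis: pneumonia"),
    fixed("Diagnosis: pneumonia."),
    follower("Order CT angiography for suspected embolism"),
]
question = "Patient with fever and productive cough. What is the most likely diagnosis?"
final = propagate(question, agents, num_layers=2)
print(final)
```

In this toy run the dissenting stub is masked in the first layer and then converges on the propagated answer in the second. The second layer still masks one output even though all three nearly agree, which is a property of always removing the lowest-scoring agent.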
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Masked Agent Collaboration (MAC) framework for multi-agent LLM systems in medical decision-making. It first performs Pareto-frontier analysis on an LLM pool using factors including model size, inference time, diversity score (computed from pairwise output similarity within each LLM), and throughput ratio to select balanced collaborative agents. It then computes cross-consistency via pairwise similarity of agent outputs, masks the lowest-consistency agent to remove likely semantically inconsistent outputs, and performs adaptive progressive propagation in which each agent aggregates outputs from unmasked agents of the prior layer as input for prompt-engineered generation.
Significance. If the central mechanisms are shown to work, the framework offers a systematic approach to dynamic multi-agent collaboration that explicitly trades off efficiency and capability via Pareto selection while attempting to reduce inconsistency through masking and layered propagation. This could be relevant for practical deployment of LLM agents in healthcare. The multi-factor Pareto analysis is a constructive element that acknowledges real-world constraints.
major comments (3)
- [Abstract / MAC framework description] Abstract, paragraph on cross-consistency maximization: the assertion that masking the agent with the lowest cross-consistency value 'eliminates the output that is likely semantically inconsistent' is load-bearing for the entire pipeline yet unsupported. No error analysis, ground-truth comparison, or ablation (masked vs. unmasked) is provided to show that low-similarity agents are disproportionately incorrect rather than correct outliers.
- [Abstract / Experimental validation] Abstract, final paragraph on adaptive progressive propagation: the performance-boost claim ('boosting the medical decision-making capacity') rests on the pipeline but the manuscript reports no quantitative results, metrics, baselines, medical datasets, or statistical tests. Without these, it is impossible to evaluate whether the masking-plus-propagation steps yield net improvement.
- [Abstract / cross-consistency mechanism] Cross-consistency computation: pairwise similarity is derived solely from the outputs of the Pareto-selected agents themselves. This creates a circularity risk in which the masking rule may simply enforce output agreement by construction rather than correctness; no external validation set or expert adjudication is described to break the circle.
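The concern in the first and third comments can be made concrete with a toy error analysis. The sketch below assumes multiple-choice answers and exact-match similarity, so the least-consistent agent is simply the one holding the rarest answer; the two cases are invented to show both outcomes and are not data from the paper.

```python
from collections import Counter


def masked_index(answers: list[str]) -> int:
    """Index of the answer that agrees with the fewest peers (first one on ties).

    With exact-match similarity, agreement with peers is (count - 1) / (n - 1),
    so the least-consistent agent holds the rarest answer.
    """
    counts = Counter(answers)
    return min(range(len(answers)), key=lambda i: counts[answers[i]])


def masked_vs_retained_accuracy(cases: list[tuple[list[str], str]]) -> tuple[float, float]:
    """cases: (agent answers, gold label). Returns (masked accuracy, retained accuracy)."""
    masked_hits = retained_hits = retained_total = 0
    for answers, gold in cases:
        m = masked_index(answers)
        masked_hits += answers[m] == gold
        for i, answer in enumerate(answers):
            if i != m:
                retained_hits += answer == gold
                retained_total += 1
    return masked_hits / len(cases), retained_hits / retained_total


# Invented cases, for illustration only.
cases = [
    (["A", "A", "A", "D"], "A"),  # outlier is wrong: masking helps
    (["B", "B", "B", "C"], "C"),  # outlier is right: masking removes the only correct answer
]
masked_acc, retained_acc = masked_vs_retained_accuracy(cases)
print(masked_acc, retained_acc)
```

Both cases look identical to the masking rule, since each has three agreeing agents and one dissenter. Only the gold labels separate them, which is why an ablation against ground truth is needed to show that masked agents are disproportionately incorrect.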
minor comments (2)
- [Abstract / diversity score and cross-consistency] The concrete similarity function (lexical overlap, embedding cosine, etc.) used for both diversity score and cross-consistency is not specified, hindering reproducibility of the Pareto selection and masking steps.
- [Abstract / adaptive progressive propagation] The number of propagation layers, the final aggregation rule, and the prompt templates used in each layer are not detailed, making the adaptive progressive propagation difficult to implement or replicate.
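To illustrate what a fully specified similarity function would pin down, the sketch below assumes a normalized Levenshtein similarity and defines a model's diversity score as one minus the mean pairwise similarity of its own sampled outputs. Both definitions are assumptions made for this example; the abstract says only that the diversity score is derived from pairwise output similarity within an LLM.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def diversity_score(samples: list[str]) -> float:
    """Assumed definition: 1 - mean pairwise similarity of one model's samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return 1.0 - sum(similarity(a, b) for a, b in pairs) / len(pairs)


print(levenshtein("kitten", "sitting"))
print(diversity_score(["same", "same", "same"]))
print(diversity_score(["abcd", "wxyz"]))
```

A character-level measure like this rewards surface overlap, so two answers that state the same diagnosis in different words can score as dissimilar. An embedding-based measure would behave differently, which is why the choice matters for reproducing both the Pareto selection and the masking step.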
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We appreciate the opportunity to clarify the MAC framework and strengthen the empirical support in the manuscript. We address each major comment below.
-
Referee: [Abstract / MAC framework description] Abstract, paragraph on cross-consistency maximization: the assertion that masking the agent with the lowest cross-consistency value 'eliminates the output that is likely semantically inconsistent' is load-bearing for the entire pipeline yet unsupported. No error analysis, ground-truth comparison, or ablation (masked vs. unmasked) is provided to show that low-similarity agents are disproportionately incorrect rather than correct outliers.
Authors: We agree that the masking claim requires stronger empirical grounding. The manuscript currently motivates the step via the assumption that low cross-consistency signals semantic divergence in medical reasoning. To address this directly, we will add a dedicated ablation study comparing end-to-end performance with and without masking, plus an error analysis that reports accuracy of masked versus retained agents against ground-truth labels on the evaluation datasets. These additions will quantify whether low-consistency outputs are indeed more error-prone.
Revision: yes
-
Referee: [Abstract / Experimental validation] Abstract, final paragraph on adaptive progressive propagation: the performance-boost claim ('boosting the medical decision-making capacity') rests on the pipeline but the manuscript reports no quantitative results, metrics, baselines, medical datasets, or statistical tests. Without these, it is impossible to evaluate whether the masking-plus-propagation steps yield net improvement.
Authors: We apologize for any lack of visibility in the abstract. The full manuscript contains quantitative results on medical decision-making tasks, using standard clinical datasets, with metrics including accuracy and F1-score, comparisons against single-LLM and static multi-agent baselines, and statistical significance testing. We will revise the abstract to summarize the key performance gains and statistical outcomes so that the empirical contribution of the adaptive propagation is immediately evident.
Revision: yes
-
Referee: [Abstract / cross-consistency mechanism] Cross-consistency computation: pairwise similarity is derived solely from the outputs of the Pareto-selected agents themselves. This creates a circularity risk in which the masking rule may simply enforce output agreement by construction rather than correctness; no external validation set or expert adjudication is described to break the circle.
Authors: We acknowledge the circularity concern. While internal cross-consistency is designed to surface disagreements among capable agents, we will augment the manuscript with an external validation protocol: a held-out subset annotated by medical experts or using established ground-truth diagnoses will be used to measure error rates of masked versus unmasked outputs. This will demonstrate that the masking step improves correctness beyond mere consensus.
Revision: yes
Circularity Check
No significant circularity; MAC is a heuristic framework with independent empirical claims
The paper presents a methodological pipeline: Pareto-frontier selection on model size/inference time/diversity (computed from intra-LLM output similarity) and throughput, followed by inter-agent pairwise similarity for cross-consistency masking, then adaptive progressive propagation via prompt engineering. These steps are explicit design choices and do not reduce by construction to the inputs (no equation equates final decision quality to the masking rule itself). No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The performance boost is framed as an empirical outcome of the framework rather than a self-referential derivation, rendering the chain self-contained against external medical benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- diversity score threshold
axioms (2)
- domain assumption: Pareto-optimal models balance efficiency and capability for practical deployment
- ad hoc to paper: Lowest cross-consistency output is the one most likely to be semantically inconsistent
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value
-
IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.