MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

Liuxin Bao; Yixuan Yuan; Zhihao Peng

arxiv: 2507.21159 · v3 · submitted 2025-07-25 · 💻 cs.AI · cs.LG · cs.MA

Pith reviewed 2026-05-19 03:07 UTC · model grok-4.3

keywords Masked Agent Collaboration · Large Language Models · Medical Decision-Making · Multi-Agent Systems · Pareto Optimization · Cross-Consistency · LLM Collaboration

The pith

A masked collaboration method for large language model agents improves medical decision-making by removing inconsistent outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Masked Agent Collaboration framework to fix failures in multi-agent systems for healthcare tasks. It first picks a balanced set of models from a pool using Pareto optimization across size, speed, diversity, and throughput. It then computes cross-consistency from pairwise output similarities and masks the agent with the lowest value to drop likely inconsistent responses. The remaining agents then collaborate via adaptive progressive propagation, where each step feeds aggregated outputs from unmasked agents into the next layer through prompts. A sympathetic reader would care because current multi-agent setups often degrade under rigid patterns or conflicting outputs, and a reliable fix could support safer AI use in medical choices.

Core claim

The MAC framework harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity.

What carries the argument

The cross-consistency masking step, which measures pairwise output similarity among agents to identify and exclude the one producing the most semantically inconsistent output before collaboration continues.
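
The masking step can be illustrated with a minimal sketch. It assumes that an agent's cross-consistency is the mean similarity of its output to every other agent's output, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity function. Both choices are assumptions, since the text here does not specify the metric. The agent names and outputs are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper's actual metric may differ."""
    return SequenceMatcher(None, a, b).ratio()


def cross_consistency(outputs: dict[str, str]) -> dict[str, float]:
    """Mean similarity of each agent's output to all other agents' outputs."""
    scores = {}
    for name, text in outputs.items():
        others = [similarity(text, o) for n, o in outputs.items() if n != name]
        scores[name] = sum(others) / len(others)  # requires at least two agents
    return scores


def mask_lowest(outputs: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return the masked agent's name and the outputs that remain."""
    scores = cross_consistency(outputs)
    masked = min(scores, key=scores.get)
    kept = {n: o for n, o in outputs.items() if n != masked}
    return masked, kept


# Invented example outputs: two agents agree, one diverges.
outputs = {
    "agent_a": "The most likely diagnosis is bacterial pneumonia.",
    "agent_b": "The most likely diagnosis is bacterial pneumonia!",
    "agent_c": "Refer to dermatology for a skin biopsy.",
}
masked, kept = mask_lowest(outputs)
print(masked, sorted(kept))
```

Note that the rule only measures agreement: if `agent_c` happened to be the one correct answer, it would still be the one removed.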

If this is right

Pareto-frontier analysis selects agents that balance model capability with practical efficiency factors like inference time and throughput.
Masking the lowest cross-consistency agent removes outputs likely to conflict semantically with the group consensus.
Adaptive progressive propagation lets each agent incorporate aggregated results from unmasked agents in prior layers to refine its own output.
The overall process yields higher-quality final decisions in medical scenarios than static or unmasked collaboration patterns.
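
As a rough illustration of the Pareto-frontier selection in the first point, the sketch below keeps every candidate that no other candidate dominates. The direction of each objective (smaller size and latency preferred, higher diversity and throughput preferred) is an assumption, and the candidate names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    size_b: float      # parameters in billions (assumed: lower is cheaper)
    latency_s: float   # inference time per query (assumed: lower is better)
    diversity: float   # diversity score (assumed: higher is better)
    throughput: float  # throughput ratio (assumed: higher is better)

    def objectives(self) -> tuple[float, ...]:
        # Flip signs so that every objective reads "larger is better".
        return (-self.size_b, -self.latency_s, self.diversity, self.throughput)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on all objectives and better on one."""
    pa, pb = a.objectives(), b.objectives()
    no_worse = all(x >= y for x, y in zip(pa, pb))
    better = any(x > y for x, y in zip(pa, pb))
    return no_worse and better


def pareto_front(pool: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in pool if not any(dominates(o, c) for o in pool if o is not c)]


# Made-up pool; "mid-slow" is dominated by "mid" on every axis it differs on.
pool = [
    Candidate("small-fast", 7, 0.8, 0.55, 1.9),
    Candidate("mid", 32, 2.1, 0.70, 1.2),
    Candidate("mid-slow", 32, 3.0, 0.65, 0.9),
    Candidate("large", 70, 4.5, 0.80, 0.6),
]
front_names = [c.name for c in pareto_front(pool)]
print(front_names)
```

The frontier is a set of trade-offs rather than a single winner, so a further rule would still be needed to pick the final agents from it.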

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same masking logic based on output similarity could apply to multi-agent setups in other domains requiring consistent group reasoning.
Incorporating diversity scores during agent selection may produce more robust outcomes than using multiple copies of the same model.
Further tests could check whether the method holds when the number of agents increases or when applied to specialized medical subfields.
The progressive propagation structure might be extended with iterative feedback to refine consistency across additional layers.

Load-bearing premise

That measuring pairwise output similarity to mask the lowest cross-consistency agent reliably eliminates semantically inconsistent outputs and produces superior final decisions compared to unmasked collaboration.

What would settle it

A head-to-head test on standard medical decision benchmarks where the masked method shows equal or lower accuracy than unmasked multi-agent collaboration or single-agent baselines would disprove the performance boost.

Figures

Figures reproduced from arXiv: 2507.21159 by Liuxin Bao, Yixuan Yuan, Zhihao Peng.

**Figure 2.** For LLMs of equal parameter size, a higher SD value …

**Figure 3.** Illustration of the proposed adaptive cluster collaborativeness. We first measure pairwise cross-consistency values between the LLM with the highest SD value and other models. Then, we iteratively mask the LLM showing the lowest pairwise CC value in the current layer and propagate only the outputs from remaining LLMs to the next layer. This adaptive mask mechanism significantly reduces the inconsistency of…

**Figure 4.** Comparisons of the ACC (in percentage), occupied …

Original abstract

Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.
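
The abstract says the diversity score comes from the similarity between pairwise outputs within a single LLM. One plausible reading, sketched below, samples the same model several times and takes one minus the mean pairwise similarity. That formula, the `difflib` similarity, and the sample texts are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not necessarily the paper's metric."""
    return SequenceMatcher(None, a, b).ratio()


def diversity_score(samples: list[str]) -> float:
    """One minus the mean pairwise similarity over repeated samples of one model."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    mean_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim


# Invented samples: one model repeats itself, the other varies its answers.
repetitive = ["Start amoxicillin.", "Start amoxicillin.", "Start amoxicillin."]
varied = [
    "Start amoxicillin.",
    "Order a chest X-ray first.",
    "Likely viral; supportive care.",
]
print(diversity_score(repetitive), diversity_score(varied))
```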

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MAC sketches a Pareto-plus-masking pipeline for LLM agents in medicine, but the abstract shows no results or tests to back the performance claims.

The main takeaway is a proposed Masked Agent Collaboration setup that picks Pareto-optimal LLMs by balancing size, speed, diversity, and throughput, then masks the agent whose output has the lowest pairwise similarity before feeding the rest into adaptive propagation layers. That combination of selection and dynamic filtering is the concrete new piece, and it directly targets the rigidity problem in current multi-agent medical systems. The write-up does a clear job spelling out the practical trade-offs in agent construction and why static patterns can degrade fast in healthcare settings. The masking step is a simple, implementable rule that tries to drop semantically inconsistent outputs without extra models.

The soft spot is the complete absence of any experiments, metrics, baselines, or error breakdowns in the material we have. All the boosting claims sit on the mechanism description alone. The core assumption that lowest cross-consistency reliably removes the bad answer is also thin: agents can share the same pre-training biases and lock onto the same wrong diagnosis, so the mask might drop the only correct but divergent view instead. That matches the stress-test worry about correct outliers getting discarded.

This is aimed at people already building or tuning multi-agent LLM pipelines for clinical tasks who need concrete filtering ideas. The design thinking looks honest and grounded in the stated goals, even without data yet. I would send it to a serious referee if the full version contains proper evaluations and comparisons; otherwise it stays too preliminary for a full review.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Masked Agent Collaboration (MAC) framework for multi-agent LLM systems in medical decision-making. It first performs Pareto-frontier analysis on an LLM pool using factors including model size, inference time, diversity score (computed from pairwise output similarity within each LLM), and throughput ratio to select balanced collaborative agents. It then computes cross-consistency via pairwise similarity of agent outputs, masks the lowest-consistency agent to remove likely semantically inconsistent outputs, and performs adaptive progressive propagation in which each agent aggregates outputs from unmasked agents of the prior layer as input for prompt-engineered generation.

Significance. If the central mechanisms are shown to work, the framework offers a systematic approach to dynamic multi-agent collaboration that explicitly trades off efficiency and capability via Pareto selection while attempting to reduce inconsistency through masking and layered propagation. This could be relevant for practical deployment of LLM agents in healthcare. The multi-factor Pareto analysis is a constructive element that acknowledges real-world constraints.

major comments (3)

[Abstract / MAC framework description] Abstract, paragraph on cross-consistency maximization: the assertion that masking the agent with the lowest cross-consistency value 'eliminates the output that is likely semantically inconsistent' is load-bearing for the entire pipeline yet unsupported. No error analysis, ground-truth comparison, or ablation (masked vs. unmasked) is provided to show that low-similarity agents are disproportionately incorrect rather than correct outliers.
[Abstract / Experimental validation] Abstract, final paragraph on adaptive progressive propagation: the performance-boost claim ('boosting the medical decision-making capacity') rests on the pipeline but the manuscript reports no quantitative results, metrics, baselines, medical datasets, or statistical tests. Without these, it is impossible to evaluate whether the masking-plus-propagation steps yield net improvement.
[Abstract / cross-consistency mechanism] Cross-consistency computation: pairwise similarity is derived solely from the outputs of the Pareto-selected agents themselves. This creates a circularity risk in which the masking rule may simply enforce output agreement by construction rather than correctness; no external validation set or expert adjudication is described to break the circle.

minor comments (2)

[Abstract / diversity score and cross-consistency] The concrete similarity function (lexical overlap, embedding cosine, etc.) used for both diversity score and cross-consistency is not specified, hindering reproducibility of the Pareto selection and masking steps.
[Abstract / adaptive progressive propagation] The number of propagation layers, the final aggregation rule, and the prompt templates used in each layer are not detailed, making the adaptive progressive propagation difficult to implement or replicate.
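
To make the under-specified pieces concrete, here is one hedged sketch of layered propagation. Everything in it is an assumption rather than the authors' design: agents are modeled as plain callables from prompt to text, the prompt template is hypothetical, masking drops only the output (every agent still runs in the next layer), and the number of layers is a free argument. The stub agents ignore their prompt and exist only to exercise the control flow.

```python
from difflib import SequenceMatcher
from typing import Callable

Agent = Callable[[str], str]  # hypothetical interface: prompt in, answer out


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def drop_least_consistent(outputs: dict[str, str]) -> dict[str, str]:
    """Remove the output whose mean similarity to the others is lowest."""
    def score(name: str) -> float:
        rest = [similarity(outputs[name], o) for n, o in outputs.items() if n != name]
        return sum(rest) / len(rest)

    masked = min(outputs, key=score)
    return {n: o for n, o in outputs.items() if n != masked}


def build_prompt(question: str, previous: dict[str, str]) -> str:
    """Hypothetical template; the real prompts are not given in the source."""
    if not previous:
        return question
    context = "\n".join(f"- {text}" for text in previous.values())
    return f"{question}\n\nResponses from the previous layer:\n{context}"


def propagate(question: str, agents: dict[str, Agent], num_layers: int) -> dict[str, str]:
    """Run each layer on the unmasked outputs of the layer before it."""
    kept: dict[str, str] = {}
    for _ in range(num_layers):
        prompt = build_prompt(question, kept)
        outputs = {name: agent(prompt) for name, agent in agents.items()}
        kept = drop_least_consistent(outputs)
    return kept


# Stub agents with fixed, invented answers.
agents = {
    "a": lambda p: "Diagnosis: community-acquired pneumonia.",
    "b": lambda p: "Diagnosis: community-acquired pneumonia, start antibiotics.",
    "c": lambda p: "Unrelated: schedule a dental cleaning.",
}
final = propagate("Fever, cough, and focal crackles. Diagnosis?", agents, num_layers=2)
print(sorted(final))
```

The sketch returns the surviving outputs of the last layer and leaves the final aggregation rule open, which is exactly the detail the comment above asks the authors to supply.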

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We appreciate the opportunity to clarify the MAC framework and strengthen the empirical support in the manuscript. We address each major comment below.

Referee: [Abstract / MAC framework description] Abstract, paragraph on cross-consistency maximization: the assertion that masking the agent with the lowest cross-consistency value 'eliminates the output that is likely semantically inconsistent' is load-bearing for the entire pipeline yet unsupported. No error analysis, ground-truth comparison, or ablation (masked vs. unmasked) is provided to show that low-similarity agents are disproportionately incorrect rather than correct outliers.

Authors: We agree that the masking claim requires stronger empirical grounding. The manuscript currently motivates the step via the assumption that low cross-consistency signals semantic divergence in medical reasoning. To address this directly, we will add a dedicated ablation study comparing end-to-end performance with and without masking, plus an error analysis that reports accuracy of masked versus retained agents against ground-truth labels on the evaluation datasets. These additions will quantify whether low-consistency outputs are indeed more error-prone.
revision: yes
Referee: [Abstract / Experimental validation] Abstract, final paragraph on adaptive progressive propagation: the performance-boost claim ('boosting the medical decision-making capacity') rests on the pipeline but the manuscript reports no quantitative results, metrics, baselines, medical datasets, or statistical tests. Without these, it is impossible to evaluate whether the masking-plus-propagation steps yield net improvement.

Authors: We apologize for any lack of visibility in the abstract. The full manuscript contains quantitative results on medical decision-making tasks, using standard clinical datasets, with metrics including accuracy and F1-score, comparisons against single-LLM and static multi-agent baselines, and statistical significance testing. We will revise the abstract to summarize the key performance gains and statistical outcomes so that the empirical contribution of the adaptive propagation is immediately evident.
revision: yes
Referee: [Abstract / cross-consistency mechanism] Cross-consistency computation: pairwise similarity is derived solely from the outputs of the Pareto-selected agents themselves. This creates a circularity risk in which the masking rule may simply enforce output agreement by construction rather than correctness; no external validation set or expert adjudication is described to break the circle.

Authors: We acknowledge the circularity concern. While internal cross-consistency is designed to surface disagreements among capable agents, we will augment the manuscript with an external validation protocol: a held-out subset annotated by medical experts or using established ground-truth diagnoses will be used to measure error rates of masked versus unmasked outputs. This will demonstrate that the masking step improves correctness beyond mere consensus.
revision: yes
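
The error analysis discussed in these exchanges could take roughly the following shape: for each question, compare the masked agent's answer and the retained agents' answers against the ground-truth label, then report the two accuracies. The record layout and the toy data are invented for illustration and are not results from the paper.

```python
def masked_vs_retained_accuracy(records: list[dict]) -> tuple[float, float]:
    """Return (accuracy of masked outputs, accuracy of retained outputs)."""
    masked_hits = masked_total = kept_hits = kept_total = 0
    for rec in records:
        for agent, answer in rec["answers"].items():
            correct = int(answer == rec["truth"])
            if agent == rec["masked"]:
                masked_total += 1
                masked_hits += correct
            else:
                kept_total += 1
                kept_hits += correct
    if masked_total == 0 or kept_total == 0:
        raise ValueError("need at least one masked and one retained output")
    return masked_hits / masked_total, kept_hits / kept_total


# Toy multiple-choice records (made up). In the third, the masked agent
# was the only correct one: the "correct outlier" case the referee raises.
records = [
    {"truth": "B", "masked": "c", "answers": {"a": "B", "b": "B", "c": "D"}},
    {"truth": "A", "masked": "b", "answers": {"a": "A", "b": "C", "c": "A"}},
    {"truth": "C", "masked": "a", "answers": {"a": "C", "b": "D", "c": "D"}},
]
masked_acc, kept_acc = masked_vs_retained_accuracy(records)
print(f"masked: {masked_acc:.2f}, retained: {kept_acc:.2f}")
```

A masked accuracy well below the retained accuracy would support the masking assumption, while a masked accuracy close to or above it would suggest the rule is discarding correct outliers.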

Circularity Check

0 steps flagged

No significant circularity; MAC is a heuristic framework with independent empirical claims

The paper presents a methodological pipeline: Pareto-frontier selection on model size/inference time/diversity (computed from intra-LLM output similarity) and throughput, followed by inter-agent pairwise similarity for cross-consistency masking, then adaptive progressive propagation via prompt engineering. These steps are explicit design choices and do not reduce by construction to the inputs (no equation equates final decision quality to the masking rule itself). No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The performance boost is framed as an empirical outcome of the framework rather than a self-referential derivation, rendering the chain self-contained against external medical benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard multi-agent LLM assumptions plus the untested premise that Pareto selection plus consistency masking improves outcomes; no new entities are introduced.

free parameters (1)

diversity score threshold: derived from pairwise output similarity; exact computation and any cutoffs are not specified in the abstract.

axioms (2)

[domain assumption] Pareto-optimal models balance efficiency and capability for practical deployment. Invoked when selecting agents from the LLM pool based on size, time, diversity, and throughput.
[ad hoc to paper] Lowest cross-consistency output is the one most likely to be semantically inconsistent. Central to the masking step; no justification or prior evidence cited in abstract.

pith-pipeline@v0.9.0 · 5804 in / 1253 out tokens · 35975 ms · 2026-05-19T03:07:47.240123+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 9 internal anchors

[1]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Are more llm calls all you need? towards the scaling properties of compound ai systems

Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A Zaharia, and James Y Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems. Advances in Neural Information Processing Systems, 37:45767–45790, 2024

work page 2024
[4]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas V odrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, 1(8):AIoa2400196, 2024

work page 2024
[5]

Optimizing generative ai by backpropagating language model feedback

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025

work page 2025
[6]

Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

work page 1930
[7]

Knowledge-infused prompting improves clinical text generation with large language models

Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, and Carl Yang. Knowledge-infused prompting improves clinical text generation with large language models. In NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI, 2023

work page 2023
[8]

Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning

Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems , 37:28858–28888, 2024

work page 2024
[9]

Gpt versus resident physicians—a benchmark based on official board scores

Uriel Katz, Eran Cohen, Eliya Shachar, Jonathan Somer, Adam Fink, Eli Morse, Beki Shreiber, and Ido Wolf. Gpt versus resident physicians—a benchmark based on official board scores. NEJM AI, 1(5):AIdbp2300192, 2024

work page 2024
[10]

Map: Evaluation and multi-agent enhancement of large language models for inpatient pathways

Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, and Yixuan Yuan. Map: Evaluation and multi-agent enhancement of large language models for inpatient pathways. arXiv preprint arXiv:2503.13205, 2025

work page arXiv 2025
[11]

Mixture-of-agents enhances large language model capabilities

Junlin Wang, Jue WANG, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[12]

Camel: Communicative agents for" mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

work page 2023
[13]

Encouraging divergent thinking in large language models through multi- agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi- agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, 2024

work page 2024
[14]

Chateval: Towards better LLM-based evaluators through multi-agent debate

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2024. 10

work page 2024
[15]

Exploring collaboration mechanisms for LLM agents: A social psychology view

Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14544–14607, Bangkok, Thailand, August 2024. Association for Computational Linguistics

work page 2024
[16]

Multi-llm debate: Framework, principals, and interventions

Andrew Estornell and Yang Liu. Multi-llm debate: Framework, principals, and interventions. Advances in Neural Information Processing Systems, 37:28938–28964, 2024

work page 2024
[17]

Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration

Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In ACL (1), 2024

work page 2024
[18]

Improv- ing factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023

work page 2023
[19]

Rethinking mixture-of-agents: Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025

Wenzhe Li, Yong Lin, Mengzhou Xia, and Chi Jin. Rethinking mixture-of-agents: Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025

work page arXiv 2025
[20]

Reducing the global burden of tuberculosis: the contribution of improved diagnostics

Emmett Keeler, Mark D Perkins, Peter Small, Christy Hanson, Steven Reed, Jane Cunningham, Julia E Aledort, Lee Hillborne, Maria E Rafael, Federico Girosi, et al. Reducing the global burden of tuberculosis: the contribution of improved diagnostics. Nature, 444(Suppl 1):49–57, 2006

work page 2006
[21]

Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams

Yiqiu Shen, Farah E Shamout, Jamie R Oliver, Jan Witowski, Kawshik Kannan, Jungkyu Park, Nan Wu, Connor Huddleston, Stacey Wolfson, Alexandra Millet, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nature communications, 12(1):5645, 2021

work page 2021
[22]

Enhancing the reliability and accuracy of ai-enabled diagnosis via complementarity-driven deferral to clinicians

Krishnamurthy Dvijotham, Jim Winkens, Melih Barsbey, Sumedh Ghaisas, Robert Stanforth, Nick Pawlowski, Patricia Strachan, Zahra Ahmed, Shekoofeh Azizi, Yoram Bachrach, et al. Enhancing the reliability and accuracy of ai-enabled diagnosis via complementarity-driven deferral to clinicians. Nature Medicine, 29(7):1814–1820, 2023

work page 2023
[23]

Mdagents: An adaptive collaboration of llms for medical decision-making

Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, Hae Park, et al. Mdagents: An adaptive collaboration of llms for medical decision-making. Advances in Neural Information Processing Systems, 37:79410–79452, 2024

work page 2024
[24]

Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration

Bingbing Wen, Chenjun Xu, Robert Wolfe, Lucy Lu Wang, Bill Howe, et al. Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration. In NeurIPS 2024 Workshop on Behavioral Machine Learning, 2024

work page 2024
[25]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[26]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad- vances in neural information processing systems, 36:11809–11822, 2023

work page 2023
[28]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

work page 2024
[29]

Skeleton-of- thought: Large language models can do parallel decoding

Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of- thought: Large language models can do parallel decoding. Proceedings ENLSP-III, 2023. 11

work page 2023
[30]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

work page 2024
[31]

Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? In 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, pages 6106–6131

Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? In 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, pages 6106–6131. Association for Computational Linguistics (ACL), 2024

work page 2024
[32]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023

work page arXiv 2023
[34]

InarXiv preprint arXiv:2309.13007

Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023

work page arXiv 2023
[35]

arXiv preprint arXiv:2305.10142 , year =

Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023

work page arXiv 2023
[36]

rapidfuzz/rapidfuzz: Release 3.8.1, April 2024

Max Bachmann. rapidfuzz/rapidfuzz: Release 3.8.1, April 2024

work page 2024
[37]

Binary codes capable of correcting deletions, insertions, and reversals

Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710. Soviet Union, 1966

work page 1966
[38]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

work page 2024
[39]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

[41]

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

[42]

OpenThoughts Team. Open thoughts, February 2025

[43]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[44]

AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2(5):6, 2024

[45]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

[46]

The Mosaic Research Team. Introducing dbrx: A new state-of-the-art open llm, 2024

[47]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

[48]

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

[49]

David MW Powers. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061, 2020

[50]

Jacob Yerushalmy. Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques. Public Health Reports (1896-1970), pages 1432–1449, 1947

[51]

AJ Saah and DR Hoover. Sensitivity and specificity revisited: significance of the terms in analytic and diagnostic language. In Annales de Dermatologie et de Venereologie, volume 125, pages 291–294, 1998

[52]

Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975

[53]

Mary L McHugh. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282, 2012

[54]

Selim Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, and Ling Liu. Llm-topla: Efficient llm ensemble by maximising diversity. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11951–11966, 2024

