pith. machine review for the scientific record.

arxiv: 2605.06177 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 10:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords biomedical deep research agents · evaluation toolkit · agent harnesses · context management · biomedical benchmarks · open-source toolkit · foundation model integration · performance evaluation

The pith

BioMedArena decouples biomedical agent evaluation into six layers so new models integrate in minutes and deliver consistent performance gains across shared benchmarks and tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inconsistent results across biomedical agent papers stem from differing harnesses, tool registries, and evaluation setups that impose a heavy per-paper engineering cost. It supplies an open toolkit that splits evaluation into six independent layers: benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring. Users add new foundation models, benchmarks, or tools by writing short provider adapters. The release ships 147 benchmarks and 75 tools together with six harnesses that apply distinct context-management strategies. These harnesses, when paired with twelve backbone models, produce state-of-the-art results on eight representative benchmarks and raise average accuracy by 15.03 percentage points over prior best scores.

Core claim

BioMedArena decouples biomedical agent evaluation into six layers and supplies standardized access to 147 benchmarks and 75 tools across nine families. By registering short provider adapters, users can add models, benchmarks, or tools without extensive custom code. The release includes six harnesses implementing distinct context-management strategies; when applied to twelve backbone models these yield competitive research performance and state-of-the-art scores on eight representative benchmarks, raising average accuracy by 15.03 percentage points relative to earlier reported results.
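
To make the "few-line provider adapter" claim concrete, here is a minimal sketch of what such a registration could look like; the `register_model` decorator, `ModelAdapter` class, and registry below are hypothetical illustrations of the pattern, not the actual BioMedArena API.

```python
# Hypothetical sketch of the "few-line provider adapter" pattern described in
# the paper. The registry, decorator, and class names are illustrative
# assumptions, not taken from the BioMedArena codebase.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ModelAdapter:
    name: str
    generate: Callable[[str], str]  # prompt -> completion

MODEL_REGISTRY: Dict[str, ModelAdapter] = {}

def register_model(name: str):
    """Register a backbone so every harness and benchmark sees the same surface."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        MODEL_REGISTRY[name] = ModelAdapter(name=name, generate=fn)
        return fn
    return wrap

@register_model("my-new-backbone")
def my_backbone(prompt: str) -> str:
    # A real adapter would call the provider's API here; this is a stub.
    return f"[my-new-backbone] answer to: {prompt[:40]}"

print(MODEL_REGISTRY["my-new-backbone"].generate("Which gene encodes p53?"))
```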

What carries the argument

The six-layer decoupling of agent evaluation, consisting of benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring.
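
A toy composition of those six layers is sketched below. Each function is a deliberately naive stand-in (a one-item benchmark, one stub tool, substring scoring) meant only to show where each layer sits in the pipeline; none of the names come from the BioMedArena codebase.

```python
# Illustrative-only sketch of how the six decoupled layers could compose into a
# single evaluation run; every function is a hypothetical stand-in.
from typing import Callable, Dict, List

def load_benchmark(name: str) -> List[dict]:                                 # 1. benchmark loading
    return [{"question": "Which protein does BRCA1 partner with?", "answer": "BARD1"}]

def expose_tools(families: List[str]) -> Dict[str, Callable]:                # 2. tool exposure
    return {"pubmed_search": lambda q: f"stub literature hits for {q!r}"}

def select_tools(question: str, tools: Dict[str, Callable]) -> List[str]:    # 3. tool selection
    return list(tools)                                                        # naive: offer everything

def execute(question: str, chosen: List[str], model: Callable[[str], str]) -> str:  # 4. execution mode
    return model(question)                                                    # single-shot; ReAct-style loops would go here

def manage_context(history: List[str], limit: int = 5) -> List[str]:         # 5. context management
    return history[-limit:]                                                   # e.g. a sliding window

def score(prediction: str, gold: str) -> float:                              # 6. scoring
    return float(gold.lower() in prediction.lower())

def run(model: Callable[[str], str]) -> float:
    tools, history, results = expose_tools(["literature"]), [], []
    for item in load_benchmark("demo"):
        chosen = select_tools(item["question"], tools)
        history = manage_context(history + [item["question"]])
        results.append(score(execute(item["question"], chosen, model), item["answer"]))
    return sum(results) / len(results)

print(run(lambda q: "It heterodimerizes with BARD1."))  # -> 1.0 on the toy item
```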

If this is right

  • Adding a new foundation model requires only a few-line adapter instead of weeks of engineering.
  • Different models can be evaluated on identical benchmarks and tools for direct head-to-head comparison.
  • The six context-management strategies produce measurable gains in agent performance on biomedical tasks.
  • Researchers obtain per-task traces and configurations that support reproducible experiments.
  • New benchmarks or tools can be registered while preserving the same evaluation surface for all prior models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of the toolkit could reduce contradictory performance claims by enforcing a common evaluation surface across papers.
  • The decoupling pattern could be applied to agent evaluation in non-biomedical scientific domains if similar registries of tasks and tools are created.
  • Testing whether the reported performance lift remains when the harnesses are run on tasks outside the original eight benchmarks would clarify the generality of the gains.
  • Releasing the full set of traces allows direct inspection of where each context strategy succeeds or fails on individual biomedical questions.

Load-bearing premise

The chosen benchmarks, tools, and harnesses produce fair comparisons that are free of hidden biases from selection or implementation details.

What would settle it

Independent re-implementation of the six harnesses on the same eight benchmarks yields average scores no higher than the previous state-of-the-art.

Figures

Figures reproduced from arXiv: 2605.06177 by David A. Clifton, Fenglin Liu, Honghan Wu, Hongjian Zhou, Jiayuan Zhu, Jiazhen Pan, Jinge Wu, Junde Wu, Mingde Zeng, Sean Wu.

Figure 1. Performance gains under BioMedArena across 8 representative biomedical benchmarks …
Figure 2. Overview of the BioMedArena toolkit: a unified biomedical benchmark interface, a tool …
Figure 3. Dataflow of a biomedical deep research agent in BioMedArena. A natural-language …
Figure 4. MUTUAL-EVOLVE workflow. For each question, N parallel solvers at distinct temperatures first explore privately, then share findings through a Global Workspace at iteration T. The workspace has four typed banks (guide, tool, skill, error); solvers read it every K iterations and may terminate at different end iterations e_i. Once all solvers finish, each performs a text-only final confirmation over the full … (a workflow sketch follows this figure list)
Figure 5. Unpacks the LAB-Bench 2 Overall number into its 7 text-only subsets, plotting per-subset …
Figure 6. LAB-Bench 2 per-subset accuracy heatmap (%) across 6 backbones (2 Gemini, 4 Claude).
Figure 7. Tool registry organized by biomedical skill family. The 33 category tags group into 9 …
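
The Figure 4 caption reads as an iterate-share-confirm loop. Below is an editorial rendering of that loop in plain Python: the `GlobalWorkspace` class, the `solve_step` stub, and the share/read schedule are assumptions reconstructed from the caption, not the toolkit's actual MUTUAL-EVOLVE implementation.

```python
# Editorial sketch of the MUTUAL-EVOLVE loop described in the Figure 4 caption.
# Solver behavior and the workspace structure are illustrative assumptions.
import random
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GlobalWorkspace:
    # Four typed banks named in the caption: guide, tool, skill, error.
    banks: Dict[str, List[str]] = field(
        default_factory=lambda: {"guide": [], "tool": [], "skill": [], "error": []})

    def share(self, bank: str, note: str) -> None:
        self.banks[bank].append(note)

def solve_step(question: str, temperature: float, view: Dict[str, List[str]]) -> str:
    # Stub for one private reasoning/tool-use step of a solver.
    shared = sum(len(notes) for notes in view.values())
    return f"draft@T={temperature:.1f} ({shared} shared notes seen)"

def mutual_evolve(question: str, n_solvers: int = 3, share_at: int = 2,
                  read_every: int = 2, max_iters: int = 6) -> List[str]:
    workspace = GlobalWorkspace()
    temperatures = [0.2 + 0.3 * i for i in range(n_solvers)]                     # distinct temperatures
    end_iters = [random.randint(share_at, max_iters) for _ in range(n_solvers)]  # different end iterations e_i
    drafts = [""] * n_solvers

    for t in range(1, max_iters + 1):
        for i, temp in enumerate(temperatures):
            if t > end_iters[i]:
                continue                                          # this solver has already finished
            view = workspace.banks if t % read_every == 0 else {} # read the workspace every K iterations
            drafts[i] = solve_step(question, temp, view)
            if t == share_at:                                     # share findings at iteration T
                workspace.share("guide", f"solver {i}: {drafts[i]}")

    # Text-only final confirmation pass once every solver has finished.
    return [f"confirmed: {d}" for d in drafts]

print(mutual_evolve("Which mutation commonly drives imatinib resistance in CML?"))
```
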
original abstract

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents BioMedArena, an open-source toolkit that decouples biomedical deep research agent evaluation into six layers (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring). It registers 147 benchmarks and 75 tools across 9 families, supplies 6 agent harnesses with 6 context-management strategies applied to 12 backbone models, and reports that these achieve state-of-the-art results on 8 representative biomedical benchmarks with an average lift of +15.03 percentage points over prior SOTA. The toolkit reduces model integration to registering short provider adapters and releases configurations plus per-task traces.

Significance. If the performance claims hold under the standardized harnesses, the work would meaningfully lower the per-paper engineering tax in biomedical agent research and enable more reproducible head-to-head comparisons across foundation models. The open-source release with explicit adapters, benchmark/tool registries, and traces is a concrete community resource that directly addresses the reproducibility issues the authors identify.

major comments (3)
  1. [Abstract] Abstract: The headline SOTA claim (average +15.03 pp lift on 8 benchmarks) is load-bearing for the paper's contribution yet rests on an unverified assumption of baseline equivalence. The manuscript motivates the toolkit by noting that 'the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ,' but does not state whether the cited prior SOTA numbers were re-executed inside BioMedArena's six-layer framework (with identical tool exposure, context strategies, and scoring) or simply copied from the original heterogeneous papers. Without this, the lift cannot be attributed to the new agent designs rather than standardization alone.
  2. [Evaluation] Evaluation section (or §4/§5): No information is supplied on baseline selection criteria, statistical testing (e.g., paired t-tests or bootstrap confidence intervals on the per-benchmark lifts; a sketch of the latter follows the minor comments below), or variance across the 12 backbones. The claim of 'significantly improved performance' therefore lacks the quantitative support needed to substantiate superiority over prior work.
  3. [Agent harnesses] Agent harnesses and context-management strategies: The mapping from '6 agent harnesses with 6 context-management strategies' to '12 backbones' is stated without an accompanying ablation or per-strategy breakdown. It is therefore unclear which of the six layers (particularly context management) drives the reported gains and whether the improvements generalize across all 147 registered benchmarks or only the selected 8.
minor comments (3)
  1. A table or appendix listing all 147 benchmarks and 75 tools (or at least the 9 functional families with representative examples) would improve usability and allow readers to assess coverage.
  2. The GitHub repository link should be accompanied by a brief description of the exact commit or release tag used for the reported experiments to support reproducibility.
  3. [Abstract] Clarify the exact arithmetic behind '12 backbones' (6 harnesses × 6 strategies?) and whether every combination was evaluated on every benchmark.
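
To make the statistical test requested in major comment 2 concrete, the sketch below computes a percentile bootstrap confidence interval over per-benchmark lifts; the lift values are placeholders invented for illustration, not numbers taken from the paper.

```python
# Sketch of a percentile bootstrap CI on the mean per-benchmark lift
# (new score minus prior SOTA). All numbers below are illustrative placeholders.
import random
from typing import List, Tuple

def bootstrap_ci(lifts: List[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean per-benchmark lift."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(lifts, k=len(lifts))) / len(lifts)  # resample benchmarks with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical lifts in percentage points on 8 benchmarks (mean ~15 pp by construction).
example_lifts = [22.0, 9.5, 18.3, 4.1, 27.6, 11.2, 14.9, 12.6]
print(bootstrap_ci(example_lifts))  # prints a 95% percentile interval around the ~15 pp mean
```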

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments help clarify how to better substantiate our performance claims and the toolkit's contributions. We respond to each major comment below and indicate the revisions we will make.

point-by-point responses
  1. Referee: [Abstract] Abstract: The headline SOTA claim (average +15.03 pp lift on 8 benchmarks) is load-bearing for the paper's contribution yet rests on an unverified assumption of baseline equivalence. The manuscript motivates the toolkit by noting that 'the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ,' but does not state whether the cited prior SOTA numbers were re-executed inside BioMedArena's six-layer framework (with identical tool exposure, context strategies, and scoring) or simply copied from the original heterogeneous papers. Without this, the lift cannot be attributed to the new agent designs rather than standardization alone.

    Authors: We appreciate the referee highlighting this ambiguity. The prior SOTA numbers were sourced from the original publications rather than re-executed inside BioMedArena, which is standard practice but means the reported lift incorporates both our harness improvements and standardization effects. We will revise the abstract and evaluation section to explicitly state that comparisons use originally reported figures and to emphasize that the toolkit's primary value is enabling consistent, reproducible evaluations going forward. revision: yes

  2. Referee: [Evaluation] Evaluation section (or §4/§5): No information is supplied on baseline selection criteria, statistical testing (e.g., paired t-tests or bootstrap confidence intervals on the per-benchmark lifts), or variance across the 12 backbones. The claim of 'significantly improved performance' therefore lacks the quantitative support needed to substantiate superiority over prior work.

    Authors: We agree that these details are needed to support the claims. In the revised manuscript we will add: (i) explicit criteria for selecting the 8 benchmarks as representative of the 147 (covering diverse biomedical domains and task types), (ii) statistical tests including paired t-tests or bootstrap confidence intervals on the per-benchmark lifts, and (iii) variance information or per-backbone results across the 12 backbones. These additions will provide the required quantitative grounding. revision: yes

  3. Referee: [Agent harnesses] Agent harnesses and context-management strategies: The mapping from '6 agent harnesses with 6 context-management strategies' to '12 backbones' is stated without an accompanying ablation or per-strategy breakdown. It is therefore unclear which of the six layers (particularly context management) drives the reported gains and whether the improvements generalize across all 147 registered benchmarks or only the selected 8.

    Authors: The 12 backbones are the twelve foundation models evaluated under the 6 harnesses and their 6 context-management strategies; we will state this mapping explicitly and report which harness-strategy combinations were run on which benchmarks. We will also add an ablation study breaking down results by harness and context strategy to identify the driving components (a toy sketch of such a grid follows these responses). The 8 benchmarks were selected as representative of the full set of 147; we will clarify this selection criterion and note that the toolkit supports evaluation on all registered benchmarks, with the reported results serving as a demonstration on key tasks. revision: yes
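
For concreteness, a toy version of the promised harness-by-strategy grid is sketched below; the harness, strategy, and benchmark names are an invented or partial subset, and evaluate() is a stub rather than the toolkit's real entry point.

```python
# Minimal sketch of a harness x context-strategy ablation grid. Names and the
# evaluate() stub are hypothetical placeholders, not BioMedArena's actual API.
import random
from itertools import product
from typing import Dict, Tuple

HARNESSES = ["react", "plan-act", "mutual-evolve"]            # invented subset of the 6 harnesses
STRATEGIES = ["full-history", "sliding-window", "workspace"]  # invented subset of the 6 strategies
BENCHMARKS = ["MedXpertQA-Text", "LAB-Bench-2", "SuperChem"]  # 3 of the 8 reported benchmarks

def evaluate(harness: str, strategy: str, benchmark: str, seed: int = 0) -> float:
    """Stub standing in for a real evaluation run; returns a fake accuracy."""
    return round(random.Random(hash((harness, strategy, benchmark, seed))).uniform(40, 80), 1)

def ablation() -> Dict[Tuple[str, str], float]:
    """Mean accuracy per (harness, strategy) cell, averaged over benchmarks."""
    table = {}
    for harness, strategy in product(HARNESSES, STRATEGIES):
        scores = [evaluate(harness, strategy, b) for b in BENCHMARKS]
        table[(harness, strategy)] = sum(scores) / len(scores)
    return table

for (harness, strategy), acc in sorted(ablation().items(), key=lambda kv: -kv[1]):
    print(f"{harness:>14} + {strategy:<15} mean acc = {acc:.1f}")
```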

Circularity Check

0 steps flagged

No circularity: toolkit release with empirical SOTA claims, no self-referential derivations

full rationale

The paper describes an open-source toolkit that decouples six evaluation layers and provides harnesses for biomedical agents, reporting an average +15.03 pp lift over prior SOTA on 8 benchmarks. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction are present. The SOTA claims are empirical outcomes from the provided implementations rather than self-definitional or self-citation load-bearing steps. The work is self-contained as a software and benchmarking contribution evaluated against external benchmarks, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an engineering release of an evaluation toolkit, the paper introduces no free parameters, mathematical axioms, or new invented entities; it aggregates existing benchmarks and tools under a new organizational structure.

pith-pipeline@v0.9.0 · 5557 in / 1230 out tokens · 23755 ms · 2026-05-08T10:19:15.740706+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Introducing Claude Opus 4.5

    Anthropic. Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, 2025

  2. [2]

    Introducing Claude Sonnet 4.5

    Anthropic. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, 2025

  3. [3]

    Introducing Claude Opus 4.6

    Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026

  4. [4]

    Claude Sonnet 4.6: Research model card

    Anthropic. Claude Sonnet 4.6: Research model card. https://www.anthropic.com/research/claude-sonnet-4-6, 2026

  5. [5]

    Trinity-Large: An open-weight reasoning model from Arcee AI

    Arcee AI. Trinity-Large: An open-weight reasoning model from Arcee AI. https://huggingface.co/arcee-ai, 2025

  6. [6]

    HealthBench: Evaluating large language models towards improved human health. https://github.com/openai/healthbench, 2025

    Akshay Arora et al. HealthBench: Evaluating large language models towards improved human health. https://github.com/openai/healthbench, 2025. OpenAI

  7. [7]

    MedHELM: Holistic evaluation of large language models for medical tasks

    Suhana Bedi et al. MedHELM: Holistic evaluation of large language models for medical tasks. arXiv preprint, 2025. Stanford CRFM; also appears in Nature Medicine 2026

  8. [8]

    Benchmarking large language models on answering and explaining challenging medical questions

    Hanjie Chen et al. Benchmarking large language models on answering and explaining challenging medical questions. arXiv preprint arXiv:2402.18060, 2024. Medbullets benchmark; op4 = 4-option subset

  9. [9]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning (ICML), 2024

  10. [10]

    Openseeker: Democratizing frontier search agents by fully open-sourcing training data.arXiv preprint arXiv:2603.15594, 2026

    Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, and Siheng Chen. Openseeker: Democratizing frontier search agents by fully open-sourcing training data.arXiv preprint arXiv:2603.15594, 2026

  11. [11]

    Edison literature high: A PaperQA3-backed deep research agent for biomedical literature

    Edison Scientific. Edison literature high: A PaperQA3-backed deep research agent for biomedical literature. https://edisonscientific.com/articles/edison-literature-agent, 2026. Model release literature-20260216-high, February 2026

  12. [12]

    LAB-Bench 2

    FutureHouse. LAB-Bench 2. https://huggingface.co/datasets/futurehouse/labbench2, 2025. Gated dataset; successor to LAB-Bench [21]

  13. [13]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation

  14. [14]

    Introducing Gemini 3 Flash

    Google. Introducing Gemini 3 Flash. https://blog.google/products/gemini/gemini-3-flash/, 2026

  15. [15]

    Gemini 3.1 Pro

    Google DeepMind. Gemini 3.1 Pro. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026

  16. [16]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024

  17. [17]

    Biomni: A general-purpose biomedical AI agent.bioRxiv preprint, 2025

    Kexin Huang et al. Biomni: A general-purpose biomedical AI agent.bioRxiv preprint, 2025. Stanford

  18. [18]

    InternLM2-Protein-7B: A protein language model.arXiv preprint arXiv:2406.05540, 2024

    InternLM2-Protein Authors. InternLM2-Protein-7B: A protein language model.arXiv preprint arXiv:2406.05540, 2024. Reported state-of-the-art on ProteinLMBench

  19. [19]

    MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents.arXiv preprint arXiv:2501.14654, 2024

    Yixing Jiang et al. MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents.arXiv preprint arXiv:2501.14654, 2024. Stanford ML Group

  20. [20]

    Holistic agent leaderboard: The missing infrastructure for AI agent evaluation.arXiv preprint arXiv:2510.11977, 2025

    Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation.arXiv preprint arXiv:2510.11977, 2025

  21. [21]

    LAB-Bench: Measuring capabilities of language models for biology research

    Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB-Bench: Measuring capabilities of language models for biology research. arXiv preprint arXiv:2407.10362, 2024. FutureHouse

  22. [22]

    Holistic evaluation of language models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

  23. [23]

    MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025

    MedReason Authors. MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025. Reported state-of-the-art on Medbullets (op4)

  24. [24]

    Muse Spark (Meta): reported HealthBench Hard score

    Meta AI. Muse Spark (Meta): reported HealthBench Hard score. https://venturebeat.com/technology/goodbye-llama-meta-launches-new-proprietary-ai-model-muse-spark-first-since…

  25. [25]

    Third-party report of HealthBench Hard score for Meta’s Muse Spark model

  26. [26]

    MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification

    MiroThinker Authors. MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification. https://arxiv.org/pdf/2603.15726, 2026. Reported state-of-the-art on the SuperChem text-only subset

  27. [27]

    BixBench: A comprehensive benchmark for LLM-based agents in computational biology.arXiv preprint, 2025

    Ludovico Mitchener et al. BixBench: A comprehensive benchmark for LLM-based agents in computational biology.arXiv preprint, 2025. FutureHouse

  28. [28]

    CrewAI: Framework for orchestrating role-playing, autonomous AI agents

    João Moura and CrewAI contributors. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/joaomdmoura/crewai, 2023. Software framework

  29. [29]

    NVIDIA Nemotron-3 Super 120B-A12B: A mixture-of-experts reasoning model

    NVIDIA. NVIDIA Nemotron-3 Super 120B-A12B: A mixture-of-experts reasoning model. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, 2025

  30. [30]

    GPT-5.4 model documentation

    OpenAI. GPT-5.4 model documentation. https://developers.openai.com/api/docs/models/gpt-5.4, 2025

  31. [31]

    GPT-5.5 system card

    OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card ,

  32. [32]

    SUPERChem: A benchmark for advanced chemical reasoning.arXiv preprint arXiv:2512.01274, 2025

    Peking University Chemistry Group. SUPERChem: A benchmark for advanced chemical reasoning.arXiv preprint arXiv:2512.01274, 2025. 500-question chemistry benchmark; reports GPT-5 (High) at 38.5%

  33. [33]

    Humanity’s last exam

    Long Phan et al. Humanity’s last exam. https://lastexam.ai/, 2025. Center for AI Safety; Scale AI

  34. [34]

    INTELLECT-3.1: An open reasoning model from Prime Intellect

    Prime Intellect. INTELLECT-3.1: An open reasoning model from Prime Intellect. https://huggingface.co/PrimeIntellect/INTELLECT-3.1, 2026

  35. [35]

    ProteinLMBench: A benchmark for protein language models

    ProteinLMBench Authors. ProteinLMBench: A benchmark for protein language models. https://huggingface.co/datasets/tsynbio/ProteinLMBench, 2024. Bibliographic details to be confirmed

  36. [36]

    Qwen3-235B: An open-weight mixture-of-experts model from Qwen Team, Alibaba. https://huggingface.co/Qwen, 2026

    Qwen Team, Alibaba. Qwen3-235B: An open-weight mixture-of-experts model from Qwen Team, Alibaba. https://huggingface.co/Qwen, 2026

  37. [37]

    AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024

  38. [38]

    MedXpertQA Text: Gemini 3.1 Pro reported state-of-the-art

    Third-party report. MedXpertQA Text: Gemini 3.1 Pro reported state-of-the-art. https://medium.com/@mrAryanKumar/5-surprising-truths-about-metas-14-billion-muse-spark-comeback-1efe8f76cc28

  39. [39]

    Reported third-party SOTA for Gemini 3.1 Pro on MedXpertQA text-only subset

  40. [40]

    Inspect: A framework for large language model evaluations

    UK AI Safety Institute. Inspect: A framework for large language model evaluations. https://inspect.aisi.org.uk/, 2024. Open-source evaluation framework

  41. [41]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  42. [42]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155, 2023

  43. [43]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  44. [44]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  45. [45]

    GLM-4.5: An open-source foundation model from Zhipu AI

    Zhipu AI. GLM-4.5: An open-source foundation model from Zhipu AI. https://huggingface.co/zai-org/GLM-4.5, 2025

  46. [46]

    MedXpertQA: Benchmarking expert-level medical reasoning and understanding

    Yuxin Zuo et al. MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint, 2024. Tsinghua C3I