pith. machine review for the scientific record.

arxiv: 2605.06177 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 10:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords biomedical deep research agents · evaluation toolkit · agent harnesses · context management · biomedical benchmarks · open-source toolkit · foundation model integration · performance evaluation

The pith

BioMedArena decouples biomedical agent evaluation into six layers so new models integrate in minutes and deliver consistent performance gains across shared benchmarks and tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inconsistent results across biomedical agent papers stem from differing harnesses, tool registries, and evaluation setups that impose a heavy per-paper engineering cost. It supplies an open toolkit that splits evaluation into six independent layers: benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring. Users add new foundation models, benchmarks, or tools by writing short provider adapters. The release ships 147 benchmarks and 75 tools together with six harnesses that apply distinct context-management strategies. These harnesses, when paired with twelve backbone models, produce state-of-the-art results on eight representative benchmarks and raise average accuracy by 15.03 percentage points over prior best scores.

Core claim

BioMedArena decouples biomedical agent evaluation into six layers and supplies standardized access to 147 benchmarks and 75 tools across nine families. By registering short provider adapters, users can add models, benchmarks, or tools without extensive custom code. The release includes six harnesses implementing distinct context-management strategies; when applied to twelve backbone models these yield competitive research performance and state-of-the-art scores on eight representative benchmarks, raising average accuracy by 15.03 percentage points relative to earlier reported results.
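
To make the "few-line provider adapter" claim concrete, here is a minimal sketch of what such a registration could look like; the `register_model` decorator, `ModelAdapter` class, and registry below are hypothetical illustrations of the pattern, not the actual BioMedArena API.

```python
# Hypothetical sketch of the "few-line provider adapter" pattern described in
# the paper. The registry, decorator, and class names are illustrative
# assumptions, not taken from the BioMedArena codebase.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ModelAdapter:
    name: str
    generate: Callable[[str], str]  # prompt -> completion

MODEL_REGISTRY: Dict[str, ModelAdapter] = {}

def register_model(name: str):
    """Register a backbone so every harness and benchmark sees the same surface."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        MODEL_REGISTRY[name] = ModelAdapter(name=name, generate=fn)
        return fn
    return wrap

@register_model("my-new-backbone")
def my_backbone(prompt: str) -> str:
    # A real adapter would call the provider's API here; this is a stub.
    return f"[my-new-backbone] answer to: {prompt[:40]}"

print(MODEL_REGISTRY["my-new-backbone"].generate("Which gene encodes p53?"))
```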

What carries the argument

The six-layer decoupling of agent evaluation, consisting of benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring.
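
A toy composition of those six layers is sketched below. Each function is a deliberately naive stand-in (a one-item benchmark, one stub tool, substring scoring) meant only to show where each layer sits in the pipeline; none of the names come from the BioMedArena codebase.

```python
# Illustrative-only sketch of how the six decoupled layers could compose into a
# single evaluation run; every function is a hypothetical stand-in.
from typing import Callable, Dict, List

def load_benchmark(name: str) -> List[dict]:                                 # 1. benchmark loading
    return [{"question": "Which protein does BRCA1 partner with?", "answer": "BARD1"}]

def expose_tools(families: List[str]) -> Dict[str, Callable]:                # 2. tool exposure
    return {"pubmed_search": lambda q: f"stub literature hits for {q!r}"}

def select_tools(question: str, tools: Dict[str, Callable]) -> List[str]:    # 3. tool selection
    return list(tools)                                                        # naive: offer everything

def execute(question: str, chosen: List[str], model: Callable[[str], str]) -> str:  # 4. execution mode
    return model(question)                                                    # single-shot; ReAct-style loops would go here

def manage_context(history: List[str], limit: int = 5) -> List[str]:         # 5. context management
    return history[-limit:]                                                   # e.g. a sliding window

def score(prediction: str, gold: str) -> float:                              # 6. scoring
    return float(gold.lower() in prediction.lower())

def run(model: Callable[[str], str]) -> float:
    tools, history, results = expose_tools(["literature"]), [], []
    for item in load_benchmark("demo"):
        chosen = select_tools(item["question"], tools)
        history = manage_context(history + [item["question"]])
        results.append(score(execute(item["question"], chosen, model), item["answer"]))
    return sum(results) / len(results)

print(run(lambda q: "It heterodimerizes with BARD1."))  # -> 1.0 on the toy item
```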

If this is right

  • Adding a new foundation model requires only a few-line adapter instead of weeks of engineering.
  • Different models can be evaluated on identical benchmarks and tools for direct head-to-head comparison.
  • The six context-management strategies produce measurable gains in agent performance on biomedical tasks.
  • Researchers obtain per-task traces and configurations that support reproducible experiments.
  • New benchmarks or tools can be registered while preserving the same evaluation surface for all prior models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of the toolkit could reduce contradictory performance claims by enforcing a common evaluation surface across papers.
  • The decoupling pattern could be applied to agent evaluation in non-biomedical scientific domains if similar registries of tasks and tools are created.
  • Testing whether the reported performance lift remains when the harnesses are run on tasks outside the original eight benchmarks would clarify the generality of the gains.
  • Releasing the full set of traces allows direct inspection of where each context strategy succeeds or fails on individual biomedical questions.

Load-bearing premise

The chosen benchmarks, tools, and harnesses produce fair comparisons that are free of hidden biases from selection or implementation details.

What would settle it

Independent re-implementation of the six harnesses on the same eight benchmarks yields average scores no higher than the previous state-of-the-art.

Figures

Figures reproduced from arXiv: 2605.06177 by David A. Clifton, Fenglin Liu, Honghan Wu, Hongjian Zhou, Jiayuan Zhu, Jiazhen Pan, Jinge Wu, Junde Wu, Mingde Zeng, Sean Wu.

Figure 1. Performance gains under BioMedArena across 8 representative biomedical benchmarks …
Figure 2. Overview of the BioMedArena toolkit: a unified biomedical benchmark interface, a tool …
Figure 3. Dataflow of a biomedical deep research agent in BioMedArena. A natural-language …
Figure 4. MUTUAL-EVOLVE workflow. For each question, N parallel solvers at distinct temperatures first explore privately, then share findings through a Global Workspace at iteration T. The workspace has four typed banks (guide, tool, skill, error); solvers read it every K iterations and may terminate at different end iterations e_i. Once all solvers finish, each performs a text-only final confirmation over the full … (a workflow sketch follows this figure list)
Figure 5. Unpacks the LAB-Bench 2 Overall number into its 7 text-only subsets, plotting per-subset …
Figure 6. LAB-Bench 2 per-subset accuracy heatmap (%) across 6 backbones (2 Gemini, 4 Claude).
Figure 7. Tool registry organized by biomedical skill family. The 33 category tags group into 9 …
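
The Figure 4 caption reads as an iterate-share-confirm loop. Below is an editorial rendering of that loop in plain Python: the `GlobalWorkspace` class, the `solve_step` stub, and the share/read schedule are assumptions reconstructed from the caption, not the toolkit's actual MUTUAL-EVOLVE implementation.

```python
# Editorial sketch of the MUTUAL-EVOLVE loop described in the Figure 4 caption.
# Solver behavior and the workspace structure are illustrative assumptions.
import random
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GlobalWorkspace:
    # Four typed banks named in the caption: guide, tool, skill, error.
    banks: Dict[str, List[str]] = field(
        default_factory=lambda: {"guide": [], "tool": [], "skill": [], "error": []})

    def share(self, bank: str, note: str) -> None:
        self.banks[bank].append(note)

def solve_step(question: str, temperature: float, view: Dict[str, List[str]]) -> str:
    # Stub for one private reasoning/tool-use step of a solver.
    shared = sum(len(notes) for notes in view.values())
    return f"draft@T={temperature:.1f} ({shared} shared notes seen)"

def mutual_evolve(question: str, n_solvers: int = 3, share_at: int = 2,
                  read_every: int = 2, max_iters: int = 6) -> List[str]:
    workspace = GlobalWorkspace()
    temperatures = [0.2 + 0.3 * i for i in range(n_solvers)]                     # distinct temperatures
    end_iters = [random.randint(share_at, max_iters) for _ in range(n_solvers)]  # different end iterations e_i
    drafts = [""] * n_solvers

    for t in range(1, max_iters + 1):
        for i, temp in enumerate(temperatures):
            if t > end_iters[i]:
                continue                                          # this solver has already finished
            view = workspace.banks if t % read_every == 0 else {} # read the workspace every K iterations
            drafts[i] = solve_step(question, temp, view)
            if t == share_at:                                     # share findings at iteration T
                workspace.share("guide", f"solver {i}: {drafts[i]}")

    # Text-only final confirmation pass once every solver has finished.
    return [f"confirmed: {d}" for d in drafts]

print(mutual_evolve("Which mutation commonly drives imatinib resistance in CML?"))
```
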
original abstract

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents BioMedArena, an open-source toolkit that decouples biomedical deep research agent evaluation into six layers (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring). It registers 147 benchmarks and 75 tools across 9 families, supplies 6 agent harnesses with 6 context-management strategies applied to 12 backbone models, and reports that these achieve state-of-the-art results on 8 representative biomedical benchmarks with an average lift of +15.03 percentage points over prior SOTA. The toolkit reduces model integration to registering short provider adapters and releases configurations plus per-task traces.

Significance. If the performance claims hold under the standardized harnesses, the work would meaningfully lower the per-paper engineering tax in biomedical agent research and enable more reproducible head-to-head comparisons across foundation models. The open-source release with explicit adapters, benchmark/tool registries, and traces is a concrete community resource that directly addresses the reproducibility issues the authors identify.

major comments (3)
  1. [Abstract] Abstract: The headline SOTA claim (average +15.03 pp lift on 8 benchmarks) is load-bearing for the paper's contribution yet rests on an unverified assumption of baseline equivalence. The manuscript motivates the toolkit by noting that 'the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ,' but does not state whether the cited prior SOTA numbers were re-executed inside BioMedArena's six-layer framework (with identical tool exposure, context strategies, and scoring) or simply copied from the original heterogeneous papers. Without this, the lift cannot be attributed to the new agent designs rather than standardization alone.
  2. [Evaluation] Evaluation section (or §4/§5): No information is supplied on baseline selection criteria, statistical testing (e.g., paired t-tests or bootstrap confidence intervals on the per-benchmark lifts; a sketch of the latter follows the minor comments below), or variance across the 12 backbones. The claim of 'significantly improved performance' therefore lacks the quantitative support needed to substantiate superiority over prior work.
  3. [Agent harnesses] Agent harnesses and context-management strategies: The mapping from '6 agent harnesses with 6 context-management strategies' to '12 backbones' is stated without an accompanying ablation or per-strategy breakdown. It is therefore unclear which of the six layers (particularly context management) drives the reported gains and whether the improvements generalize across all 147 registered benchmarks or only the selected 8.
minor comments (3)
  1. A table or appendix listing all 147 benchmarks and 75 tools (or at least the 9 functional families with representative examples) would improve usability and allow readers to assess coverage.
  2. The GitHub repository link should be accompanied by a brief description of the exact commit or release tag used for the reported experiments to support reproducibility.
  3. [Abstract] Clarify the exact arithmetic behind '12 backbones' (6 harnesses × 6 strategies?) and whether every combination was evaluated on every benchmark.
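
To make the statistical test requested in major comment 2 concrete, the sketch below computes a percentile bootstrap confidence interval over per-benchmark lifts; the lift values are placeholders invented for illustration, not numbers taken from the paper.

```python
# Sketch of a percentile bootstrap CI on the mean per-benchmark lift
# (new score minus prior SOTA). All numbers below are illustrative placeholders.
import random
from typing import List, Tuple

def bootstrap_ci(lifts: List[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean per-benchmark lift."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(lifts, k=len(lifts))) / len(lifts)  # resample benchmarks with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical lifts in percentage points on 8 benchmarks (mean ~15 pp by construction).
example_lifts = [22.0, 9.5, 18.3, 4.1, 27.6, 11.2, 14.9, 12.6]
print(bootstrap_ci(example_lifts))  # prints a 95% percentile interval around the ~15 pp mean
```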

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments help clarify how to better substantiate our performance claims and the toolkit's contributions. We respond to each major comment below and indicate the revisions we will make.

point-by-point responses
  1. Referee: [Abstract] Abstract: The headline SOTA claim (average +15.03 pp lift on 8 benchmarks) is load-bearing for the paper's contribution yet rests on an unverified assumption of baseline equivalence. The manuscript motivates the toolkit by noting that 'the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ,' but does not state whether the cited prior SOTA numbers were re-executed inside BioMedArena's six-layer framework (with identical tool exposure, context strategies, and scoring) or simply copied from the original heterogeneous papers. Without this, the lift cannot be attributed to the new agent designs rather than standardization alone.

    Authors: We appreciate the referee highlighting this ambiguity. The prior SOTA numbers were sourced from the original publications rather than re-executed inside BioMedArena, which is standard practice but means the reported lift incorporates both our harness improvements and standardization effects. We will revise the abstract and evaluation section to explicitly state that comparisons use originally reported figures and to emphasize that the toolkit's primary value is enabling consistent, reproducible evaluations going forward. revision: yes

  2. Referee: [Evaluation] Evaluation section (or §4/§5): No information is supplied on baseline selection criteria, statistical testing (e.g., paired t-tests or bootstrap confidence intervals on the per-benchmark lifts), or variance across the 12 backbones. The claim of 'significantly improved performance' therefore lacks the quantitative support needed to substantiate superiority over prior work.

    Authors: We agree that these details are needed to support the claims. In the revised manuscript we will add: (i) explicit criteria for selecting the 8 benchmarks as representative of the 147 (covering diverse biomedical domains and task types), (ii) statistical tests including paired t-tests or bootstrap confidence intervals on the per-benchmark lifts, and (iii) variance information or per-backbone results across the 12 backbones. These additions will provide the required quantitative grounding. revision: yes

  3. Referee: [Agent harnesses] Agent harnesses and context-management strategies: The mapping from '6 agent harnesses with 6 context-management strategies' to '12 backbones' is stated without an accompanying ablation or per-strategy breakdown. It is therefore unclear which of the six layers (particularly context management) drives the reported gains and whether the improvements generalize across all 147 registered benchmarks or only the selected 8.

    Authors: The 12 backbones are the twelve foundation models evaluated under the 6 harnesses and their 6 context-management strategies; we will state this mapping explicitly and report which harness-strategy combinations were run on which benchmarks. We will also add an ablation study breaking down results by harness and context strategy to identify the driving components (a toy sketch of such a grid follows these responses). The 8 benchmarks were selected as representative of the full set of 147; we will clarify this selection criterion and note that the toolkit supports evaluation on all registered benchmarks, with the reported results serving as a demonstration on key tasks. revision: yes
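
For concreteness, a toy version of the promised harness-by-strategy grid is sketched below; the harness, strategy, and benchmark names are an invented or partial subset, and evaluate() is a stub rather than the toolkit's real entry point.

```python
# Minimal sketch of a harness x context-strategy ablation grid. Names and the
# evaluate() stub are hypothetical placeholders, not BioMedArena's actual API.
import random
from itertools import product
from typing import Dict, Tuple

HARNESSES = ["react", "plan-act", "mutual-evolve"]            # invented subset of the 6 harnesses
STRATEGIES = ["full-history", "sliding-window", "workspace"]  # invented subset of the 6 strategies
BENCHMARKS = ["MedXpertQA-Text", "LAB-Bench-2", "SuperChem"]  # 3 of the 8 reported benchmarks

def evaluate(harness: str, strategy: str, benchmark: str, seed: int = 0) -> float:
    """Stub standing in for a real evaluation run; returns a fake accuracy."""
    return round(random.Random(hash((harness, strategy, benchmark, seed))).uniform(40, 80), 1)

def ablation() -> Dict[Tuple[str, str], float]:
    """Mean accuracy per (harness, strategy) cell, averaged over benchmarks."""
    table = {}
    for harness, strategy in product(HARNESSES, STRATEGIES):
        scores = [evaluate(harness, strategy, b) for b in BENCHMARKS]
        table[(harness, strategy)] = sum(scores) / len(scores)
    return table

for (harness, strategy), acc in sorted(ablation().items(), key=lambda kv: -kv[1]):
    print(f"{harness:>14} + {strategy:<15} mean acc = {acc:.1f}")
```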

Circularity Check

0 steps flagged

No circularity: toolkit release with empirical SOTA claims, no self-referential derivations

full rationale

The paper describes an open-source toolkit that decouples six evaluation layers and provides harnesses for biomedical agents, reporting an average +15.03 pp lift over prior SOTA on 8 benchmarks. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction are present. The SOTA claims are empirical outcomes from the provided implementations rather than self-definitional or self-citation load-bearing steps. The work is self-contained as a software and benchmarking contribution evaluated against external benchmarks, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an engineering release of an evaluation toolkit, the paper introduces no free parameters, mathematical axioms, or new invented entities; it aggregates existing benchmarks and tools under a new organizational structure.

pith-pipeline@v0.9.0 · 5557 in / 1230 out tokens · 23755 ms · 2026-05-08T10:19:15.740706+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Introducing Claude Opus 4.5

    Anthropic. Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, 2025

  2. [2]

    Introducing Claude Sonnet 4.5

    Anthropic. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, 2025

  3. [3]

    Introducing Claude Opus 4.6

    Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026

  4. [4]

    Claude Sonnet 4.6: Research model card

    Anthropic. Claude Sonnet 4.6: Research model card. https://www.anthropic.com/research/claude-sonnet-4-6, 2026

  5. [5]

    Trinity-Large: An open-weight reasoning model from Arcee AI

    Arcee AI. Trinity-Large: An open-weight reasoning model from Arcee AI. https://huggingface.co/arcee-ai, 2025

  6. [6]

    HealthBench: Evaluating large language models towards improved human health. https://github.com/openai/healthbench, 2025

    Akshay Arora et al. HealthBench: Evaluating large language models towards improved human health. https://github.com/openai/healthbench, 2025. OpenAI

  7. [7]

    MedHELM: Holistic evaluation of large language models for medical tasks

    Suhana Bedi et al. MedHELM: Holistic evaluation of large language models for medical tasks. arXiv preprint, 2025. Stanford CRFM; also appears in Nature Medicine 2026

  8. [8]

    Benchmarking large language models on answering and explaining challenging medical questions

    Hanjie Chen et al. Benchmarking large language models on answering and explaining challenging medical questions. arXiv preprint arXiv:2402.18060, 2024. Medbullets benchmark; op4 = 4-option subset

  9. [9]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning (ICML), 2024

  10. [10]

    Openseeker: Democratizing frontier search agents by fully open-sourcing training data.arXiv preprint arXiv:2603.15594, 2026

    Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, and Siheng Chen. Openseeker: Democratizing frontier search agents by fully open-sourcing training data.arXiv preprint arXiv:2603.15594, 2026

  11. [11]

    Edison literature high: A PaperQA3-backed deep research agent for biomedical literature

    Edison Scientific. Edison literature high: A PaperQA3-backed deep research agent for biomedical literature. https://edisonscientific.com/articles/edison-literature-agent, 2026. Model release literature-20260216-high, February 2026

  12. [12]

    LAB-Bench 2

    FutureHouse. LAB-Bench 2. https://huggingface.co/datasets/futurehouse/labbench2, 2025. Gated dataset; successor to LAB-Bench [21]

  13. [13]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation

  14. [14]

    Introducing Gemini 3 Flash

    Google. Introducing Gemini 3 Flash. https://blog.google/products/gemini/gemini-3-flash/, 2026

  15. [15]

    Gemini 3.1 Pro

    Google DeepMind. Gemini 3.1 Pro. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026

  16. [16]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024

  17. [17]

    Biomni: A general-purpose biomedical AI agent.bioRxiv preprint, 2025

    Kexin Huang et al. Biomni: A general-purpose biomedical AI agent.bioRxiv preprint, 2025. Stanford

  18. [18]

    InternLM2-Protein-7B: A protein language model.arXiv preprint arXiv:2406.05540, 2024

    InternLM2-Protein Authors. InternLM2-Protein-7B: A protein language model.arXiv preprint arXiv:2406.05540, 2024. Reported state-of-the-art on ProteinLMBench

  19. [19]

    MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents.arXiv preprint arXiv:2501.14654, 2024

    Yixing Jiang et al. MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents.arXiv preprint arXiv:2501.14654, 2024. Stanford ML Group

  20. [20]

    Holistic agent leaderboard: The missing infrastructure for AI agent evaluation.arXiv preprint arXiv:2510.11977, 2025

    Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al. Holistic agent leaderboard: The missing infrastructure for ai agent evaluation.arXiv preprint arXiv:2510.11977, 2025

  21. [21]

    LAB-Bench: Measuring capabilities of language models for biology research

    Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB-Bench: Measuring capabilities of language models for biology research. arXiv preprint arXiv:2407.10362, 2024. FutureHouse

  22. [22]

    Holistic evaluation of language models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

  23. [23]

    MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025

    MedReason Authors. MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025. Reported state-of-the-art on Medbullets (op4)

  24. [24]

    Muse Spark (Meta): reported HealthBench Hard score

    Meta AI. Muse Spark (Meta): reported HealthBench Hard score. https://venturebeat.com/technology/goodbye-llama-meta-launches-new-proprietary-ai-model-muse-spark-first-since…

  25. [25]

    Third-party report of HealthBench Hard score for Meta’s Muse Spark model

  26. [26]

    MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification

    MiroThinker Authors. MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification. https://arxiv.org/pdf/2603.15726, 2026. Reported state-of-the-art on the SuperChem text-only subset

  27. [27]

    BixBench: A comprehensive benchmark for LLM-based agents in computational biology.arXiv preprint, 2025

    Ludovico Mitchener et al. BixBench: A comprehensive benchmark for LLM-based agents in computational biology.arXiv preprint, 2025. FutureHouse

  28. [28]

    CrewAI: Framework for orchestrating role-playing, autonomous AI agents

    João Moura and CrewAI contributors. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/joaomdmoura/crewai, 2023. Software framework

  29. [29]

    NVIDIA Nemotron-3 Super 120B-A12B: A mixture-of-experts reasoning model

    NVIDIA. NVIDIA Nemotron-3 Super 120B-A12B: A mixture-of-experts reasoning model. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, 2025

  30. [30]

    GPT-5.4 model documentation

    OpenAI. GPT-5.4 model documentation. https://developers.openai.com/api/docs/models/gpt-5.4, 2025

  31. [31]

    GPT-5.5 system card

    OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card ,

  32. [32]

    SUPERChem: A benchmark for advanced chemical reasoning.arXiv preprint arXiv:2512.01274, 2025

    Peking University Chemistry Group. SUPERChem: A benchmark for advanced chemical reasoning.arXiv preprint arXiv:2512.01274, 2025. 500-question chemistry benchmark; reports GPT-5 (High) at 38.5%

  33. [33]

    Humanity’s last exam

    Long Phan et al. Humanity’s last exam. https://lastexam.ai/, 2025. Center for AI Safety; Scale AI

  34. [34]

    INTELLECT-3.1: An open reasoning model from Prime Intellect

    Prime Intellect. INTELLECT-3.1: An open reasoning model from Prime Intellect. https://huggingface.co/PrimeIntellect/INTELLECT-3.1, 2026

  35. [35]

    ProteinLMBench: A benchmark for protein language models

    ProteinLMBench Authors. ProteinLMBench: A benchmark for protein language models. https://huggingface.co/datasets/tsynbio/ProteinLMBench, 2024. Bibliographic details to be confirmed

  36. [36]

    Qwen3-235B: An open-weight mixture-of-experts model from Qwen Team, Alibaba. https://huggingface.co/Qwen, 2026

    Qwen Team, Alibaba. Qwen3-235B: An open-weight mixture-of-experts model from Qwen Team, Alibaba. https://huggingface.co/Qwen, 2026

  37. [37]

    AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024

  38. [38]

    MedXpertQA Text: Gemini 3.1 Pro reported state-of-the-art

    Third-party report. MedXpertQA Text: Gemini 3.1 Pro reported state-of-the-art. https://medium.com/@mrAryanKumar/5-surprising-truths-about-metas-14-billion-muse-spark-comeback-1efe8f76cc28

  39. [39]

    Reported third-party SOTA for Gemini 3.1 Pro on MedXpertQA text-only subset

  40. [40]

    Inspect: A framework for large language model evaluations

    UK AI Safety Institute. Inspect: A framework for large language model evaluations. https://inspect.aisi.org.uk/, 2024. Open-source evaluation framework

  41. [41]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  42. [42]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155, 2023

  43. [43]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  44. [44]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  45. [45]

    GLM-4.5: An open-source foundation model from Zhipu AI

    Zhipu AI. GLM-4.5: An open-source foundation model from Zhipu AI. https://huggingface.co/zai-org/GLM-4.5, 2025

  46. [46]

    MedXpertQA: Benchmarking expert-level medical reasoning and understanding

    Yuxin Zuo et al. MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint, 2024. Tsinghua C3I