hub

Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning

· 2025 · arXiv 2503.07459

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 baseline 1 dataset 1

citation-polarity summary

background 2 baseline 1

representative citing papers

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

cs.CV · 2025-09-26 · unverdicted · novelty 7.0

Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

cs.CV · 2026-05-18 · conditional · novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

Medmarks introduces 30 open benchmarks for medical LLM tasks and evaluates 61 models, finding frontier reasoning models lead while medically fine-tuned ones outperform generalists and all show answer-order bias.

MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

MultiDx integrates evidence from web search, SOAP cases, and clinical databases in a two-stage process to improve LLM diagnostic reasoning and alignment with clinical trajectories.

From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

q-bio.QM · 2026-04-07 · unverdicted · novelty 5.0

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

cs.CL · 2026-02-13 · unverdicted · novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

citing papers explorer

Showing 10 of 10 citing papers.

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks cs.CV · 2025-09-26 · unverdicted · none · ref 27
Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV · 2026-05-18 · conditional · none · ref 54
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? cs.CL · 2026-05-15 · unverdicted · none · ref 53
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks cs.CL · 2026-05-02 · unverdicted · none · ref 4
Medmarks introduces 30 open benchmarks for medical LLM tasks and evaluates 61 models, finding frontier reasoning models lead while medically fine-tuned ones outperform generalists and all show answer-order bias.
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning cs.CL · 2026-04-27 · unverdicted · none · ref 4
MultiDx integrates evidence from web search, SOAP cases, and clinical databases in a two-stage process to improve LLM diagnostic reasoning and alignment with clinical trajectories.
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning q-bio.QM · 2026-04-07 · unverdicted · none · ref 59
Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs cs.CL · 2026-02-13 · unverdicted · none · ref 57
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 107
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 119
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 262
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer