Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
hub
Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
Medmarks introduces 30 open benchmarks for medical LLM tasks and evaluates 61 models, finding frontier reasoning models lead while medically fine-tuned ones outperform generalists and all show answer-order bias.
MultiDx integrates evidence from web search, SOAP cases, and clinical databases in a two-stage process to improve LLM diagnostic reasoning and alignment with clinical trajectories.
Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
citing papers explorer
-
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
-
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
-
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
-
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks
Medmarks introduces 30 open benchmarks for medical LLM tasks and evaluates 61 models, finding frontier reasoning models lead while medically fine-tuned ones outperform generalists and all show answer-order bias.
-
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning
MultiDx integrates evidence from web search, SOAP cases, and clinical databases in a two-stage process to improve LLM diagnostic reasoning and alignment with clinical trajectories.
-
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.