NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
hub
Can generalist foundation models outcompete special-purpose tuning? case study in medicine
25 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
MedEvoEval is an executable longitudinal evaluation framework that converts medical cases into action-gated simulated episodes to track how doctor agents evolve decision-making, resource use, and experience across multiple encounters.
MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.
PulmoFoundation achieves 92.3% average AUC on 32 lung pathology tasks in prospective validation and raises pathologist accuracy from 83.8% to 91.7% in a crossover RCT.
AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.
PrivScope enforces task-scoped disclosure at the local-cloud boundary in hybrid agents, eliminating profile leakage and halving re-identification risk on medical workflows while preserving task success.
AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.
Large real-world deployment found conversational AI agents for everyday symptom assessment more accurate than clinicians and improved by structured interviewing.
Small open LLMs produce highly variable medical answers even at low temperature, with self-agreement at most 0.20 and 87-97% unique outputs per model across 10 runs.
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
LLMs show improving recommendation accuracy with more observed choices under the disappointment aversion model, but learning success is heterogeneous across models and preference parameters.
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing domain heterogeneity than accuracy alone.
SPEAR proposes structured prompt views, runtime adaptive refinement, and policy rules to make prompts first-class, versioned, and evolvable components in complex LLM applications.
GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.
SURGENT is a multi-agent surgical assistance system with novel memory management that outperforms baseline LLMs on case analysis, plan simulation, safety monitoring, risk assessment, and rehabilitation guidance.
APP is a multi-turn LLM framework for medical dialogue that combines empathetic questioning, Bayesian active learning, and guideline-based reasoning, outperforming baselines on a new simulated-patient benchmark in accuracy, uncertainty reduction, and user experience.
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
citing papers explorer
-
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
Capabilities of Gemini Models in Medicine
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
-
Can an LLM Learn Preferences from Choice Data?
LLMs show improving recommendation accuracy with more observed choices under the disappointment aversion model, but learning success is heterogeneous across models and preference parameters.
-
GPT-4o System Card
GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.