hub

Can generalist foundation models outcompete special-purpose tuning? case study in medicine

Can generalist foundation models outcompete special-purpose tuning? A case study in medicine , author= · 2023 · arXiv 2311.16452

25 Pith papers cite this work. Polarity classification is still indexing.

25 Pith papers citing it

read on arXiv browse 25 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.

TextGrad: Automatic "Differentiation" via Text

cs.CL · 2024-06-11 · unverdicted · novelty 7.0

TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.

MedEvoEval: Evaluating Continual Evolution of Doctor Agents through Simulated Clinical Episodes

cs.AI · 2026-06-27 · unverdicted · novelty 6.0

MedEvoEval is an executable longitudinal evaluation framework that converts medical cases into action-gated simulated episodes to track how doctor agents evolve decision-making, resource use, and experience across multiple encounters.

When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries

cs.CY · 2026-05-26 · unverdicted · novelty 6.0

MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.

A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation

eess.IV · 2026-05-25 · unverdicted · novelty 6.0

PulmoFoundation achieves 92.3% average AUC on 32 lung pathology tasks in prospective validation and raises pathologist accuracy from 83.8% to 91.7% in a crossover RCT.

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.

PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

PrivScope enforces task-scoped disclosure at the local-cloud boundary in hybrid agents, eliminating profile leakage and halving re-identification risk on medical workflows while preserving task success.

AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

cs.AI · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Large real-world deployment found conversational AI agents for everyday symptom assessment more accurate than clinicians and improved by structured interviewing.

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

cs.IR · 2026-04-12 · unverdicted · novelty 6.0

Small open LLMs produce highly variable medical answers even at low temperature, with self-agreement at most 0.20 and 87-97% unique outputs per model across 10 runs.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

cs.CL · 2024-12-25 · unverdicted · novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

Capabilities of Gemini Models in Medicine

cs.AI · 2024-04-29 · unverdicted · novelty 6.0

Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

Can an LLM Learn Preferences from Choice Data?

econ.GN · 2024-01-14 · unverdicted · novelty 6.0

LLMs show improving recommendation accuracy with more observed choices under the disappointment aversion model, but learning success is heterogeneous across models and preference parameters.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

Medical Reasoning with Large Language Models: A Survey and MR-Bench

cs.CL · 2026-03-17 · accept · novelty 5.0

LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

cs.CL · 2025-09-29 · conditional · novelty 5.0

MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing domain heterogeneity than accuracy alone.

Making Prompts First-Class Citizens for Adaptive LLM Pipelines

cs.DB · 2025-08-07 · unverdicted · novelty 5.0

SPEAR proposes structured prompt views, runtime adaptive refinement, and policy rules to make prompts first-class, versioned, and evolvable components in complex LLM applications.

GPT-4o System Card

cs.CL · 2024-10-25 · unverdicted · novelty 5.0

GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

cs.CL · 2026-05-28 · unverdicted · novelty 4.0

SURGENT is a multi-agent surgical assistance system with novel memory management that outperforms baseline LLMs on case analysis, plan simulation, safety monitoring, risk assessment, and rehabilitation guidance.

Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

cs.CL · 2025-02-11 · unverdicted · novelty 4.0

APP is a multi-turn LLM framework for medical dialogue that combines empathetic questioning, Bayesian active learning, and guideline-based reasoning, outperforming baselines on a new simulated-patient benchmark in accuracy, uncertainty reduction, and user experience.

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

cs.AI · 2025-03-31 · unverdicted · novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

citing papers explorer

Showing 6 of 6 citing papers after filters.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments cs.HC · 2024-05-13 · conditional · none · ref 17
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
TextGrad: Automatic "Differentiation" via Text cs.CL · 2024-06-11 · unverdicted · none · ref 66
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs cs.CL · 2024-12-25 · unverdicted · none · ref 47
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 179
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
Can an LLM Learn Preferences from Choice Data? econ.GN · 2024-01-14 · unverdicted · none · ref 27
LLMs show improving recommendation accuracy with more observed choices under the disappointment aversion model, but learning success is heterogeneous across models and preference parameters.
GPT-4o System Card cs.CL · 2024-10-25 · unverdicted · none · ref 40
GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.

Can generalist foundation models outcompete special-purpose tuning? case study in medicine

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer