Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Nori, H · 2023 · arXiv 2311.16452

21 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy can drop to below one-tenth of its original value when MedQA problems are posed as sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.

TextGrad: Automatic "Differentiation" via Text

cs.CL · 2024-06-11 · unverdicted · novelty 7.0

TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.

PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

PrivScope enforces task-scoped disclosure at the local-cloud boundary in hybrid agents, eliminating profile leakage and halving re-identification risk on medical workflows while preserving task success.

AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

cs.AI · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

A large real-world deployment found conversational AI agents for everyday symptom assessment to be more accurate than clinicians, with structured interviewing improving their accuracy.

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

cs.IR · 2026-04-12 · unverdicted · novelty 6.0

Small open LLMs produce highly variable medical answers even at low temperature, with self-agreement at most 0.20 and 87-97% unique outputs per model across 10 runs.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

cs.CL · 2024-12-25 · unverdicted · novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

Capabilities of Gemini Models in Medicine

cs.AI · 2024-04-29 · unverdicted · novelty 6.0

Med-Gemini sets new state-of-the-art results on 10 of 14 medical benchmarks, including 91.1% accuracy on MedQA (USMLE), improves over GPT-4V by an average relative margin of 44.5% on multimodal benchmarks, and surpasses humans on medical text summarization.

Can an LLM Learn Preferences from Choice Data?

econ.GN · 2024-01-14 · unverdicted · novelty 6.0

LLMs show improving recommendation accuracy with more observed choices under the disappointment aversion model, but learning success is heterogeneous across models and preference parameters.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

Medical Reasoning with Large Language Models: A Survey and MR-Bench

cs.CL · 2026-03-17 · accept · novelty 5.0

LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

cs.CL · 2025-09-29 · conditional · novelty 5.0

MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings, and revealing more domain heterogeneity, than accuracy alone.

Making Prompts First-Class Citizens for Adaptive LLM Pipelines

cs.DB · 2025-08-07 · unverdicted · novelty 5.0

SPEAR proposes structured prompt views, runtime adaptive refinement, and policy rules to make prompts first-class, versioned, and evolvable components in complex LLM applications.

GPT-4o System Card

cs.CL · 2024-10-25 · unverdicted · novelty 5.0

GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.

Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

cs.CL · 2025-02-11 · unverdicted · novelty 4.0

APP is a multi-turn LLM framework for medical dialogue that combines empathetic questioning, Bayesian active learning, and guideline-based reasoning, outperforming baselines on a new simulated-patient benchmark in accuracy, uncertainty reduction, and user experience.

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

cs.AI · 2025-03-31 · unverdicted · novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

cs.CL · 2026-03-05

citing papers explorer

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
cs.CV · 2026-05-19 · accept · none · ref 44

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
cs.HC · 2024-05-13 · conditional · none · ref 17

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
cs.CL · 2026-04-16 · unverdicted · none · ref 5

TextGrad: Automatic "Differentiation" via Text
cs.CL · 2024-06-11 · unverdicted · none · ref 66

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
cs.AI · 2026-05-19 · unverdicted · none · ref 18

PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems
cs.CR · 2026-05-15 · unverdicted · none · ref 41

AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems
cs.LG · 2026-05-09 · unverdicted · none · ref 3

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment
cs.AI · 2026-05-05 · unverdicted · none · ref 6 · 2 links

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework
cs.IR · 2026-04-12 · unverdicted · none · ref 17

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL · 2024-12-25 · unverdicted · none · ref 47

Capabilities of Gemini Models in Medicine
cs.AI · 2024-04-29 · unverdicted · none · ref 179

Can an LLM Learn Preferences from Choice Data?
econ.GN · 2024-01-14 · unverdicted · none · ref 27

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
cs.AI · 2026-05-11 · unverdicted · none · ref 45 · 2 links

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
cs.AI · 2026-05-07 · unverdicted · none · ref 51

Medical Reasoning with Large Language Models: A Survey and MR-Bench
cs.CL · 2026-03-17 · accept · none · ref 71

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
cs.CL · 2025-09-29 · conditional · none · ref 17

Making Prompts First-Class Citizens for Adaptive LLM Pipelines
cs.DB · 2025-08-07 · unverdicted · none · ref 7

GPT-4o System Card
cs.CL · 2024-10-25 · unverdicted · none · ref 40

Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning
cs.CL · 2025-02-11 · unverdicted · none · ref 2

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
cs.AI · 2025-03-31 · unverdicted · none · ref 183

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
cs.CL · 2026-03-05 · unreviewed · ref 16
