MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering

Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu · 2022 · arXiv 2203.14371

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

Automatic Replication of LLM Mistakes in Medical Conversations

cs.CL · 2025-12-24 · unverdicted · novelty 7.0

MedMistake automatically generates 3,390 single-shot QA pairs capturing LLM mistakes in medical conversations, with expert validation on a 211-question subset showing performance differences among 12 frontier models.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0.92 on dev and 0.90 on test.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

citing papers explorer

Showing 9 of 9 citing papers.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences? · cs.AI · 2026-04-12 · unverdicted · none · ref 32
Automatic Replication of LLM Mistakes in Medical Conversations · cs.CL · 2025-12-24 · unverdicted · none · ref 13
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? · cs.CL · 2026-05-15 · unverdicted · none · ref 44
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation · cs.AI · 2026-05-10 · unverdicted · none · ref 32
Compared to What? Baselines and Metrics for Counterfactual Prompting · cs.CL · 2026-05-01 · conditional · none · ref 125
Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach · cs.LG · 2026-04-13 · unverdicted · none · ref 5
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support · cs.AI · 2026-05-21 · unverdicted · none · ref 6
Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation · cs.CL · 2026-05-21 · unverdicted · none · ref 30
Galactica: A Large Language Model for Science · cs.CL · 2022-11-16 · unverdicted · none · ref 220
