MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering

Pal, A · 2022 · arXiv 2203.14371

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

Automatic Replication of LLM Mistakes in Medical Conversations

cs.CL · 2025-12-24 · unverdicted · novelty 7.0

MedMistake automatically generates 3,390 single-shot QA pairs capturing LLM mistakes in medical conversations, with expert validation on a 211-question subset showing performance differences among 12 frontier models.

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

LLM evaluators reach clinician-level agreement on a new German medical benchmark but fail to abstain on difficult items and show lineage-dependent scoring biases.

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

cs.AI · 2026-06-02 · conditional · novelty 6.0

Systematic tests of LLM contamination detectors across 27 models show frequent failures from distribution shift and scale, concluding statistical methods cannot replace transparent data provenance.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

Fine-tuning a Spanish biomedical encoder on Gemini-generated synthetic data for multiple languages yields a bi-encoder that matches or exceeds BioBERT-ST on clinical code retrieval metrics, with further gains from cross-encoder reranking on most languages.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0.92 on dev and 0.90 on test.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

cs.AI · 2026-05-23 · unverdicted · novelty 4.0

MDIA, a specialty-routed 7-node multi-agent system, reports 0.6272 accuracy on 525 HealthBench Professional cases using GPT-5.4, outperforming the ChatGPT for Clinicians baseline by 3.72 points and attributing the lift to architectural components.
