PubMedQA: A Dataset for Biomedical Research Question Answering

Bhuwan Dhingra; Qiao Jin; William W. Cohen; Xinghua Lu; Zhengping Liu

arXiv: 1909.06146 · v1 · pith:MUTIEFSHnew · submitted 2019-09-13 · cs.CL · cs.LG · q-bio.QM

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu

Pith reviewed 2026-05-23 18:39 UTC · model grok-4.3

classification cs.CL · cs.LG · q-bio.QM

keywords biomedical question answering, PubMedQA, yes/no/maybe answers, PubMed abstracts, research question answering, BioBERT, dataset, scientific reasoning

The pith

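PubMedQA is a dataset of roughly 273.5k biomedical research question instances (1k expert-annotated, 61.2k unlabeled, 211.3k artificially generated), each to be answered yes/no/maybe from a paper abstract.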

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PubMedQA to fill the gap in datasets that require models to answer real biomedical research questions by reasoning over quantitative details in abstracts. Each instance pairs a question taken from an article title with the abstract body (minus its conclusion) as context and derives the yes/no/maybe label from the abstract's conclusion section. It supplies 1k expert-annotated examples plus large numbers of unlabeled and artificially generated ones to support model training. The best reported model reaches 68.1 percent accuracy, above the majority baseline of 55.2 percent but well below single-human performance at 78.0 percent. This construction is intended to make a successful system perform inference over scientific results rather than surface-level retrieval.

Core claim

PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. It is built from PubMed abstracts by turning titles into questions, using each abstract with its conclusion removed as context, and summarizing the conclusion as a yes/no/maybe label. The released collection contains 1k expert-annotated instances, 61.2k unlabeled instances, and 211.3k artificially generated instances.

What carries the argument

The question-context-answer triple formed by taking a research title as the question, the abstract body without its conclusion as context, and the conclusion itself as the source of the yes/no/maybe label.
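As a rough illustration of this triple, the sketch below assembles one instance from a structured abstract. It assumes the abstract arrives as ordered (heading, text) pairs whose last section is the conclusion. `Instance` and `build_instance` are hypothetical names, not part of the PubMedQA release, and the section texts are placeholders rather than real abstract content.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

VALID_LABELS = {"yes", "no", "maybe"}


@dataclass
class Instance:
    question: str          # research question (article title or derived from it)
    context: str           # abstract with the conclusion removed
    long_answer: str       # the conclusion section
    label: Optional[str]   # "yes" / "no" / "maybe", or None when unlabeled


def build_instance(
    title: str,
    sections: List[Tuple[str, str]],
    label: Optional[str] = None,
) -> Instance:
    """Split a structured abstract into context and long answer.

    Assumption: the final (heading, text) pair is the conclusion.
    """
    if len(sections) < 2:
        raise ValueError("need at least one body section plus a conclusion")
    if label is not None and label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    body, conclusion = sections[:-1], sections[-1][1]
    context = " ".join(text for _, text in body)
    return Instance(title, context, conclusion, label)


# Placeholder section texts; only the question comes from the paper's abstract.
example = build_instance(
    "Do preoperative statins reduce atrial fibrillation after "
    "coronary artery bypass grafting?",
    [
        ("BACKGROUND", "<background text>"),
        ("METHODS", "<methods text>"),
        ("RESULTS", "<results text>"),
        ("CONCLUSIONS", "<conclusion text>"),
    ],
)
```

Leaving `label` as `None` mirrors the unlabeled portion of the collection, where the question, context, and long answer exist but no yes/no/maybe answer has been assigned.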

If this is right

Biomedical QA models can now be trained and measured specifically on the task of inferring answers from quantitative results in research abstracts.
Multi-phase fine-tuning of BioBERT plus bag-of-words statistics from the long answer improves accuracy on this task over standard fine-tuning.
The performance gap between the best model at 68.1 percent and human performance at 78 percent remains available for future methods to close.
The mixture of expert labels, unlabeled data, and generated instances supports both fully supervised and semi-supervised training regimes.
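For readers who want the reference points made concrete: the reported figures put the best model 9.9 percentage points below single-human accuracy (78.0 vs. 68.1) and 12.9 points above the majority baseline (68.1 vs. 55.2). The sketch below shows how accuracy and a majority-class baseline are typically computed. The function names and the toy labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter
from typing import List, Tuple


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def majority_baseline(gold: List[str]) -> Tuple[str, float]:
    """Most frequent gold label and the accuracy of always predicting it."""
    if not gold:
        raise ValueError("gold must be non-empty")
    label, count = Counter(gold).most_common(1)[0]
    return label, count / len(gold)


# Toy labels for illustration only.
toy_gold = ["yes", "yes", "no", "maybe"]
toy_pred = ["yes", "no", "no", "maybe"]
toy_accuracy = accuracy(toy_gold, toy_pred)
toy_majority = majority_baseline(toy_gold)
```

By this definition, a 55.2 percent majority baseline means the most frequent label accounts for 55.2 percent of the evaluated instances.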

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could serve as a testbed for whether language models can extract and reconcile conflicting quantitative findings across multiple abstracts.
Training on PubMedQA might improve downstream systems that summarize or answer questions about clinical trial results.
The yes/no/maybe format naturally captures the uncertainty present in many biomedical findings and could be extended to graded confidence scores.

Load-bearing premise

The conclusion section of each abstract reliably answers the research question derived from its title.

What would settle it

A re-annotation study in which independent experts assign different yes/no/maybe labels to more than 20 percent of the 1k expert-annotated instances would show the labels do not consistently reflect the abstracts.
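A minimal sketch of that check, assuming the original and re-annotated labels are available as two aligned lists. The 20 percent threshold comes from the sentence above; `disagreement_rate` and `exceeds_threshold` are hypothetical helper names and the labels are toy values.

```python
from typing import List

THRESHOLD = 0.20  # from the criterion above: more than 20 percent relabeled


def disagreement_rate(original: List[str], relabeled: List[str]) -> float:
    """Fraction of instances whose label changed under re-annotation."""
    if len(original) != len(relabeled) or not original:
        raise ValueError("label lists must be non-empty and equally long")
    changed = sum(a != b for a, b in zip(original, relabeled))
    return changed / len(original)


def exceeds_threshold(original: List[str], relabeled: List[str]) -> bool:
    """True when the disagreement rate is strictly above the threshold."""
    return disagreement_rate(original, relabeled) > THRESHOLD


# Toy labels for illustration only.
toy_original = ["yes", "no", "maybe", "yes", "no"]
toy_relabeled = ["yes", "no", "no", "no", "no"]
toy_rate = disagreement_rate(toy_original, toy_relabeled)
```

On the 1k expert-annotated instances, crossing the threshold would mean more than 200 changed labels.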

Original abstract

We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best performing model, multi-phase fine-tuning of BioBERT with long answer bag-of-word statistics as additional supervision, achieves 68.1% accuracy, compared to single human performance of 78.0% accuracy and majority-baseline of 55.2% accuracy, leaving much room for improvement. PubMedQA is publicly available at https://pubmedqa.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PubMedQA releases a new biomedical QA dataset at decent scale, but its yes/no/maybe labels rest on the unverified claim that abstract conclusions answer the title questions.

The core contribution is a dataset of 1k expert-labeled PubMed QA pairs plus larger unlabeled and artificial sets. Questions come from titles, context is the abstract minus its conclusion, and the label summarizes that conclusion. The paper positions this as the first QA resource that forces reasoning over quantitative biomedical text rather than just retrieval. The authors report that a BioBERT multi-phase fine-tune with bag-of-words supervision on the long answer reaches 68.1% accuracy, against a 78.0% single-human baseline and a 55.2% majority baseline. The release itself is useful for anyone training models on scientific literature.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PubMedQA, a dataset for biomedical research question answering consisting of 1k expert-annotated, 61.2k unlabeled, and 211.3k artificially generated instances. Each instance comprises a question derived from a PubMed article title, the abstract (minus conclusion) as context, the conclusion as the long answer, and a yes/no/maybe label that summarizes the conclusion. The authors present baselines showing that multi-phase fine-tuning of BioBERT augmented with long-answer bag-of-words statistics achieves 68.1% accuracy, against a single-human baseline of 78.0% and a majority baseline of 55.2%.

Significance. If the labels prove reliable, the dataset supplies a needed benchmark for QA models that must perform non-trivial reasoning over real biomedical research abstracts, especially their quantitative content. The public release, the three-tier scale, and the explicit performance gap between the strongest model and humans are concrete strengths that would support follow-on work.

major comments (3)

[§3] §3 (Dataset Construction): The yes/no/maybe labels for both the 1k expert-annotated set and the 211.3k artificial instances are produced by summarizing the conclusion section, which the text states 'presumably' answers the title-derived question. No quantitative validation of this alignment (e.g., fraction of cases in which the conclusion directly supports, contradicts, or is silent on the question) is reported. Because this mapping defines the target labels, any systematic mismatch directly affects the reported 68.1% model accuracy, the 78.0% human baseline, and the claim that the task requires non-trivial reasoning.
[§4] §4 (Experiments) and human-evaluation paragraph: No inter-annotator agreement figures or detailed labeling protocol for the 1k expert-annotated instances are supplied. This information is load-bearing for interpreting the 78.0% human performance number and for establishing that the task is well-defined.
[Results section] Results section and Table reporting the 68.1% figure: The manuscript provides no explicit description of the train/validation/test splits used for the artificial data or of any post-generation validation steps. Without these details it is impossible to rule out leakage or selection effects that could inflate the reported accuracy.
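To make the second comment concrete, one common agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes exactly two annotators labeling the same instances; the manuscript as summarized here does not say how many annotators were used or which statistic would be reported, so this is an illustration rather than the authors' protocol.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators over the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy annotations for illustration only.
toy_a = ["yes", "yes", "no", "no"]
toy_b = ["yes", "no", "no", "no"]
toy_kappa = cohens_kappa(toy_a, toy_b)
```

In the toy case the annotators agree on 3 of 4 instances, chance agreement is 0.5, and kappa is 0.5. For three or more annotators a statistic such as Fleiss' kappa would be the usual substitute.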

minor comments (2)

[Model description] The precise construction of the 'long answer bag-of-word statistics' feature used as additional supervision is described only at a high level; a short algorithmic sketch or pseudocode would improve reproducibility.
[Introduction] A small number of related biomedical QA or reasoning datasets are cited; adding one or two more recent references would better situate the contribution.
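As an indication of the kind of sketch the first minor comment asks for, here is one plausible reading of bag-of-words supervision: the long answer is turned into a multi-hot vector over a fixed vocabulary, and the model is trained to predict that vector with a binary cross-entropy term alongside the main yes/no/maybe objective. This is an assumption for illustration only; the manuscript's actual construction is not specified in this report, and every name below is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(long_answers: List[str], max_size: int) -> Dict[str, int]:
    """Map the most document-frequent tokens to vector indices."""
    doc_freq = Counter(
        tok for text in long_answers for tok in set(tokenize(text))
    )
    # Sort by frequency, then alphabetically, so indices are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i for i, (tok, _) in enumerate(ranked[:max_size])}


def bow_target(long_answer: str, vocab: Dict[str, int]) -> List[float]:
    """Multi-hot vector: 1.0 where a vocabulary word occurs in the answer."""
    target = [0.0] * len(vocab)
    for tok in tokenize(long_answer):
        if tok in vocab:
            target[vocab[tok]] = 1.0
    return target


def bow_loss(pred_probs: List[float], target: List[float]) -> float:
    """Mean binary cross-entropy between predicted word probabilities
    and the multi-hot target (assumed auxiliary loss)."""
    eps = 1e-12
    total = 0.0
    for p, t in zip(pred_probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(target)
```

Under this reading, the long answer is needed only to build training targets, so the context and question would remain the only inputs at inference time.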

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below and will revise the manuscript to address the concerns where possible.

Referee: [§3] §3 (Dataset Construction): The yes/no/maybe labels for both the 1k expert-annotated set and the 211.3k artificial instances are produced by summarizing the conclusion section, which the text states 'presumably' answers the title-derived question. No quantitative validation of this alignment (e.g., fraction of cases in which the conclusion directly supports, contradicts, or is silent on the question) is reported. Because this mapping defines the target labels, any systematic mismatch directly affects the reported 68.1% model accuracy, the 78.0% human baseline, and the claim that the task requires non-trivial reasoning.

Authors: We acknowledge that the manuscript describes the alignment as 'presumably' without providing quantitative validation of how often the conclusion directly supports, contradicts, or is silent on the title-derived question. The construction follows the conventional structure of PubMed abstracts, in which the conclusion is expected to address the research question. To strengthen the paper, we will add a quantitative analysis on a random sample of instances reporting these fractions in the revised version. revision: yes
Referee: [§4] §4 (Experiments) and human-evaluation paragraph: No inter-annotator agreement figures or detailed labeling protocol for the 1k expert-annotated instances are supplied. This information is load-bearing for interpreting the 78.0% human performance number and for establishing that the task is well-defined.

Authors: We agree that inter-annotator agreement and the labeling protocol details are important for interpreting the human baseline. These were collected but not reported in the original submission. We will include both the agreement statistics and a full description of the annotation protocol in the revised manuscript. revision: yes
Referee: [Results section] Results section and Table reporting the 68.1% figure: The manuscript provides no explicit description of the train/validation/test splits used for the artificial data or of any post-generation validation steps. Without these details it is impossible to rule out leakage or selection effects that could inflate the reported accuracy.

Authors: We will add explicit descriptions of the train/validation/test splits for the artificially generated instances and any post-generation validation steps performed, ensuring the revised manuscript allows readers to assess potential leakage or selection effects. revision: yes
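Once the splits are described, one simple leakage check a reader could run is to look for questions that appear in more than one split after light normalization. The sketch below assumes each split is available as a list of question strings; the split names and the normalization rule are illustrative choices, not details from the manuscript.

```python
import itertools
import re
from typing import Dict, List, Set, Tuple


def normalize(question: str) -> str:
    """Lowercase and drop punctuation so near-identical questions collide."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


def split_overlap(
    splits: Dict[str, List[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return normalized questions shared by each pair of splits."""
    keys = {name: {normalize(q) for q in qs} for name, qs in splits.items()}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for a, b in itertools.combinations(sorted(keys), 2):
        shared = keys[a] & keys[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Toy splits for illustration only.
toy_splits = {
    "train": ["Does X help Y?", "Is A linked to B?"],
    "dev": ["Can C predict D?"],
    "test": ["does x help y"],
}
toy_overlaps = split_overlap(toy_splits)
```

An empty result only rules out exact duplicates under this normalization. It does not rule out paraphrased questions or articles that share an abstract.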

Circularity Check

0 steps flagged

No significant circularity; dataset paper with explicit construction

This is a data-collection and benchmarking paper that introduces PubMedQA by explicitly defining questions from titles, contexts from abstracts (minus conclusions), long answers from conclusions, and yes/no/maybe labels as summaries of those conclusions. No mathematical derivation, predictive model, or first-principles result is claimed whose output reduces to its inputs by construction. The 'presumably' qualifier on label generation is an open methodological assumption, not a hidden self-definition. Reported model accuracies (e.g., 68.1%) are standard fine-tuning results on the new data, not tautological predictions. No self-citation chain supports a load-bearing uniqueness claim. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset construction paper. No free parameters, axioms, or invented entities are introduced; the central contribution is the collection and labeling process itself.

pith-pipeline@v0.9.0 · 5787 in / 998 out tokens · 17804 ms · 2026-05-23T18:39:07.416183+00:00 · methodology

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
cs.HC · 2024-05 · conditional · novelty 8.0
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences acros...

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR · 2026-05 · unverdicted · novelty 7.0
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
cs.CV · 2026-01 · conditional · novelty 7.0
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL · 2024-02 · unverdicted · novelty 7.0
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...

Capabilities of GPT-4 on Medical Challenge Problems
cs.CL · 2023-03 · unverdicted · novelty 7.0
GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
cs.CL · 2026-05 · unverdicted · novelty 6.0
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR · 2026-05 · unverdicted · novelty 6.0
PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
cs.AI · 2026-04 · unverdicted · novelty 6.0
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
cs.CL · 2026-04 · unverdicted · novelty 6.0
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
cs.AI · 2026-02 · unverdicted · novelty 6.0
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
cs.CL · 2025-10 · unverdicted · novelty 6.0
Attention-Shifting uses importance-aware suppression on unlearning data and retention enhancement on retained data via dual-loss optimization to achieve selective unlearning with better utility preservation than prior...

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks
cs.LG · 2025-02 · unverdicted · novelty 6.0
DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergen...

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL · 2024-12 · unverdicted · novelty 6.0
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

An Empirical Study of Mamba-based Language Models
cs.LG · 2024-06 · accept · novelty 6.0
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
cs.AI · 2026-05 · unverdicted · novelty 5.0
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
cs.CL · 2026-05 · unverdicted · novelty 5.0
Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0....

Gyan: An Explainable Neuro-Symbolic Language Model
cs.CL · 2026-05 · unverdicted · novelty 5.0
Gyan is a novel explainable neuro-symbolic language model that decouples language modeling from knowledge representation using rhetorical and semantic theories and reports superior performance on multiple datasets.

VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering
cs.IR · 2026-01 · unverdicted · novelty 5.0
VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.

CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment
cs.CR · 2024-10 · unverdicted · novelty 5.0
CoreGuard introduces a computation- and communication-efficient protocol claimed to deliver upper-bound security against model stealing for edge-deployed LLMs with negligible overhead.

Galactica: A Large Language Model for Science
cs.CL · 2022-11 · unverdicted · novelty 5.0
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning
cs.CL · 2026-05 · unverdicted · novelty 4.0
Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.

Gyan: An Explainable Neuro-Symbolic Language Model
cs.CL · 2026-05 · unverdicted · novelty 4.0
Gyan is a novel explainable non-transformer language model that achieves SOTA results on multiple datasets by mimicking human-like compositional context and world models.

MedGemma 1.5 Technical Report
cs.AI · 2026-04 · unverdicted · novelty 4.0
MedGemma 1.5 4B reports absolute gains of 11% on 3D MRI classification, 3% on 3D CT, 47% macro F1 on pathology slides, 35% IoU on anatomical localization, and 5-22% on clinical QA tasks over MedGemma 1.

Comparative Analysis of Large Language Models in Healthcare
cs.CL · 2026-04 · unverdicted · novelty 3.0
Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.

Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG · 2024-01 · unverdicted · novelty 3.0
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)
cs.CL · 2025-01 · unverdicted · novelty 2.0
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.


[14] [25]

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced lstm for natural language inference. arXiv preprint arXiv:1609.06038


[15] [26]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)


[16] [27]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805


[17] [28]

William Hersh, Aaron Cohen, Lynn Ruslen, and Phoebe Roberts. 2007. Trec 2007 genomics track overview. In TREC 2007


[18] [29]


William Hersh, Aaron M. Cohen, Phoebe Roberts, and Hari Krishna Rekapalli. 2006. Trec 2006 genomics track overview. In TREC 2006


[19] [30]

Qiao Jin, Bhuwan Dhingra, William W Cohen, and Xinghua Lu. 2019. Probing biomedical embeddings from language models. arXiv preprint arXiv:1904.02181


[20] [31]

Seongsoon Kim, Donghyeon Park, Yonghwa Choi, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, and Jaewoo Kang. 2018. A pilot study of biomedical text comprehension using an attention-based deep neural reader: Design and experimental analysis. JMIR medical informatics, 6(1):e2


[21] [32]

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Transactions of the Association of Computational Linguistics, 6:317--328


[22] [33]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Rhinehart, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, et al. 2019. Natural questions: a benchmark for question answering research


[23] [34]

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683


[24] [35]

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. Biobert: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746


[25] [36]

Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. arXiv preprint arXiv:1805.04871


[26] [37]


Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55--60. http://www.aclweb.org/anthology/P/P14/P14-5010


[27] [38]

Roser Morante, Martin Krallinger, Alfonso Valencia, and Walter Daelemans. 2012. Machine reading of biomedical texts about alzheimers disease. In CLEF 2012 Conference and Labs of the Evaluation Forum-Question Answering For Machine Reading Evaluation (QA4MRE), Rome/Forner, J.[edit.]; ea, pages 1--14


[28] [39]

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732


[29] [40]

Dimitris Pappas, Ion Androutsopoulos, and Haris Papageorgiou. 2018. Bioread: A new dataset for biomedical reading comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)


[30] [41]

Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, and Roser Morante. 2013. Qa4mre 2011-2013: Overview of question answering for machine reading evaluation. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 303--320. Springer


[31] [42]

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365


[32] [43]

Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing


[33] [44]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250

