UW-BHI at MEDIQA 2019: An Analysis of Representation Methods for Medical Natural Language Inference

Jason A. Thomas; William R. Kearns; Wilson Lau

arXiv: 1907.04286 · v1 · pith:6S4E7O5Hnew · submitted 2019-07-09 · cs.IR · cs.CL · cs.LG

Pith reviewed 2026-05-25 00:02 UTC · model grok-4.3

classification cs.IR · cs.CL · cs.LG

keywords medical natural language inference · representation methods · BERT · ESIM · MedNLI · ESP · Cui2Vec · semantic understanding

The pith

The performance and internal representations of an ESIM model on MedNLI depend on whether BERT, ESP or Cui2Vec supplies the input representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares an Enhanced Sequential Inference Model across three representation methods—BERT, Embeddings of Semantic Predications, and Cui2Vec—on the Medical Natural Language Inference task. The goal is to understand how these methods, which combine distributed and knowledge-based approaches, perform when semantic understanding is central. A reader would care because the results could guide choices in medical NLP applications where inference depends on domain semantics. The evaluation focuses on both task accuracy and what the model learns internally under each condition.

Core claim

The choice of representation method among BERT, ESP, and Cui2Vec influences both the accuracy achieved by the ESIM on the MedNLI task and the characteristics of the model's internal representations.

What carries the argument

The Enhanced Sequential Inference Model (ESIM) operating under different embedding conditions from BERT, ESP, or Cui2Vec.
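As a rough illustration of the model that is held fixed across the three conditions, the local-inference step of ESIM, as described in the original ESIM publication (Chen et al., 2017), can be sketched in NumPy. This is a minimal sketch and not the authors' implementation: the function names and the random inputs are hypothetical, and in a full ESIM the inputs `a` and `b` would be BiLSTM-encoded premise and hypothesis tokens rather than raw vectors.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def esim_local_inference(a, b):
    """Soft-align premise a (len_a, d) with hypothesis b (len_b, d) and
    build the enhanced features [x; x_aligned; x - x_aligned; x * x_aligned]."""
    e = a @ b.T                          # (len_a, len_b) alignment scores
    a_tilde = softmax(e, axis=1) @ b     # hypothesis summary for each premise token
    b_tilde = softmax(e, axis=0).T @ a   # premise summary for each hypothesis token
    m_a = np.concatenate([a, a_tilde, a - a_tilde, a * a_tilde], axis=1)
    m_b = np.concatenate([b, b_tilde, b - b_tilde, b * b_tilde], axis=1)
    return m_a, m_b

# Placeholder inputs standing in for encoded token vectors (assumption).
rng = np.random.default_rng(0)
premise = rng.normal(size=(5, 8))
hypothesis = rng.normal(size=(7, 8))
m_a, m_b = esim_local_inference(premise, hypothesis)
print(m_a.shape, m_b.shape)  # (5, 32) (7, 32)
```

Because this step only consumes token vectors, swapping the input representation changes what is fed in while leaving the alignment and enhancement logic untouched, which is what makes a like-for-like comparison possible.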

If this is right

Different representation methods will produce different levels of performance on the MedNLI task.
The internal representations learned by the model will reflect the properties of the chosen input embeddings.
The MedNLI task can serve to distinguish the strengths of knowledge-based versus purely distributed representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Results from this comparison could guide selection of embeddings for other clinical inference tasks beyond MedNLI.
The analysis opens the possibility of hybrid representations that combine the strengths observed from each method.

Load-bearing premise

The MedNLI task relies heavily on semantic understanding and therefore serves as a suitable evaluation set for comparing the representation methods.

What would settle it

If the three representation methods produce identical performance scores and indistinguishable internal representations in the ESIM on the MedNLI dataset, the claim that they differ would be falsified.
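One way to make "identical performance scores" concrete is to compare per-example correctness between two conditions. The sketch below computes accuracy and an exact McNemar test on paired predictions. The test choice, the function names, and the toy labels are assumptions made for illustration; they are placeholders, not results or methods from the paper.

```python
from math import comb

def accuracy(gold, pred):
    # Fraction of examples where the prediction matches the gold label.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two systems scored on the same examples."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for g, pa, pb in zip(gold, pred_a, pred_b) if pa == g and pb != g)
    c = sum(1 for g, pa, pb in zip(gold, pred_a, pred_b) if pa != g and pb == g)
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy placeholder data using the three MedNLI label names.
gold = ["entailment", "neutral", "contradiction", "neutral"]
pred_a = ["entailment", "neutral", "contradiction", "neutral"]
pred_b = ["neutral", "entailment", "entailment", "neutral"]

print(accuracy(gold, pred_a), accuracy(gold, pred_b))
print(mcnemar_exact(gold, pred_a, pred_b))
```

A paired test is used here because all conditions are evaluated on the same examples; an unpaired comparison of accuracies would discard that information.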

Figures

Figures reproduced from arXiv: 1907.04286 by Jason A. Thomas, William R. Kearns, Wilson Lau.

**Figure 2.** An example of a correct ESP prediction demonstrating its ability to associate Advil as a subclass of ...

**Figure 3.** Distribution of predicted probability of the gold label from the subset of correct predictions for each ...
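The quantity named in the Figure 3 caption can be computed from a model's class probabilities as sketched below. This is an illustrative reconstruction from the caption alone: the function name, the array layout, and the toy probabilities are assumptions, not the authors' code or data.

```python
import numpy as np

def gold_prob_when_correct(probs, gold):
    """Return the probability assigned to the gold label, restricted to
    examples where the argmax prediction equals the gold label."""
    probs = np.asarray(probs, dtype=float)   # (n_examples, n_classes)
    gold = np.asarray(gold, dtype=int)       # (n_examples,) gold class indices
    correct = probs.argmax(axis=1) == gold
    return probs[correct, gold[correct]]

# Toy placeholder probabilities over three classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.5, 0.4],
         [0.3, 0.3, 0.4]]
gold = [0, 2, 2]
print(gold_prob_when_correct(probs, gold))  # [0.7 0.4]
```

Running this once per representation condition and plotting the returned values as histograms would give one distribution per condition, which is a way to compare how confident each model is when it is right.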

Original abstract

Recent advances in distributed language modeling have led to large performance increases on a variety of natural language processing (NLP) tasks. However, it is not well understood how these methods may be augmented by knowledge-based approaches. This paper compares the performance and internal representation of an Enhanced Sequential Inference Model (ESIM) between three experimental conditions based on the representation method: Bidirectional Encoder Representations from Transformers (BERT), Embeddings of Semantic Predications (ESP), or Cui2Vec. The methods were evaluated on the Medical Natural Language Inference (MedNLI) subtask of the MEDIQA 2019 shared task. This task relied heavily on semantic understanding and thus served as a suitable evaluation set for the comparison of these representation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Basic comparison of three known embeddings in ESIM on MedNLI with no new method or reported numbers.

This paper runs BERT, ESP, and Cui2Vec through the same ESIM model on the MedNLI shared-task data and compares how the representations behave. That is the whole contribution. It does not propose a new architecture, derive anything from first principles, or claim to fix a known problem in medical NLP. The motivation is reasonable: MedNLI needs semantic understanding, so it is a fair test bed for seeing whether knowledge-based embeddings like ESP or Cui2Vec add anything over BERT. The setup is clean and the three conditions are clearly defined. That is the positive part.

The abstract gives no accuracy numbers, no training details, no error analysis, and no discussion of what the internal representations actually differ on. Without those, it is impossible to tell whether the comparison is informative or just noise. The scope is also narrow (one dataset, one model family, off-the-shelf embeddings), so there is little room for general claims. The paper is internally consistent and does not hide assumptions, but it also does not move the needle on representation choice.

It is the kind of thing that belongs in a shared-task proceedings or a short workshop note rather than a full paper. Readers already working on medical NLI might skim the results section for the numbers; everyone else can skip it. I would not bring this to a reading group and would not cite it. A serious editor could send it to review for a workshop if the full version contains solid tables and some qualitative inspection of the representations, but on the current text it looks like a desk reject for a main conference.

Referee Report

1 major / 1 minor

Summary. The paper compares the performance and internal representation of an Enhanced Sequential Inference Model (ESIM) between three experimental conditions based on the representation method: BERT, Embeddings of Semantic Predications (ESP), or Cui2Vec. The methods were evaluated on the Medical Natural Language Inference (MedNLI) subtask of the MEDIQA 2019 shared task, which the authors argue is suitable because it relies heavily on semantic understanding.

Significance. If the results demonstrate clear, reproducible differences in how these representations capture medical semantics inside a fixed ESIM architecture, the work would help clarify when knowledge-based embeddings augment or underperform contextual models such as BERT on clinical inference tasks. The explicit focus on both performance and internal representations is a strength that could inform embedding selection in medical NLP.

major comments (1)

[Abstract] The description of an empirical comparison is given, yet no performance numbers, error bars, training details, or evaluation metrics are reported. Without these data it is impossible to verify whether the three representation methods produce distinguishable results on MedNLI.

minor comments (1)

[Abstract] The statement that MedNLI 'served as a suitable evaluation set' is asserted without a supporting citation or short rationale linking the task's semantic demands to the three chosen representations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. We address the single major comment below and agree that revisions to the abstract are warranted.

Referee: [Abstract] The description of an empirical comparison is given, yet no performance numbers, error bars, training details, or evaluation metrics are reported. Without these data it is impossible to verify whether the three representation methods produce distinguishable results on MedNLI.

Authors: We agree the abstract should report key results to allow immediate assessment of whether the representation methods yield distinguishable outcomes. In revision we will add the primary accuracy figures for the three ESIM variants (BERT, ESP, Cui2Vec) on the MedNLI test set and state that accuracy is the reported metric. Full training hyperparameters, random seeds, and any error bars or statistical tests belong in the experimental setup and results sections rather than the abstract; we will check that those sections contain this information and expand them where they do not, so that the distinguishability of the three conditions can be verified.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper conducts a straightforward empirical comparison of three off-the-shelf representation methods (BERT, ESP, Cui2Vec) inside a fixed ESIM architecture on the MedNLI task. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The justification that MedNLI requires semantic understanding is an explicit, non-circular assumption consistent with the experimental goal. The work contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical comparison of representation methods and contains no mathematical derivations, free parameters, axioms, or invented entities.
