arXiv: 1907.04286 · v1 · pith:6S4E7O5Hnew · submitted 2019-07-09 · 💻 cs.IR · cs.CL · cs.LG

UW-BHI at MEDIQA 2019: An Analysis of Representation Methods for Medical Natural Language Inference

Pith reviewed 2026-05-25 00:02 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.LG
keywords medical natural language inference, representation methods, BERT, ESIM, MedNLI, ESP, Cui2Vec, semantic understanding

The pith

The performance and internal representations of an ESIM model on MedNLI depend on whether BERT, ESP or Cui2Vec supplies the input representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares an Enhanced Sequential Inference Model across three representation methods—BERT, Embeddings of Semantic Predications, and Cui2Vec—on the Medical Natural Language Inference task. The goal is to understand how these methods, which combine distributed and knowledge-based approaches, perform when semantic understanding is central. A reader would care because the results could guide choices in medical NLP applications where inference depends on domain semantics. The evaluation focuses on both task accuracy and what the model learns internally under each condition.

Core claim

The choice of representation method among BERT, ESP, and Cui2Vec influences both the accuracy achieved by the ESIM on the MedNLI task and the characteristics of the model's internal representations.

What carries the argument

The Enhanced Sequential Inference Model (ESIM) operating under different embedding conditions from BERT, ESP, or Cui2Vec.
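
As a rough illustration of what it means to hold the architecture fixed while swapping the input representation, the sketch below applies ESIM-style soft alignment and feature enhancement (following the general recipe of Chen et al., 2017) to two toy embedding tables. The tables, tokens, and function names are hypothetical. They are not the BERT, ESP, or Cui2Vec vectors used in the paper, and the BiLSTM encoders of the full ESIM are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def embed(tokens, table, dim):
    """Look up each token; unknown tokens get a zero vector."""
    return [list(table.get(tok, [0.0] * dim)) for tok in tokens]


def soft_align(a_vecs, b_vecs):
    """For each vector in a_vecs, return the attention-weighted sum of b_vecs."""
    aligned = []
    for a in a_vecs:
        weights = softmax([dot(a, b) for b in b_vecs])
        aligned.append([
            sum(w * b[k] for w, b in zip(weights, b_vecs))
            for k in range(len(a))
        ])
    return aligned


def enhance(vecs, aligned):
    """Concatenate [a, a~, a - a~, a * a~] for each position."""
    return [
        a + t + [x - y for x, y in zip(a, t)] + [x * y for x, y in zip(a, t)]
        for a, t in zip(vecs, aligned)
    ]


# Hypothetical 2-d embedding tables standing in for two representation methods.
TOY_TABLES = {
    "table_a": {"patient": [1.0, 0.0], "has": [0.0, 1.0],
                "fever": [0.7, 0.7], "febrile": [0.6, 0.8]},
    "table_b": {"patient": [0.2, 0.9], "has": [0.5, 0.5],
                "fever": [0.9, 0.1], "febrile": [0.1, 0.9]},
}
premise = ["patient", "has", "fever"]
hypothesis = ["patient", "is", "febrile"]  # "is" is unknown in both tables

features_by_table = {}
for name, table in TOY_TABLES.items():
    p = embed(premise, table, dim=2)
    h = embed(hypothesis, table, dim=2)
    # Same downstream computation, different input vectors.
    features_by_table[name] = enhance(p, soft_align(p, h))
    print(name, [round(x, 3) for x in features_by_table[name][2]])
```

The downstream functions never change. Only the lookup table does, so any difference in the resulting features is attributable to the representation.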

If this is right

  • Different representation methods will produce different levels of performance on the MedNLI task.
  • The internal representations learned by the model will reflect the properties of the chosen input embeddings.
  • The MedNLI task can serve to distinguish the strengths of knowledge-based versus purely distributed representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Results from this comparison could guide selection of embeddings for other clinical inference tasks beyond MedNLI.
  • The analysis opens the possibility of hybrid representations that combine the strengths observed from each method.

Load-bearing premise

The MedNLI task relies heavily on semantic understanding and therefore serves as a suitable evaluation set for comparing the representation methods.

What would settle it

If the three representation methods produce identical performance scores and indistinguishable internal representations in the ESIM on the MedNLI dataset, the claim that they differ would be falsified.
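
One way to make "identical performance" testable is a paired significance test on the per-example predictions of two conditions. The sketch below uses an exact McNemar test, which is a choice made here for illustration and not a procedure the paper is known to use. The gold labels and predictions are toy values, not results from the paper, and the function names are hypothetical.

```python
from math import comb


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    only_a = sum(a == g and b != g for g, a, b in zip(gold, pred_a, pred_b))
    only_b = sum(a != g and b == g for g, a, b in zip(gold, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two systems never disagree on correctness
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data only: three-way NLI labels for eight sentence pairs.
gold = ["entailment", "neutral", "contradiction", "entailment",
        "neutral", "contradiction", "entailment", "neutral"]
pred_a = list(gold)
pred_a[7] = "entailment"          # condition A: one error
pred_b = list(gold)
for i in (1, 4, 7):
    pred_b[i] = "contradiction"   # condition B: three errors

acc_a = accuracy(gold, pred_a)
acc_b = accuracy(gold, pred_b)
p_value = mcnemar_exact(gold, pred_a, pred_b)
print(acc_a, acc_b, p_value)
```

A large p-value on the real test set would be consistent with the conditions being indistinguishable in accuracy. It would say nothing about the internal representations, which need a separate comparison.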

Figures

Figures reproduced from arXiv: 1907.04286 by Jason A. Thomas, William R. Kearns, Wilson Lau.

Figure 1: An example of a correct BERT prediction demonstrating its general domain coverage and contextual ...
Figure 2: An example of a correct ESP prediction demonstrating its ability to associate Advil as a subclass of ...
Figure 3: Distribution of predicted probability of the gold label from the subset of correct predictions for each ...

Original abstract

Recent advances in distributed language modeling have led to large performance increases on a variety of natural language processing (NLP) tasks. However, it is not well understood how these methods may be augmented by knowledge-based approaches. This paper compares the performance and internal representation of an Enhanced Sequential Inference Model (ESIM) between three experimental conditions based on the representation method: Bidirectional Encoder Representations from Transformers (BERT), Embeddings of Semantic Predications (ESP), or Cui2Vec. The methods were evaluated on the Medical Natural Language Inference (MedNLI) subtask of the MEDIQA 2019 shared task. This task relied heavily on semantic understanding and thus served as a suitable evaluation set for the comparison of these representation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper compares the performance and internal representation of an Enhanced Sequential Inference Model (ESIM) between three experimental conditions based on the representation method: BERT, Embeddings of Semantic Predications (ESP), or Cui2Vec. The methods were evaluated on the Medical Natural Language Inference (MedNLI) subtask of the MEDIQA 2019 shared task, which the authors argue is suitable because it relies heavily on semantic understanding.

Significance. If the results demonstrate clear, reproducible differences in how these representations capture medical semantics inside a fixed ESIM architecture, the work would help clarify when knowledge-based embeddings augment or underperform contextual models such as BERT on clinical inference tasks. The explicit focus on both performance and internal representations is a strength that could inform embedding selection in medical NLP.

major comments (1)
  1. [Abstract] Abstract: the description of an empirical comparison is given, yet no performance numbers, error bars, training details, or evaluation metrics are reported. Without these data it is impossible to verify whether the three representation methods produce distinguishable results on MedNLI.
minor comments (1)
  1. [Abstract] Abstract: the statement that MedNLI 'served as a suitable evaluation set' is asserted without a supporting citation or short rationale linking the task's semantic demands to the three chosen representations.
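
The error bars requested in the major comment could be produced in several ways. A minimal sketch, assuming a percentile bootstrap over per-example correctness, is shown here. The function name, the default of 2000 resamples, and the toy correctness vector are all assumptions for illustration, not details from the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 flags, one per test example.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Toy data only: 15 correct predictions out of 20 examples.
toy_correct = [1] * 15 + [0] * 5
point, lower, upper = bootstrap_accuracy_ci(toy_correct)
print(f"accuracy={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Reporting one such interval per representation method would let a reader judge at a glance whether the conditions overlap.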

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. We address the single major comment below and agree that revisions to the abstract are warranted.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the description of an empirical comparison is given, yet no performance numbers, error bars, training details, or evaluation metrics are reported. Without these data it is impossible to verify whether the three representation methods produce distinguishable results on MedNLI.

    Authors: We agree the abstract should report key results to allow immediate assessment of whether the representation methods yield distinguishable outcomes. In revision we will add the primary accuracy figures for the three ESIM variants (BERT, ESP, Cui2Vec) on the MedNLI test set and state that accuracy is the reported metric. Full training hyper-parameters, random seeds, and any error bars or statistical tests belong in the experimental setup and results sections rather than the abstract. We will check that those sections contain this information, and expand them where they do not, so that the distinguishability of the three conditions can be verified.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper conducts a straightforward empirical comparison of three off-the-shelf representation methods (BERT, ESP, Cui2Vec) inside a fixed ESIM architecture on the MedNLI task. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The justification that MedNLI requires semantic understanding is an explicit, non-circular assumption consistent with the experimental goal. The work contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical comparison of representation methods and contains no mathematical derivations, free parameters, axioms, or invented entities.
