
arxiv: 1906.09285 · v1 · pith:XCVZTTPQnew · submitted 2019-06-21 · 💻 cs.CL

SurfCon: Synonym Discovery on Privacy-Aware Clinical Data

Pith reviewed 2026-05-25 18:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords synonym discovery · privacy-aware clinical data · surface form · global context · medical terms · out-of-vocabulary · co-occurrence counts · clinical texts

The pith

SurfCon discovers medical synonyms from aggregated co-occurrence counts without raw clinical texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Privacy rules often prevent access to full clinical texts even though they contain rich health information. SurfCon solves synonym discovery using only extracted medical terms paired with their aggregated co-occurrence counts. The method pairs surface-form similarity, which catches terms that look alike, with global context patterns that surface semantically related terms in different forms. It also manages queries for terms absent from the supplied data. Experiments on public privacy-aware datasets show consistent gains over strong baselines across multiple settings.

Core claim

SurfCon is a framework that leverages surface form information to detect synonyms with similar appearances and global context information from aggregated co-occurrence counts to detect semantically similar synonyms, allowing synonym discovery on privacy-aware clinical data while also addressing out-of-vocabulary query terms.

What carries the argument

SurfCon framework with a surface form module and a complementary global context module that together operate on aggregated term co-occurrences.
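The two-module idea can be illustrated with a minimal sketch. This is not the paper's model: character-trigram Jaccard overlap stands in for the surface form module, cosine similarity over raw co-occurrence rows stands in for the global context module, and the mixing weight `alpha`, the function names, and the toy counts are all assumptions made for illustration.

```python
import math

def char_ngrams(term, n=3):
    """Set of character n-grams of a term, padded with '#' at both ends."""
    padded = f"#{term.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def surface_score(a, b, n=3):
    """Jaccard overlap of character n-grams (stand-in for a surface form module)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

def context_score(a, b, cooc):
    """Cosine similarity of co-occurrence rows (stand-in for a global context module)."""
    va, vb = cooc.get(a, {}), cooc.get(b, {})
    dot = sum(w * vb.get(t, 0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_score(a, b, cooc, alpha=0.5):
    """Hypothetical linear mix of the two signals; alpha is an assumed weight."""
    return alpha * surface_score(a, b) + (1 - alpha) * context_score(a, b, cooc)

# Toy aggregated counts (invented for illustration): term -> {context term: count}
cooc = {
    "vitamin c": {"scurvy": 10, "supplement": 5},
    "ascorbic acid": {"scurvy": 8, "supplement": 4},
    "vitamin d": {"rickets": 9, "supplement": 6},
}

for cand in ("ascorbic acid", "vitamin d"):
    print(cand, round(combined_score("vitamin c", cand, cooc), 3))
```

In this toy data "vitamin d" looks more like the query than "ascorbic acid" does, but the context rows of "vitamin c" and "ascorbic acid" point in the same direction, so the combined score ranks the true synonym higher.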

If this is right

  • SurfCon identifies both surface-similar and semantically similar synonyms from the same aggregated input.
  • The framework handles out-of-vocabulary query terms not present in the given data.
  • All processing stays within privacy-aware aggregated counts and never requires raw patient texts.
  • Performance exceeds strong baseline methods by large margins under varied experimental conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could generate and share synonym resources derived from their local aggregates while keeping raw records private.
  • The same dual-module approach may transfer to other domains that release only aggregated term statistics.
  • Combining string-level features with global statistics can offset the loss of full sentence context.

Load-bearing premise

Surface form details together with global co-occurrence patterns in aggregated data contain enough signal to identify accurate synonyms for both similar and dissimilar surface forms.

What would settle it

Run SurfCon on a held-out list of established medical synonyms and check how the known synonyms are ranked. The premise fails if correct synonyms rank substantially lower than incorrect candidates, or if many known synonyms are missed entirely.
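A check of this kind could be scored as in the sketch below. It assumes a hypothetical `rankings` mapping from each query to a ranked candidate list and a held-out `gold_sets` mapping from each query to its known synonyms. Mean reciprocal rank and recall@k are common ranking metrics chosen here for illustration; the abstract does not say which metrics the paper reports.

```python
def reciprocal_rank(ranked, gold):
    """1/rank of the first known synonym in the ranked list, or 0.0 if none appears."""
    for i, cand in enumerate(ranked, start=1):
        if cand in gold:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked, gold, k):
    """Fraction of known synonyms that appear in the top k candidates."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)

def evaluate(rankings, gold_sets, k=1):
    """Average both metrics over all held-out queries."""
    queries = list(gold_sets)
    mrr = sum(reciprocal_rank(rankings.get(q, []), gold_sets[q]) for q in queries)
    rec = sum(recall_at_k(rankings.get(q, []), gold_sets[q], k) for q in queries)
    return {"mrr": mrr / len(queries), "recall_at_k": rec / len(queries)}

# Toy data (invented): model output and held-out synonym sets
rankings = {
    "vitamin c": ["vitamin d", "ascorbic acid", "scurvy"],
    "mi": ["heart attack", "myocardial infarction"],
}
gold_sets = {
    "vitamin c": {"ascorbic acid"},
    "mi": {"myocardial infarction", "heart attack"},
}

print(evaluate(rankings, gold_sets, k=1))
```

A low mean reciprocal rank would indicate that correct synonyms sit below incorrect candidates, and a low recall at generous `k` would indicate that known synonyms are missed entirely.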

Figures

Figures reproduced from arXiv: 1906.09285 by Huan Sun, Simon Lin, Soheil Moosavinasab, Xiang Yue, Yungui Huang, Zhen Wang.

Figure 1. Task illustration: We aim to discover synonyms for …
Figure 2. Framework overview. For each query term, a list …
Figure 3. Dynamic Context Matching Mechanism. In contrast to the static approach, we propose the dynamic context matching mechanism (as shown in …
Figure 4. Performance w.r.t. (a) the coefficient of context …
Original abstract

Unstructured clinical texts contain rich health-related information. To better utilize the knowledge buried in clinical texts, discovering synonyms for a medical query term has become an important task. Recent automatic synonym discovery methods leveraging raw text information have been developed. However, to preserve patient privacy and security, it is usually quite difficult to get access to large-scale raw clinical texts. In this paper, we study a new setting named synonym discovery on privacy-aware clinical data (i.e., medical terms extracted from the clinical texts and their aggregated co-occurrence counts, without raw clinical texts). To solve the problem, we propose a new framework SurfCon that leverages two important types of information in the privacy-aware clinical data, i.e., the surface form information, and the global context information for synonym discovery. In particular, the surface form module enables us to detect synonyms that look similar while the global context module plays a complementary role to discover synonyms that are semantically similar but in different surface forms, and both allow us to deal with the OOV query issue (i.e., when the query is not found in the given data). We conduct extensive experiments and case studies on publicly available privacy-aware clinical data, and show that SurfCon can outperform strong baseline methods by large margins under various settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes SurfCon, a framework for synonym discovery on privacy-aware clinical data (extracted medical terms and aggregated co-occurrence counts, without raw text). It uses a surface form module for similar-looking synonyms and a global context module for semantically similar but dissimilar-surface synonyms; both are claimed to handle OOV queries, with extensive experiments showing large-margin outperformance over strong baselines under various settings.

Significance. If the results are robust, the work addresses a practical constraint in medical NLP by enabling synonym discovery from privacy-preserving aggregated data rather than raw clinical notes, which could facilitate knowledge extraction in regulated settings.

major comments (1)
  1. [Abstract] Abstract: the claim that 'both [modules] allow us to deal with the OOV query issue' is inconsistent with the global context module's definition via aggregated co-occurrence counts in the privacy-aware data. An OOV query term is absent from that data by definition and therefore has no associated counts, so the global context module supplies no signal; only the surface form module remains. This directly affects the central claim that the two information types together suffice for semantically similar synonyms on OOV queries, which underpins the reported large-margin gains under 'various settings' that include OOV.
minor comments (1)
  1. [Abstract] The abstract supplies no experimental details (datasets, baselines, metrics, or OOV-specific evaluation protocol), making it impossible to assess whether the outperformance claim is supported.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading of the manuscript and for highlighting the inconsistency in the abstract's claim regarding OOV queries. We address this point directly below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'both [modules] allow us to deal with the OOV query issue' is inconsistent with the global context module's definition via aggregated co-occurrence counts in the privacy-aware data. An OOV query term is absent from that data by definition and therefore has no associated counts, so the global context module supplies no signal; only the surface form module remains. This directly affects the central claim that the two information types together suffice for semantically similar synonyms on OOV queries, which underpins the reported large-margin gains under 'various settings' that include OOV.

    Authors: We agree that the abstract wording is imprecise and inconsistent with the technical definition of the global context module. By construction, an OOV query has no co-occurrence counts in the privacy-aware data, so the global context module cannot contribute any signal for such queries; only the surface form module applies. The two modules are complementary for in-vocabulary queries, where global context can identify semantically similar terms with dissimilar surface forms. We will revise the abstract (and any corresponding statements in the introduction or method sections) to state clearly that the surface form module handles OOV queries while the global context module augments performance on in-vocabulary queries. We will also verify that the experimental results under the OOV setting are presented without implying contribution from the global context module.

    revision: yes
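The OOV point raised in this exchange can be made concrete with a small sketch. It follows the reading given in the report and the simulated rebuttal, not the paper's own mechanism: a query with no row in the aggregated counts gets a surface-only score, while an in-vocabulary query gets a mix of both signals. `difflib.SequenceMatcher` is used as a stand-in string similarity, and `score_candidate`, `alpha`, and the toy counts are hypothetical.

```python
import math
from difflib import SequenceMatcher

def surface_sim(a, b):
    """String similarity in [0, 1]; a stand-in for a learned surface form module."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def context_sim(a, b, cooc):
    """Cosine similarity between the co-occurrence rows of two in-vocabulary terms."""
    va, vb = cooc.get(a, {}), cooc.get(b, {})
    dot = sum(w * vb.get(t, 0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_candidate(query, cand, cooc, alpha=0.5):
    """Return (score, modules used). An OOV query has no counts, so only surface applies."""
    s = surface_sim(query, cand)
    if query not in cooc:
        return s, ("surface",)
    c = context_sim(query, cand, cooc)
    return alpha * s + (1 - alpha) * c, ("surface", "context")

# Toy aggregated counts (invented for illustration)
cooc = {
    "vitamin c": {"scurvy": 10, "supplement": 5},
    "ascorbic acid": {"scurvy": 8, "supplement": 4},
}

print(score_candidate("vitamin c tablets", "vitamin c", cooc))  # OOV query
print(score_candidate("vitamin c", "ascorbic acid", cooc))      # in-vocabulary query
```

Under this gating, an OOV query such as "vitamin c tablets" can still reach "vitamin c" through string overlap, but it has no route to "ascorbic acid" unless some other mechanism supplies context for the unseen term.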

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes the SurfCon framework leveraging surface-form and global-context modules on aggregated privacy-aware clinical data for synonym discovery, including OOV handling. No equations, derivations, or parameter-fitting steps are described in the abstract or text that reduce any prediction or result to its inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The method is presented as a direct combination of two information types without self-referential definitions or renaming of known results. The derivation remains self-contained as an empirical method proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review limited to abstract; ledger populated from stated assumptions in the problem definition and module descriptions.

axioms (2)
  • domain assumption: Aggregated co-occurrence counts encode semantic similarity between medical terms.
    Invoked by the global context module description in the abstract.
  • domain assumption: Surface form similarity is a reliable signal for synonymy in clinical terminology.
    Invoked by the surface form module description in the abstract.
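One common way to probe the first assumption is to reweight raw co-occurrence counts with positive pointwise mutual information (PPMI), which discounts pairs that co-occur often only because both terms are frequent. The sketch below is an illustration under that assumption; the abstract does not say whether SurfCon applies PPMI, and the `ppmi` helper and toy counts are invented.

```python
import math

def ppmi(pair_counts):
    """PPMI for unordered term pairs, treating the counts as a symmetric matrix."""
    total = sum(pair_counts.values())
    marg = {}
    for (a, b), c in pair_counts.items():
        marg[a] = marg.get(a, 0) + c
        marg[b] = marg.get(b, 0) + c
    out = {}
    for (a, b), c in pair_counts.items():
        # In the symmetric matrix every pair appears twice, so the grand total is 2 * total.
        ratio = c * 2 * total / (marg[a] * marg[b])
        out[(a, b)] = max(0.0, math.log(ratio))
    return out

# Toy aggregated pair counts (invented for illustration)
pair_counts = {
    ("a", "b"): 8,
    ("a", "c"): 1,
    ("b", "c"): 1,
    ("c", "d"): 10,
}

weights = ppmi(pair_counts)
print(weights)
```

Pairs that co-occur more often than their marginal frequencies predict keep a positive weight, and the rest are clipped to zero.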

pith-pipeline@v0.9.0 · 5763 in / 1163 out tokens · 24236 ms · 2026-05-25T18:43:27.925573+00:00 · methodology

