SurfCon: Synonym Discovery on Privacy-Aware Clinical Data
Pith reviewed 2026-05-25 18:43 UTC · model grok-4.3
The pith
SurfCon discovers medical synonyms from aggregated co-occurrence counts without raw clinical texts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurfCon is a framework that leverages surface form information to detect synonyms with similar appearances and global context information from aggregated co-occurrence counts to detect semantically similar synonyms, allowing synonym discovery on privacy-aware clinical data while also addressing out-of-vocabulary query terms.
What carries the argument
SurfCon framework with a surface form module and a complementary global context module that together operate on aggregated term co-occurrences.
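The division of labor between the two modules can be illustrated with a minimal sketch. It does not reproduce the paper's modules: it swaps in two simple stand-ins, character-trigram Jaccard overlap for surface form and cosine similarity over raw co-occurrence rows for global context. The terms, the counts, the helper names, and the equal weighting are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical aggregated co-occurrence counts (term -> {context term: count}).
# No raw text is involved; only these counts and the term strings are used.
COOC = {
    "heart attack": {"troponin": 40, "chest pain": 55, "aspirin": 30},
    "myocardial infarction": {"troponin": 48, "chest pain": 50, "aspirin": 35},
    "heart attacks": {"troponin": 12, "chest pain": 20, "aspirin": 9},
    "migraine": {"aura": 25, "sumatriptan": 18, "chest pain": 1},
}


def char_ngrams(term, n=3):
    """Set of character n-grams of a term, padded with '#' at both ends."""
    padded = f"#{term.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def surface_score(a, b):
    """Jaccard overlap of character trigrams (stand-in for a surface form module)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def context_score(a, b, cooc=COOC):
    """Cosine similarity of co-occurrence rows (stand-in for a global context module)."""
    va, vb = cooc.get(a, {}), cooc.get(b, {})
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(a, b, weight=0.5):
    """Weighted mix of the two signals; the weight is an arbitrary assumption."""
    return weight * surface_score(a, b) + (1 - weight) * context_score(a, b)


def rank_candidates(query, cooc=COOC):
    """Rank every other term in the data by combined score, best first."""
    candidates = [t for t in cooc if t != query]
    return sorted(candidates, key=lambda t: combined_score(query, t), reverse=True)


print(rank_candidates("heart attack"))
```

In this toy data, "heart attacks" is found mainly through the surface signal, while "myocardial infarction" shares no trigrams with the query and is found only through the context signal.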
If this is right
- SurfCon identifies both surface-similar and semantically similar synonyms from the same aggregated input.
- The framework handles out-of-vocabulary query terms not present in the given data.
- All processing stays within privacy-aware aggregated counts and never requires raw patient texts.
- Performance exceeds strong baseline methods by large margins under varied experimental conditions.
Where Pith is reading between the lines
- Hospitals could generate and share synonym resources derived from their local aggregates while keeping raw records private.
- The same dual-module approach may transfer to other domains that release only aggregated term statistics.
- Combining string-level features with global statistics can offset the loss of full sentence context.
Load-bearing premise
Surface form details together with global co-occurrence patterns in aggregated data contain enough signal to identify accurate synonyms for both similar and dissimilar surface forms.
What would settle it
Run SurfCon on a held-out list of established medical synonyms and measure whether correct synonyms are ranked substantially lower than incorrect ones or whether many known synonyms are missed entirely.
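One way to make this check concrete is to score ranked outputs against a held-out gold list. The sketch below is an assumption about how such a test could be set up, not the paper's evaluation protocol: it computes mean reciprocal rank and recall@k, and the function names and toy rankings are hypothetical.

```python
def reciprocal_rank(ranked, gold):
    """1 / position of the first gold synonym in the ranking, or 0 if none appears."""
    for position, term in enumerate(ranked, start=1):
        if term in gold:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked, gold, k):
    """Fraction of gold synonyms that appear in the top k ranked candidates."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def evaluate(rankings, gold_sets, k=10):
    """Average both metrics over all held-out queries."""
    queries = list(gold_sets)
    mrr = sum(reciprocal_rank(rankings[q], gold_sets[q]) for q in queries) / len(queries)
    recall = sum(recall_at_k(rankings[q], gold_sets[q], k) for q in queries) / len(queries)
    return {"mrr": mrr, f"recall@{k}": recall}


# Hypothetical system output and gold synonyms for two queries.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
gold_sets = {"q1": {"b"}, "q2": {"x", "w"}}
print(evaluate(rankings, gold_sets, k=2))
```

A low mean reciprocal rank would indicate that correct synonyms are ranked below incorrect ones; a low recall@k would indicate that many known synonyms are missed.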
Original abstract
Unstructured clinical texts contain rich health-related information. To better utilize the knowledge buried in clinical texts, discovering synonyms for a medical query term has become an important task. Recent automatic synonym discovery methods leveraging raw text information have been developed. However, to preserve patient privacy and security, it is usually quite difficult to get access to large-scale raw clinical texts. In this paper, we study a new setting named synonym discovery on privacy-aware clinical data (i.e., medical terms extracted from the clinical texts and their aggregated co-occurrence counts, without raw clinical texts). To solve the problem, we propose a new framework SurfCon that leverages two important types of information in the privacy-aware clinical data, i.e., the surface form information, and the global context information for synonym discovery. In particular, the surface form module enables us to detect synonyms that look similar while the global context module plays a complementary role to discover synonyms that are semantically similar but in different surface forms, and both allow us to deal with the OOV query issue (i.e., when the query is not found in the given data). We conduct extensive experiments and case studies on publicly available privacy-aware clinical data, and show that SurfCon can outperform strong baseline methods by large margins under various settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SurfCon, a framework for synonym discovery on privacy-aware clinical data (extracted medical terms and aggregated co-occurrence counts, without raw text). It uses a surface form module for similar-looking synonyms and a global context module for semantically similar but dissimilar-surface synonyms; both are claimed to handle OOV queries, with extensive experiments showing large-margin outperformance over strong baselines under various settings.
Significance. If the results are robust, the work addresses a practical constraint in medical NLP by enabling synonym discovery from privacy-preserving aggregated data rather than raw clinical notes, which could facilitate knowledge extraction in regulated settings.
major comments (1)
- [Abstract] The claim that 'both [modules] allow us to deal with the OOV query issue' is inconsistent with the global context module's definition via aggregated co-occurrence counts in the privacy-aware data. An OOV query term is absent from that data by definition and therefore has no associated counts, so the global context module supplies no signal; only the surface form module remains. This directly affects the central claim that the two information types together suffice for semantically similar synonyms on OOV queries, which underpins the reported large-margin gains under 'various settings' that include OOV.
minor comments (1)
- [Abstract] The abstract supplies no experimental details (datasets, baselines, metrics, or OOV-specific evaluation protocol), making it impossible to assess whether the outperformance claim is supported.
Simulated Author's Rebuttal
We thank the referee for the careful reading of the manuscript and for highlighting the inconsistency in the abstract's claim regarding OOV queries. We address this point directly below.
- Referee: [Abstract] The claim that 'both [modules] allow us to deal with the OOV query issue' is inconsistent with the global context module's definition via aggregated co-occurrence counts in the privacy-aware data. An OOV query term is absent from that data by definition and therefore has no associated counts, so the global context module supplies no signal; only the surface form module remains. This directly affects the central claim that the two information types together suffice for semantically similar synonyms on OOV queries, which underpins the reported large-margin gains under 'various settings' that include OOV.
Authors: We agree that the abstract wording is imprecise and inconsistent with the technical definition of the global context module. By construction, an OOV query has no co-occurrence counts in the privacy-aware data, so the global context module cannot contribute any signal for such queries; only the surface form module applies. The two modules are complementary for in-vocabulary queries, where global context can identify semantically similar terms with dissimilar surface forms. We will revise the abstract (and any corresponding statements in the introduction or method sections) to state clearly that the surface form module handles OOV queries while the global context module augments performance on in-vocabulary queries. We will also verify that the experimental results under the OOV setting are presented without implying contribution from the global context module.
Revision: yes
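The point conceded in this exchange can be shown with a small sketch, under the assumption that a term is in vocabulary exactly when it has a row in the aggregated counts. The data, the function names, and the use of `difflib.SequenceMatcher` as a surface similarity are illustrative choices, not the paper's method.

```python
from difflib import SequenceMatcher

# Hypothetical aggregated counts; the misspelling "heart atack" is absent (OOV).
COUNTS = {
    "myocardial infarction": {"troponin": 48, "chest pain": 50},
    "heart attack": {"troponin": 40, "chest pain": 55},
}


def surface_similarity(a, b):
    """String similarity in [0, 1]; needs only the two term strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_overlap(a, b, counts):
    """Jaccard overlap of the context terms each term co-occurs with."""
    ca, cb = set(counts.get(a, {})), set(counts.get(b, {})
    ) if False else set(counts.get(b, {}))
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def score(query, candidate, counts=COUNTS, weight=0.5):
    """Return (score, signals used). OOV queries fall back to surface form only."""
    surface = surface_similarity(query, candidate)
    if query not in counts:
        # No co-occurrence row exists, so there is no context signal to add.
        return surface, ["surface"]
    context = context_overlap(query, candidate, counts)
    return weight * surface + (1 - weight) * context, ["surface", "context"]


print(score("heart atack", "heart attack"))            # OOV query
print(score("heart attack", "myocardial infarction"))  # in-vocabulary query
```

For the OOV query the returned score equals the surface similarity alone, which mirrors the revised claim: the surface form module handles OOV queries, and global context helps only for in-vocabulary ones.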
Circularity Check
No significant circularity in derivation chain
The paper proposes the SurfCon framework leveraging surface-form and global-context modules on aggregated privacy-aware clinical data for synonym discovery, including OOV handling. No equations, derivations, or parameter-fitting steps are described in the abstract or text that reduce any prediction or result to its inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The method is presented as a direct combination of two information types without self-referential definitions or renaming of known results. The derivation remains self-contained as an empirical method proposal.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Aggregated co-occurrence counts encode semantic similarity between medical terms
- domain assumption: Surface form similarity is a reliable signal for synonymy in clinical terminology