Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT

Ernesto Jimenez-Ruiz; Hui Yang; Jiaoyan Chen; Jonathon Dilworth; Yongsheng Gao

arxiv: 2511.16698 · v2 · submitted 2025-11-17 · 💻 cs.CL · cs.AI

Jonathon Dilworth, Hui Yang, Jiaoyan Chen, Yongsheng Gao, Ernesto Jimenez-Ruiz

Pith reviewed 2026-05-17 21:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: SNOMED CT · out-of-vocabulary queries · hierarchical retrieval · hyperbolic embeddings · subsumption inference · ontology embeddings · biomedical ontologies · language models

The pith

Language-model embeddings in hyperbolic space enable effective hierarchical retrieval from SNOMED CT for queries using out-of-vocabulary terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles hierarchical concept retrieval from SNOMED CT, a large biomedical ontology, when queries are out-of-vocabulary (OOV), i.e. they contain terms with no equivalent match in the ontology. It develops an approach that uses language models to embed both queries and ontology concepts, then projects these embeddings into hyperbolic space, where subsumption relations can be inferred. This makes it possible to identify the most specific concepts that subsume a query, along with their broader ancestors. On three new datasets of annotated OOV queries, the method outperforms standard sentence-embedding models and simple string-matching approaches. The authors position the technique as adaptable to other structured ontologies beyond SNOMED CT.

Core claim

The central claim is that projecting language-model-based embeddings of SNOMED CT concepts into hyperbolic space supports efficient and accurate subsumption inference between an arbitrary OOV textual query and any concept in the ontology, and that this yields better hierarchical retrieval than existing embedding and lexical baselines.

What carries the argument

Hyperbolic embeddings of ontology concepts generated from language models, used to model and infer subsumption relations for OOV queries.
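The following minimal sketch shows how the subsumption score quoted from the paper in the Lean-theorems section, s(C ⊑ D) = −(d_κ(x_C, x_D) + λ(∥x_D∥_κ − ∥x_C∥_κ)), could rank candidate subsumers for an embedded query. Several details are assumptions rather than the authors' code:

- the ball has curvature −c, with c playing the role of κ;
- ∥x∥_κ is taken to be the hyperbolic distance from the origin;
- `lam` stands in for λ, set to an arbitrary 0.5;
- the 2-d embeddings and concept names are made up;
- `rank_subsumers` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def poincare_dist(x: Sequence[float], y: Sequence[float], c: float = 1.0) -> float:
    """Closed-form geodesic distance on the Poincare ball of curvature -c."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq / ((1.0 - c * nx) * (1.0 - c * ny))
    return math.acosh(arg) / math.sqrt(c)


def hyperbolic_norm(x: Sequence[float], c: float = 1.0) -> float:
    """||x|| taken as the hyperbolic distance from the origin (an assumption)."""
    return poincare_dist([0.0] * len(x), x, c)


def subsumption_score(x_c: Sequence[float], x_d: Sequence[float],
                      lam: float = 0.5, c: float = 1.0) -> float:
    """s(C ⊑ D) = -(d(x_C, x_D) + lam * (||x_D|| - ||x_C||)); higher = more plausible."""
    return -(poincare_dist(x_c, x_d, c)
             + lam * (hyperbolic_norm(x_d, c) - hyperbolic_norm(x_c, c)))


def rank_subsumers(query: Sequence[float], concepts: Dict[str, Sequence[float]],
                   k: int = 3, lam: float = 0.5, c: float = 1.0) -> List[Tuple[str, float]]:
    """Score every concept as a candidate subsumer of the query and keep the top k."""
    scored = [(name, subsumption_score(query, emb, lam, c)) for name, emb in concepts.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-d embeddings; a real system would use the tuned LM's outputs.
concepts = {
    "general_concept": [0.0, 0.0],     # at the origin: the most general concept
    "specific_parent": [0.4, 0.1],     # on the geodesic from the origin to the query
    "unrelated_sibling": [0.4, -0.3],  # similar depth, different branch
}
query = [0.6, 0.15]  # an embedded OOV query
print([name for name, _ in rank_subsumers(query, concepts)])
# ['specific_parent', 'general_concept', 'unrelated_sibling']
```

With `lam = 1`, the root and every ancestor lying on the query's geodesic would tie at a score of 0. A weight below 1 is what lets the most specific subsumer rank first in this toy setting.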

If this is right

When retrieving both the most specific subsumers and their less relevant ancestors for OOV queries, the method outperforms SBERT, SapBERT, and two lexical matching methods on the constructed datasets.
The approach supports efficient subsumption inference in hyperbolic space for hierarchical concept matching.
Evaluation datasets and code are made available to support further research on ontology retrieval.
The method extends in principle to other biomedical or hierarchical ontologies.
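This summary does not name the metrics behind the retrieval comparison. As an assumption, the sketch below uses two common ranking metrics, mean reciprocal rank (MRR) and Hits@k. The gold set for each query may hold just the most specific subsumer(s) or, for the broader setting, the relevant ancestors as well. The concept IDs, rankings, and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1/rank of the first gold concept in the ranked list, or 0 if none is retrieved."""
    for rank, concept_id in enumerate(ranked, start=1):
        if concept_id in gold:
            return 1.0 / rank
    return 0.0


def hits_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1 if any gold concept appears in the top k, else 0."""
    return 1.0 if any(cid in gold for cid in ranked[:k]) else 0.0


def evaluate(runs: Dict[str, List[str]], golds: Dict[str, Set[str]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over every annotated query; missing runs count as misses."""
    n = len(golds)
    mrr = sum(reciprocal_rank(runs.get(q, []), g) for q, g in golds.items()) / n
    hits = sum(hits_at_k(runs.get(q, []), g, k) for q, g in golds.items()) / n
    return {"MRR": mrr, f"Hits@{k}": hits}


# Hypothetical concept IDs and rankings, for illustration only.
golds = {"q1": {"C2"}, "q2": {"C5"}}
runs = {"q1": ["C1", "C2", "C3"], "q2": ["C5", "C4"]}
print(evaluate(runs, golds, k=1))  # {'MRR': 0.75, 'Hits@1': 0.5}
```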

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such retrieval could enhance search capabilities in medical databases where users phrase questions in everyday language.
Integrating this with electronic health record systems might reduce errors from mismatched terminology.
Testing on queries from different medical subfields could reveal where the hyperbolic projection provides the largest gains over flat embeddings.

Load-bearing premise

The assumption that language-model embeddings projected into hyperbolic space reliably encode the subsumption relation between an arbitrary OOV textual query and an arbitrary SNOMED CT concept.

What would settle it

A new test set of OOV queries with expert-annotated correct subsumers on which the hyperbolic method neither outperforms nor matches standard Euclidean embeddings or lexical matching.

Figures

Figures reproduced from arXiv: 2511.16698 by Ernesto Jimenez-Ruiz, Hui Yang, Jiaoyan Chen, Jonathon Dilworth, Yongsheng Gao.

**Figure 2.** Depth intuition within the Poincaré Ball. Concepts …

**Figure 3.** Architecture for embedding and retrieval using …

Original abstract

SNOMED CT is a biomedical ontology with a hierarchical representation, modelling terminological concepts at a large scale. Knowledge retrieval in SNOMED CT is critical for its application but often proves challenging due to linguistic ambiguity, synonymy, polysemy, and so on. This problem is exacerbated when the queries are out-of-vocabulary (OOV), i.e., lacking any equivalent matches in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach driven by utilising language model-based ontology embeddings, which represent hierarchical concepts in a hyperbolic space for enabling efficient subsumption inference between a textual query and an arbitrary concept. For evaluation, we construct three datasets where OOV queries are annotated against SNOMED CT concepts, testing the retrieval of the most specific subsumers and their less relevant ancestors. We find that our method outperforms the baselines, including SBERT, SapBERT, and two lexical matching methods. While evaluated against SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release all the experiment codes and datasets at https://github.com/jonathondilworth/HR-OOV-SNOMED-CT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using language-model embeddings of SNOMED CT concepts and OOV textual queries, projected into hyperbolic space, to perform hierarchical retrieval by inferring subsumption relations. It constructs three evaluation datasets with annotated OOV queries, reports that the method outperforms SBERT, SapBERT, and two lexical baselines on retrieval of most-specific subsumers and ancestors, and claims generalizability to other ontologies while releasing code and data.

Significance. If the central empirical claim holds, the work offers a practical approach to OOV query handling in large biomedical ontologies, where linguistic variation is common. The open release of datasets and code strengthens reproducibility. However, the significance is tempered by the lack of detailed quantitative results or validation of the hyperbolic projection's ability to recover the partial order for arbitrary unseen queries.

major comments (2)

[Abstract and §4 (Evaluation)] The abstract states that the method 'outperforms the baselines' without giving any numerical results, error bars, statistical significance tests, or details on OOV query selection and annotation; the central empirical claim therefore cannot be assessed for magnitude or reliability from the provided summary.
[§3.2 (Hyperbolic Projection)] The assumption that off-the-shelf LM embeddings projected into hyperbolic space will place arbitrary OOV query vectors such that the most specific subsumer lies in the correct hierarchical position is load-bearing for the outperformance claim, yet no theoretical argument or ablation is given to show why standard Euclidean LM training plus projection recovers the SNOMED CT DAG partial order for unseen biomedical phrasing.
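The significance testing requested in the first major comment could take the form of a paired bootstrap over queries. The sketch below is one such test, not something the paper is known to report. The per-query scores (for example reciprocal ranks of two systems on the same queries) and the function name are hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(scores_a: Sequence[float], scores_b: Sequence[float],
                     n_resamples: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """Return (mean per-query gain of A over B, one-sided bootstrap p-value).

    The p-value is the fraction of query resamples in which A's mean score
    does not exceed B's; a small value suggests A's advantage is not merely an
    artefact of which queries happened to be sampled into the test set.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty per-query score lists of equal length")
    rng = random.Random(seed)  # fixed seed for reproducible estimates
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each query's A/B pair together.
        resampled_gain = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if resampled_gain <= 0.0:
            not_better += 1
    return sum(diffs) / n, not_better / n_resamples


# Hypothetical reciprocal ranks of two systems on the same four queries.
gain, p = paired_bootstrap([1.0, 0.5, 1.0, 0.25], [0.5, 0.5, 0.25, 0.25])
# gain is 0.3125; p estimates 1/16, the chance that all four resampled queries are ties.
```

Pairing matters here because OOV queries vary widely in difficulty. Resampling queries while keeping each query's A/B pair together compares the systems on the same items rather than on independent samples.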

minor comments (2)

[§2 (Related Work)] The discussion of prior hyperbolic embeddings for ontologies could usefully cite additional recent work on Poincaré embeddings for taxonomies to better situate the contribution.
[Figure 1] The diagram illustrating the embedding and projection pipeline would benefit from explicit labels on the hyperbolic space component to clarify the subsumption inference step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We have carefully considered each point and provide our responses below. We agree with several suggestions and will make revisions to improve the clarity and completeness of the manuscript.

Point-by-point responses

Referee: [Abstract and §4 (Evaluation)] The abstract states that the method 'outperforms the baselines' without giving any numerical results, error bars, statistical significance tests, or details on OOV query selection and annotation; the central empirical claim therefore cannot be assessed for magnitude or reliability from the provided summary.

Authors: We thank the referee for pointing this out. While the full evaluation details and numerical results are presented in Section 4 with tables comparing our method to SBERT, SapBERT, and lexical baselines, we agree that the abstract should provide a more concrete summary of the key findings. In the revised version, we will include specific metrics, such as the improvement in retrieval performance for most-specific subsumers, along with a brief mention of the dataset construction. For Section 4, we will ensure that error bars are included if they are not already present, and we will add statistical significance tests to the comparisons. Details on OOV query selection and annotation are described in the dataset construction subsection, but we will add a summary sentence to the abstract.
Revision: yes
Referee: [§3.2 (Hyperbolic Projection)] The assumption that off-the-shelf LM embeddings projected into hyperbolic space will place arbitrary OOV query vectors such that the most specific subsumer lies in the correct hierarchical position is load-bearing for the outperformance claim, yet no theoretical argument or ablation is given to show why standard Euclidean LM training plus projection recovers the SNOMED CT DAG partial order for unseen biomedical phrasing.

Authors: We recognize the importance of justifying the use of hyperbolic projection for recovering hierarchical relations with OOV queries. Our work is motivated by prior literature on hyperbolic embeddings for hierarchical data, where the geometry naturally accommodates tree-like structures. The empirical results across three datasets demonstrate that this approach yields superior performance compared to Euclidean baselines. We will add a more detailed discussion in Section 3.2 explaining the rationale, including references to works on hyperbolic space for ontologies. Regarding ablations, we have conducted experiments comparing hyperbolic and Euclidean projections, which we will highlight more explicitly in the revised manuscript or appendix.
Revision: partial

Circularity Check

0 steps flagged

Empirical evaluation against external baselines; no derivation reduces to internal fit or self-citation

Full rationale

The paper describes an embedding-based retrieval method (LM embeddings projected to hyperbolic space) and reports empirical outperformance on three constructed OOV query datasets against published baselines (SBERT, SapBERT, lexical matchers). No equations, predictions, or uniqueness claims are shown to reduce by construction to parameters fitted inside the paper or to prior self-citations that bear the central load. The evaluation is externally falsifiable via the released code and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

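The approach rests on standard assumptions of hyperbolic geometry for hierarchies and on the quality of pre-trained language model embeddings. No new free parameters or invented entities are described in the abstract, although the scoring function quoted below from the paper body does involve a curvature κ and a weighting factor λ.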

axioms (1)

Domain assumption: hyperbolic space can faithfully embed hierarchical relations, so that subsumption corresponds to a geometric relation between embeddings.
Invoked when the method places ontology concepts in hyperbolic space to enable subsumption inference.
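One way to make this axiom concrete is to spell out the geometry behind the scoring function quoted from the paper in the Lean-theorems section. The definitions below assume the standard Poincaré ball of curvature −κ. As in HiT [6], they also take ∥x∥_κ to be the hyperbolic distance from the origin; the summary itself does not define these symbols.

```latex
% Poincare ball of curvature -\kappa (\kappa > 0)
\mathbb{B}^n_\kappa = \{\, x \in \mathbb{R}^n : \kappa \lVert x \rVert^2 < 1 \,\}

d_\kappa(x, y) = \frac{1}{\sqrt{\kappa}}\,\operatorname{arcosh}\!\left(1 + \frac{2\kappa\,\lVert x - y \rVert^2}{(1 - \kappa \lVert x \rVert^2)(1 - \kappa \lVert y \rVert^2)}\right)

\lVert x \rVert_\kappa := d_\kappa(\mathbf{0}, x) = \frac{2}{\sqrt{\kappa}}\,\operatorname{artanh}\!\left(\sqrt{\kappa}\,\lVert x \rVert\right)

s(C \sqsubseteq D) := -\Big(d_\kappa(x_C, x_D) + \lambda\big(\lVert x_D \rVert_\kappa - \lVert x_C \rVert_\kappa\big)\Big)

% Triangle inequality through the origin: d_\kappa(x_C, x_D) \ge \lVert x_C \rVert_\kappa - \lVert x_D \rVert_\kappa, so
s(C \sqsubseteq D) \le -(1 - \lambda)\big(\lVert x_C \rVert_\kappa - \lVert x_D \rVert_\kappa\big)
```

Equality holds when x_D lies on the geodesic segment from the origin to x_C. For 0 ≤ λ < 1, the bound grows as ∥x_D∥_κ approaches ∥x_C∥_κ from below. Among ancestors placed on that segment, the deepest one, i.e. the most specific subsumer, therefore scores highest. This is one way to read the depth intuition of Figure 2.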

pith-pipeline@v0.9.0 · 5532 in / 1175 out tokens · 23522 ms · 2026-05-17T21:22:58.967157+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon; each cited theorem comes from the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem.

They both jointly encode text labels by taking a language model and efficiently embed the hierarchy in a hyperbolic space, allowing direct subsumption inference ... s(C ⊑ D) := −(d_κ(x_C, x_D) + λ(∥x_D∥_κ − ∥x_C∥_κ))
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

HiT incorporates textual information by tuning a pre-trained language model (PLM) using contrastive hyperbolic objectives
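The quoted passage mentions "contrastive hyperbolic objectives". The sketch below illustrates what such objectives can look like, modelled on the two losses described for HiT [6]:

- a hyperbolic clustering (triplet) loss that pulls a concept towards its parent and away from a negative;
- a centripetal loss that keeps the parent closer to the origin than the child.

The margins `alpha` and `beta`, the equal-weight sum in the hypothetical `hit_style_loss`, and the toy points are all assumptions. In training, the points would be outputs of the PLM being tuned rather than fixed lists.

```python
import math
from typing import Sequence


def poincare_dist(x: Sequence[float], y: Sequence[float], c: float = 1.0) -> float:
    """Closed-form geodesic distance on the Poincare ball of curvature -c."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * c * sq / ((1.0 - c * nx) * (1.0 - c * ny))) / math.sqrt(c)


def hyperbolic_norm(x: Sequence[float], c: float = 1.0) -> float:
    """Hyperbolic distance from the origin."""
    return poincare_dist([0.0] * len(x), x, c)


def clustering_loss(child, parent, negative, alpha: float = 0.1, c: float = 1.0) -> float:
    """Hinge: the child should be at least alpha closer to its parent than to a negative."""
    return max(0.0, poincare_dist(child, parent, c) - poincare_dist(child, negative, c) + alpha)


def centripetal_loss(child, parent, beta: float = 0.1, c: float = 1.0) -> float:
    """Hinge: the parent should be at least beta closer to the origin than the child."""
    return max(0.0, hyperbolic_norm(parent, c) - hyperbolic_norm(child, c) + beta)


def hit_style_loss(child, parent, negative, alpha: float = 0.1, beta: float = 0.1,
                   c: float = 1.0) -> float:
    """Hypothetical combined objective: the two hinge terms summed with equal weight."""
    return clustering_loss(child, parent, negative, alpha, c) + centripetal_loss(child, parent, beta, c)


# Toy points (hypothetical): the parent lies between the origin and the child.
child, parent, negative = [0.6, 0.15], [0.4, 0.1], [-0.5, 0.3]
print(hit_style_loss(child, parent, negative))  # 0.0: both hinge terms are inactive
```

Swapping the roles of child and parent activates the centripetal term. That penalty is what pushes general concepts towards the origin during tuning.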

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ClinQueryAgent: A Conversational Agent for Population Health Management
cs.IR 2026-04 unverdicted novelty 4.0

The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 s...

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Ferishta Bakhshi-Raiez, NF de Keizer, Ronald Cornet, M Dorrepaal, Dave Dongelmans, and Monique WM Jaspers. 2012. A usability evaluation of a SNOMED CT based compositional interface terminology for intensive care. International journal of medical informatics 81, 5 (2012), 351–362.

[2] Jiaoyan Chen, Pan Hu, Ernesto Jimenez-Ruiz, Ole Magnus Holter, Denvar Antonyrajah, and Ian Horrocks. 2021. OWL2Vec*: Embedding of OWL Ontologies. doi:10.48550/arXiv.2009.14654

[3] Jiaoyan Chen, Olga Mashkova, Fernando Zhapa-Camacho, Robert Hoehndorf, Yuan He, and Ian Horrocks. 2025. Ontology embedding: a survey of methods, applications and resources. IEEE Knowledge and Data Engineering (2025).

[4] NHS England. 2025. SNOMED CT (Terminology and Classifications).

[5] Yuan He, Jiaoyan Chen, Hang Dong, Ian Horrocks, Carlo Allocca, Taehun Kim, and Brahmananda Sapkota. 2024. DeepOnto: A Python package for ontology engineering with deep learning. Semantic Web 15, 5 (2024), 1991–2004.

[6] Yuan He, Zhangdie Yuan, Jiaoyan Chen, and Ian Horrocks. 2024. Language Models as Hierarchy Encoders. doi:10.48550/arXiv.2401.11374

[7] Pascal Hitzler, Markus Krötzsch, Bijan Parsia, Peter F. Patel-Schneider, and Sebastian Rudolph. 2012. OWL 2 web ontology language: Primer (second edition). W3C Recommendation. World Wide Web Consortium (W3C).

[8] Maximilian Nickel and Douwe Kiela. 2017. Poincaré Embeddings for Learning Hierarchical Representations. doi:10.48550/arXiv.1705.08039

[9] Sampo Pyysalo, Tomoko Ohta, Rafal Rak, Andrew Rowley, Hong-Woo Chun, Sung-Jae Jung, Sung-Pil Choi, Jun’ichi Tsujii, and Sophia Ananiadou. 2015. Overview of the cancer genetics and pathway curation tasks of bionlp shared task 2013. BMC bioinformatics 16, Suppl 10 (2015), S2. Publisher: Springer.

[10] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. doi:10.48550/arXiv.1908.10084

[11] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. 2024. Benchmarking retrieval-augmented generation for medicine. In Findings of the association for computational linguistics ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Bangkok, Thailand and virtual meeting, 6233–6251.

[12] Hui Yang, Jiaoyan Chen, Yuan He, Yongsheng Gao, and Ian Horrocks. 2025. Language models as ontology encoders. arXiv preprint arXiv:2507.14334 (2025).

[13] Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, and Sanjiv Kumar. 2025. Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe. arXiv preprint arXiv:2509.16411 (2025).

Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT · submission to WWW’26, April 13–17, 2026, Dubai, UAE
