BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

Halil Kilicoglu; Lecheng Zheng; Mengfei Lan

arxiv: 2604.15591 · v1 · submitted 2026-04-17 · 💻 cs.IR · cs.AI


Pith reviewed 2026-05-10 09:25 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords biomedical information retrieval · contrastive learning · MeSH labels · hierarchical multi-label learning · text embeddings · sentence similarity


The pith

Hierarchical MeSH labels supply structured supervision for multi-label contrastive learning that improves biomedical retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes BioHiCL to replace coarse binary relevance signals with hierarchical MeSH annotations in contrastive learning for biomedical texts. By treating labels as multi-level positives and negatives, the method captures semantic overlaps that flat relevance judgments miss. The resulting Base and Large models reach competitive results on retrieval, sentence similarity, and question answering while staying small enough for practical use. A sympathetic reader would care because biomedical search often fails when documents share only partial concept hierarchies rather than exact matches.

Core claim

BioHiCL leverages hierarchical MeSH annotations to provide structured supervision for multi-label contrastive learning, enabling generative retrievers to model domain semantics more precisely than binary relevance approaches allow.

What carries the argument

Hierarchical multi-label contrastive learning, in which MeSH label trees define positive pairs at varying levels of specificity and negative pairs across branches.
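As a rough illustration of how a label tree can grade pairs, the sketch below assumes each document carries MeSH-style tree numbers (dot-separated paths) and scores a pair by the deepest prefix that any two of their labels share. The tree numbers, the names `shared_depth` and `pair_level`, and the prefix-depth rule itself are assumptions made for illustration; the paper's actual pairing rule is not stated in the text reviewed here.

```python
def shared_depth(a: str, b: str) -> int:
    """Number of leading tree-number segments two labels have in common."""
    depth = 0
    for seg_a, seg_b in zip(a.split("."), b.split(".")):
        if seg_a != seg_b:
            break
        depth += 1
    return depth


def pair_level(labels_a: list[str], labels_b: list[str]) -> int:
    """Deepest shared level over all label pairs; 0 means different branches."""
    return max((shared_depth(a, b) for a in labels_a for b in labels_b), default=0)


# Hypothetical tree numbers, used only to show the three cases.
anchor = ["C01.100.200"]
same_concept = ["C01.100.200"]
sibling = ["C01.100.300"]
other_branch = ["D05.700"]

levels = {
    "same_concept": pair_level(anchor, same_concept),  # full path matches
    "sibling": pair_level(anchor, sibling),            # shares the parent path
    "other_branch": pair_level(anchor, other_branch),  # no shared prefix
}
print(levels)
```

Under this assumed rule, a higher level marks a more specific positive, and level 0 marks a cross-branch negative.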

If this is right

Retrieval systems can rank documents that share only parent or sibling MeSH concepts higher than those with no overlap.
Sentence similarity judgments become graded rather than binary, reflecting partial label matches.
Question answering pipelines gain from improved document ranking without increasing model size.
Smaller 0.1B and 0.3B parameter models remain competitive, lowering deployment cost.
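The first two consequences can be made concrete with a small sketch. It assumes labels are dot-separated tree numbers, expands each label to include its ancestors, and uses Jaccard overlap as the graded score. The helper names, the Jaccard choice, and the example tree numbers are illustrative assumptions, not the paper's scoring function.

```python
def with_ancestors(labels):
    """Expand each tree number into itself plus every ancestor prefix."""
    expanded = set()
    for label in labels:
        parts = label.split(".")
        for i in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:i]))
    return expanded


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets score 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flat_score(labels_a, labels_b):
    """Overlap on exact labels only (no hierarchy)."""
    return jaccard(set(labels_a), set(labels_b))


def graded_score(labels_a, labels_b):
    """Overlap after ancestor expansion, so parents and siblings count."""
    return jaccard(with_ancestors(labels_a), with_ancestors(labels_b))


# Hypothetical tree numbers for one query and three candidates.
query = ["C01.100.200"]
candidates = {
    "exact": ["C01.100.200"],
    "sibling": ["C01.100.300"],
    "unrelated": ["D05.700"],
}
ranking = sorted(
    candidates,
    key=lambda name: graded_score(query, candidates[name]),
    reverse=True,
)
print(ranking)
```

With exact-label overlap, the sibling and the unrelated candidate both score zero. After ancestor expansion, the sibling lands between the exact match and the unrelated document.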

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same hierarchical label structure could be tested on non-biomedical domains that already possess taxonomies, such as legal or technical document collections.
Combining the multi-label supervision with existing generative retriever architectures might reduce the need for large-scale negative sampling.
A controlled ablation that removes only the hierarchy while keeping multi-label structure would isolate how much of the gain comes from the tree versus the multi-label format.

Load-bearing premise

The hierarchical relationships recorded in MeSH annotations are consistent and informative enough to give better learning signals than simple relevant-or-not labels.

What would settle it

Retrain the identical model architecture on the same data but replace the hierarchical multi-label loss with standard binary relevance contrastive loss and measure whether retrieval, similarity, and QA scores drop, stay flat, or rise.
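One way to read "drop, stay flat, or rise" off such an experiment is a paired bootstrap over per-query score differences. The sketch below assumes both models were scored on the same queries. The function names, the 95% interval, and the input numbers are placeholders for illustration and are not taken from the paper.

```python
import random


def paired_bootstrap(hier_scores, binary_scores, n_resamples=2000, seed=0):
    """Mean per-query change (binary minus hierarchical) with a 95% bootstrap interval."""
    if not hier_scores or len(hier_scores) != len(binary_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - h for h, b in zip(hier_scores, binary_scores)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), low, high


def verdict(low, high):
    """Map the interval onto the three outcomes named above."""
    if high < 0:
        return "drop"
    if low > 0:
        return "rise"
    return "flat"


# Placeholder per-query scores (e.g. nDCG@10), NOT results from the paper.
hier = [0.5, 0.6, 0.7]
binary = [0.4, 0.5, 0.6]
mean_diff, low, high = paired_bootstrap(hier, binary)
print(verdict(low, high))
```

Pairing by query removes between-query variance, which matters when benchmark queries differ widely in difficulty.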

Figures

Figures reproduced from arXiv: 2604.15591 by Halil Kilicoglu, Lecheng Zheng, Mengfei Lan.

**Figure 2.** Overview of BioHiCL. During LoRA fine-tuning, embedding similarity (…

Original abstract

Effective biomedical information retrieval requires modeling domain semantics and hierarchical relationships among biomedical texts. Existing biomedical generative retrievers build on coarse binary relevance signals, limiting their ability to capture semantic overlap. We propose BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning), which leverages hierarchical MeSH annotations to provide structured supervision for multi-label contrastive learning. Our models, BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B), achieve promising performance on biomedical retrieval, sentence similarity, and question answering tasks, while remaining computationally efficient for deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes BioHiCL, a hierarchical multi-label contrastive learning framework for biomedical retrieval that leverages MeSH label hierarchies to provide structured supervision signals. It introduces two compact models (BioHiCL-Base at 0.1B parameters and BioHiCL-Large at 0.3B parameters) and claims they deliver promising results on biomedical retrieval, sentence similarity, and question-answering tasks while remaining computationally efficient.

Significance. If the central claims hold after proper validation, the work could meaningfully advance biomedical IR by moving beyond coarse binary relevance to exploit the natural hierarchy in MeSH annotations. The emphasis on small, deployable models is a practical strength. However, the absence of any quantitative results, baselines, or ablations in the current manuscript prevents any assessment of whether the hierarchical component actually drives gains over standard multi-label contrastive learning.

major comments (3)

[Abstract] Abstract: The abstract asserts 'promising performance' on retrieval, similarity, and QA tasks but supplies no numerical results, baselines, ablation studies, or error analysis. This renders the central claim—that hierarchical MeSH labels supply structured supervision beyond flat multi-label or binary signals—impossible to evaluate from the manuscript.
[Method] Method section (loss formulation): The concrete mechanism by which the MeSH hierarchy is incorporated into the contrastive loss (ancestor positives, level-weighted sampling, tree-distance negatives, label propagation, etc.) is not specified. Without this detail it is impossible to determine whether reported improvements, if any, arise from hierarchical structure rather than multi-label supervision alone, which is load-bearing for the paper's main contribution.
[Experiments] Experiments: No ablation isolating the hierarchical component versus a flat multi-label contrastive baseline is presented. The skeptic note correctly identifies that any gains could be explained by multi-label supervision; this missing comparison directly undermines the claim that hierarchy provides additional structured value.
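To make the second and third comments concrete, here is one possible form such a loss could take: a cross-entropy between label-derived soft targets and the softmax over embedding similarities for a single anchor. This is a sketch under stated assumptions, not the paper's formulation. The function name, the temperature value, and the example weights are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_contrastive_loss(sim_row, label_weights, temperature=0.05):
    """Cross-entropy between label-derived targets and softmax(sim / T) for one anchor.

    sim_row: embedding similarities between the anchor and each candidate.
    label_weights: non-negative label-overlap scores for the same candidates.
    """
    total = sum(label_weights)
    if total == 0:
        return 0.0  # no positives for this anchor, so it contributes nothing
    targets = [w / total for w in label_weights]
    probs = softmax([s / temperature for s in sim_row])
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


# Hypothetical similarities and label weights for three candidates.
sims = [0.9, 0.6, 0.1]
hier_weights = [1.0, 0.5, 0.0]    # second candidate shares only a parent concept
binary_weights = [1.0, 0.0, 0.0]  # same pair treated as not relevant

hier_loss = soft_contrastive_loss(sims, hier_weights)
binary_loss = soft_contrastive_loss(sims, binary_weights)
print(hier_loss > binary_loss)
```

With one-hot weights the expression reduces to the usual single-positive InfoNCE term. A flat baseline and a hierarchical variant can therefore share one code path and differ only in how `label_weights` is built, which is the comparison the ablation comment asks for.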

minor comments (2)

[Abstract] The abstract and introduction repeatedly use the phrase 'promising performance' without defining what threshold or comparison makes performance promising; replace with concrete metrics once results are added.
[Model Description] Model sizes are given in parameters (0.1B, 0.3B) but no details on architecture, pre-training corpus, or training compute are provided; add these for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will make the hierarchical contribution clearer and the claims more rigorously supported by quantitative evidence.

Point-by-point responses

Referee: [Abstract] Abstract: The abstract asserts 'promising performance' on retrieval, similarity, and QA tasks but supplies no numerical results, baselines, ablation studies, or error analysis. This renders the central claim (that hierarchical MeSH labels supply structured supervision beyond flat multi-label or binary signals) impossible to evaluate from the manuscript.

Authors: We agree that the abstract would be strengthened by including concrete numerical results. In the revised manuscript we will update the abstract to report key metrics (e.g., nDCG@10 and MAP for retrieval, Pearson/Spearman correlations for sentence similarity, and accuracy for QA) together with the main baseline comparisons, so that the performance claims can be directly evaluated.
Revision: yes

Referee: [Method] Method section (loss formulation): The concrete mechanism by which the MeSH hierarchy is incorporated into the contrastive loss (ancestor positives, level-weighted sampling, tree-distance negatives, label propagation, etc.) is not specified. Without this detail it is impossible to determine whether reported improvements, if any, arise from hierarchical structure rather than multi-label supervision alone, which is load-bearing for the paper's main contribution.

Authors: The referee correctly identifies that the precise integration of hierarchy into the loss must be stated explicitly. Although the current manuscript describes the overall hierarchical multi-label contrastive framework, we will expand the Method section with the exact loss equations, the rules for selecting ancestor positives, level-based weighting, tree-distance negative sampling, and any label-propagation steps. This will allow readers to distinguish the hierarchical signal from flat multi-label supervision.
Revision: yes

Referee: [Experiments] Experiments: No ablation isolating the hierarchical component versus a flat multi-label contrastive baseline is presented. The skeptic note correctly identifies that any gains could be explained by multi-label supervision; this missing comparison directly undermines the claim that hierarchy provides additional structured value.

Authors: We concur that an explicit ablation is necessary to isolate the benefit of the MeSH hierarchy. We will add a dedicated ablation subsection in the Experiments section that compares BioHiCL against an otherwise identical flat multi-label contrastive baseline (no hierarchy), reports all relevant metrics with statistical significance, and includes additional standard baselines plus error analysis. This will directly address whether the hierarchical structure supplies value beyond multi-label supervision.
Revision: yes
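For readers unfamiliar with the retrieval metric named in the first response, the following is a minimal sketch of nDCG@k. It assumes linear gains and assumes that every judged document for the query appears in the ranked list, so the ideal ordering can be obtained by sorting that same list. Evaluation toolkits may use exponential gains or a separate pool of judgments. The function names are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k items (rank 1 has discount log2(2))."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Relevance grades in ranked order; graded labels are where partial MeSH overlap would show up.
perfect = ndcg_at_k([3, 2, 1, 0])
reversed_order = ndcg_at_k([0, 1, 2, 3])
print(perfect, reversed_order < perfect)
```

Because nDCG accepts graded relevance, it can reward a ranking that places partially matching documents above unrelated ones, which a binary metric would not distinguish.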

Circularity Check

0 steps flagged

No significant circularity; no derivations or equations present

Full rationale

The paper proposes BioHiCL as a contrastive learning method that incorporates hierarchical MeSH annotations for multi-label supervision in biomedical retrieval. No mathematical derivations, loss equations, or first-principles results are described in the provided text. The approach is presented as an extension of standard contrastive learning with added hierarchical label handling, without any self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs. Performance claims are empirical and externally verifiable on retrieval, similarity, and QA tasks, making the work self-contained against benchmarks with no internal circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MeSH hierarchical structure provides useful multi-label supervision beyond binary relevance; no free parameters or invented entities are visible in the abstract.

axioms (1)

Domain assumption: Hierarchical MeSH annotations supply structured supervision that improves contrastive learning for biomedical texts.
Invoked in the abstract as the key innovation over coarse binary signals.



