arxiv: 2605.28868 · v1 · pith:A6JC6RDW · new · submitted 2026-05-22 · 💻 cs.LG · cs.AI

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

Pith reviewed 2026-06-30 16:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords metagenomic taxonomic annotation · knowledge distillation · genomic foundation models · label noise reduction · microbial classification · CAMI2 benchmarks · sequence representation learning · soft label transfer

The pith

Distilling soft labels from a large genomic foundation model into a small student network corrects noise from similarity searches and raises accuracy in identifying microbial sources of DNA fragments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Metagenomic taxonomic annotation identifies the microbial origins of DNA pieces collected from mixed environmental samples. Conventional similarity search tools generate noisy hard labels because reference databases are incomplete and microbial diversity is high, which then impairs learning-based correction methods. TaxDistill counters this by letting a 500 million parameter genomic foundation model produce soft labels that reflect deeper sequence semantics and confidence levels. These soft labels are distilled into a lightweight student network so the training signal becomes cleaner. The approach yields higher classification performance on standard benchmark datasets and matters because better microbial identification supports applications from environmental monitoring to clinical microbiome analysis even when reference data remain incomplete.

Core claim

TaxDistill introduces a knowledge distillation framework in which a 500M-parameter genomic foundation model serves as teacher, extracting semantic features and emitting soft labels that are transferred to a lightweight student network. This process reduces label noise originating from the initial retrieval tools and yields higher F1 scores than prior baselines on most of the seven CAMI2 datasets; on the Gastrointestinal set, for example, it lifts MMseqs2 performance from 0.763 to 0.941 and also surpasses the Taxometer baseline.

What carries the argument

Knowledge distillation of soft labels generated by a genomic foundation model teacher into a student classifier to reduce noise in taxonomic labels.
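
The mechanism can be illustrated with a generic temperature-scaled distillation loss in the style of Hinton et al. (2015). This is a minimal sketch under stated assumptions, not the paper's loss: the abstract does not give the objective, so the temperature `T`, the mixing weight `alpha`, and the function names below are all hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a larger T gives a flatter distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Assumed Hinton-style objective:
    alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, hard_label).
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence between the softened teacher and student distributions
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    # Cross-entropy against the (possibly noisy) hard label at T = 1
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * T * T * kl + (1 - alpha) * ce
```

With `alpha` close to 1 the student follows the teacher's soft distribution almost entirely, which is the regime in which noisy retrieval labels would have the least influence on training.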

If this is right

  • F1 score on the Gastrointestinal CAMI2 dataset rises from 0.763 with MMseqs2 alone to 0.941 after distillation.
  • The method outperforms the Taxometer baseline on most of the seven tested CAMI2 datasets.
  • Label noise from incomplete reference databases is mitigated through the use of confidence-weighted soft labels.
  • Representation learning for metagenomic sequences becomes more reliable when training signals are cleaned by distillation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pattern could be applied to other sequence labeling tasks that currently rely on noisy similarity-derived labels.
  • A lightweight student obtained this way might enable accurate annotation pipelines to run on portable sequencing hardware with limited compute.
  • If the teacher's semantic features capture relationships beyond database matches, the approach may help annotate fragments from microbes absent from existing references.

Load-bearing premise

The soft labels produced by the large genomic foundation model are less noisy and more informative than the hard labels supplied by similarity search tools.

What would settle it

Training the student network on the distilled soft labels and observing no improvement in F1 score over a student trained directly on the hard labels from the retrieval tools across the CAMI2 evaluation sets would falsify the claim.
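
The comparison described above reduces to scoring two students on the same held-out contigs. The sketch below assumes macro-averaged F1, since the abstract does not say which averaging the paper uses; the labels are made-up toy values and carry no information about the actual result.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes present in y_true (an assumed convention)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy predictions from two students on the same contigs
truth = ["A", "A", "B", "B", "C", "C"]
student_hard = ["A", "B", "B", "B", "C", "A"]  # trained on retrieval hard labels
student_soft = ["A", "A", "B", "B", "C", "A"]  # trained on distilled soft labels

# A gain at or below zero across the evaluation sets would count against the claim
gain = macro_f1(truth, student_soft) - macro_f1(truth, student_hard)
```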

Figures

Figures reproduced from arXiv: 2605.28868 by Lun Li, Rongye Ye, Shuhui Song, Yiran Zhan, Zheng Luo.

Figure 1. Metagenomic Analysis Pipeline and Research
Figure 2. The overall architecture of the proposed TaxDistill framework. It consists of three core modules:
Figure 3. CAMI2 Human Microbiome Dataset Experimental Results. Each row demonstrates the optimization
Figure 4. Experimental Results on CAMI2 Plant Rhizosphere and Marine Datasets. Each column demonstrates
Figure 5. Sankey diagram analysis of label transition and recalibration dynamics on the Marine dataset. The first row
Figure 6. Ablation study on the effect of contig sequence length on model inference performance.
Figure 7. Dataset contig volume statistics and time
Original abstract

Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by the high microbial diversity and the incompleteness of reference databases, which has motivated the development of learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. However, these methods typically rely on labels derived from similarity search tools during training, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this issue, we propose TaxDistill, a knowledge distillation framework for metagenomic classification. We introduce GenomeOcean, a 500M parameter genomic foundation model, as the teacher network to extract deep semantic features and generate soft labels based on confidence. By distilling this soft label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets demonstrate that TaxDistill outperforms existing baselines in most scenarios. For instance, on the Gastrointestinal dataset, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TaxDistill, a knowledge-distillation framework for metagenomic taxonomic annotation. A 500M-parameter genomic foundation model (GenomeOcean) serves as teacher to generate soft labels from deep semantic features; these are distilled into a lightweight student network to reduce label noise arising from initial similarity-search tools such as MMseqs2. Experiments on seven CAMI2 datasets report substantial F1 gains (e.g., MMseqs2 F1 rising from 0.763 to 0.941 on the Gastrointestinal dataset) and outperformance relative to Taxometer and other baselines.

Significance. If the reported gains are shown to arise specifically from the use of lower-noise soft labels rather than from architecture, training regime, or data handling differences, the work would demonstrate a practical route for leveraging large genomic foundation models to correct noisy labels in metagenomic classification. The concrete numerical improvements on standard CAMI2 benchmarks constitute a clear empirical contribution, though their attribution remains to be verified.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the central performance claim (e.g., F1 0.763 → 0.941) is presented without an ablation that directly measures whether argmax(teacher soft labels) matches ground-truth CAMI2 labels more frequently than the hard labels from MMseqs2 or other retrieval tools. Without this comparison, it is impossible to confirm that the distillation step reduces label noise rather than that gains arise from the student architecture or training procedure alone.
  2. [§4, §3.2] §4 and §3.2 (Methods): no statistical significance tests, error bars, or cross-validation details are reported for the F1 improvements across the seven datasets, nor is there an oracle experiment isolating the contribution of soft-label distillation versus simply training the student on teacher hard labels. These omissions make the load-bearing claim that soft labels are “less noisy and more informative” unverifiable from the presented evidence.
minor comments (2)
  1. [Abstract, §1] The abstract and §1 refer to “seven diverse CAMI2 datasets” but do not list the exact dataset identifiers or accession numbers; this should be added for reproducibility.
  2. [§3] Notation for the distillation loss and the precise definition of “confidence” used to produce soft labels from GenomeOcean are not stated explicitly; a short equation or pseudocode block would clarify the procedure.
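
The label-accuracy ablation requested in major comment 1 could look like the sketch below. It assumes the teacher's soft labels are per-contig probability vectors over a shared class list and that ground-truth labels are available; the taxon names, variable names, and toy values are hypothetical and are not taken from the paper.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def label_accuracy(labels, truth):
    """Fraction of labels that match the ground truth."""
    return sum(1 for l, t in zip(labels, truth) if l == t) / len(truth)

classes = ["Bacteroides", "Escherichia", "Lactobacillus"]

# Made-up toy data: one entry per contig
truth = ["Bacteroides", "Escherichia", "Lactobacillus", "Escherichia"]
retrieval_hard = ["Bacteroides", "Bacteroides", "Lactobacillus", "Lactobacillus"]
teacher_soft = [
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
]

# Collapse each soft label to its most probable class before scoring
teacher_hard = [classes[argmax(p)] for p in teacher_soft]

acc_retrieval = label_accuracy(retrieval_hard, truth)
acc_teacher = label_accuracy(teacher_hard, truth)
```

Reporting both accuracies per dataset would show directly whether the teacher's labels are cleaner than the retrieval tool's, independent of the student architecture.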

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that additional ablations and statistical analyses are needed to strengthen attribution of gains to soft-label noise reduction and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the central performance claim (e.g., F1 0.763 → 0.941) is presented without an ablation that directly measures whether argmax(teacher soft labels) matches ground-truth CAMI2 labels more frequently than the hard labels from MMseqs2 or other retrieval tools. Without this comparison, it is impossible to confirm that the distillation step reduces label noise rather than that gains arise from the student architecture or training procedure alone.

    Authors: We agree that an explicit comparison of label accuracy (argmax of teacher soft labels versus MMseqs2 hard labels against CAMI2 ground truth) is required to isolate the contribution of the teacher. In the revised version we will add this ablation, reporting the percentage of correct labels for each source on all seven datasets. revision: yes

  2. Referee: [§4, §3.2] §4 and §3.2 (Methods): no statistical significance tests, error bars, or cross-validation details are reported for the F1 improvements across the seven datasets, nor is there an oracle experiment isolating the contribution of soft-label distillation versus simply training the student on teacher hard labels. These omissions make the load-bearing claim that soft labels are “less noisy and more informative” unverifiable from the presented evidence.

    Authors: We concur that statistical tests, error bars, and an oracle ablation are necessary. The revision will include: multiple training runs with standard deviations and paired significance tests; an oracle experiment training the student on teacher hard labels; and clarification of any cross-validation procedures used. revision: yes
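
One way to carry out the promised statistics is sketched below: mean and standard deviation over repeated runs, plus an exact paired sign-flip permutation test across datasets. The choice of test and the function names are assumptions, since the rebuttal does not name a procedure, and the F1 values are placeholders rather than results from the paper.

```python
import itertools
import statistics

def mean_std(values):
    """Mean and sample standard deviation over repeated training runs."""
    return statistics.mean(values), statistics.stdev(values)

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total

# Placeholder per-dataset F1 scores for two training regimes (not real results)
f1_soft = [0.90, 0.88, 0.91, 0.85, 0.87, 0.92, 0.89]
f1_hard = [0.86, 0.85, 0.88, 0.84, 0.83, 0.90, 0.86]

p_value = paired_sign_flip_pvalue(f1_soft, f1_hard)
```

With seven datasets the test enumerates 2^7 = 128 sign patterns, so the smallest achievable two-sided p-value is 2/128; a conclusion drawn from only seven pairs should be read with that floor in mind.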

Circularity Check

0 steps flagged

No circularity: empirical distillation evaluated on external benchmarks

The paper describes TaxDistill as a standard knowledge-distillation pipeline: GenomeOcean (introduced as a 500M-parameter teacher) produces soft labels that are distilled into a student network, with performance measured by F1 on the independent CAMI2 benchmark suites. No equations, uniqueness theorems, or fitted parameters are presented whose outputs reduce by construction to the inputs; the reported gains (e.g., MMseqs2 F1 0.763 → 0.941) are obtained via ordinary supervised training and held-out evaluation. The derivation chain therefore consists of externally falsifiable experimental comparisons rather than self-definitional or self-citation loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described in the provided text.


Reference graph

Works this paper leans on

34 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Emanuel Ben-Baruch, Adam Botach, Igor Kviatkovsky, Manoj Aggarwal, and Gérard Medioni. 2024. Distilling the knowledge in data pruning. arXiv preprint arXiv:2403.07854

  2. [2]

    Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, and 1 others. 2026. Genome modelling and design across all domains of life with evo 2. Nature, pages 1--13

  3. [3]

    Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, and Le Song. 2024. Training compute-optimal protein language models. Advances in Neural Information Processing Systems, 37:69386--69418

  4. [4]

    Charles Y Chiu and Steven A Miller. 2019. Clinical metagenomics. Nature Reviews Genetics, 20(6):341--355

  5. [5]

    Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. 2021. Knowledge distillation: A survey. International journal of computer vision, 129(6):1789--1819

  6. [6]

    Jo Handelsman. 2004. Metagenomics: application of genomics to uncultured microorganisms. Microbiology and molecular biology reviews, 68(4):669--685

  7. [7]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  8. [8]

    Felix Kallenborn, Alejandro Chacon, Christian Hundt, Hassan Sirelkhatim, Kieran Didi, Sooyoung Cha, Christian Dallago, Milot Mirdita, Bertil Schmidt, and Martin Steinegger. 2025. Gpu-accelerated homology search with mmseqs2. Nature Methods, 22(10):2024--2027

  9. [9]

    Daehwan Kim, Li Song, Florian P Breitwieser, and Steven L Salzberg. 2016. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research, 26(12):1721--1729

  10. [10]

    Jaebeom Kim and Martin Steinegger. 2024. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and dna. Nature methods, 21(6):971--973

  11. [11]

    Svetlana Kutuzova, Mads Nielsen, Pau Piera, Jakob Nybo Nissen, and Simon Rasmussen. 2024. Taxometer: Improving taxonomic classification of metagenomics contigs. Nature Communications, 15(1):8357

  12. [12]

    Eli Levy Karin and Martin Steinegger. 2025. Cutting-edge deep-learning based tools for metagenomic research. National Science Review, 12(6):nwaf056

  13. [13]

    Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. 2017. Learning from noisy labels with distillation. In Proceedings of the IEEE international conference on computer vision, pages 1910--1918

  14. [14]

    Qiaoxing Liang, Paul W Bible, Yu Liu, Bin Zou, and Lai Wei. 2020. Deepmicrobes: taxonomic classification for metagenomics with deep learning. NAR Genomics and Bioinformatics, 2(1):lqaa009

  15. [15]

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, and 1 others. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123--1130

  16. [16]

    Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Carlos Fernandez-Granda. 2020. Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems, 33:20331--20342

  17. [17]

    Fernando Meyer, Adrian Fritz, Zhi-Luo Deng, David Koslicki, Till Robin Lesker, Alexey Gurevich, Gary Robertson, Mohammed Alser, Dmitry Antipov, Francesco Beghini, and 1 others. 2022. Critical assessment of metagenome interpretation: the second round of challenges. Nature methods, 19(4):429--440

  18. [18]

    Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. 2019. When does label smoothing help? Advances in neural information processing systems, 32

  19. [19]

    Stephen Nayfach, Simon Roux, Rekha Seshadri, Daniel Udwary, Neha Varghese, Frederik Schulz, Dongying Wu, David Paez-Espino, I-Min Chen, Marcel Huntemann, and 1 others. 2021. A genomic catalog of earth’s microbiomes. Nature biotechnology, 39(4):499--509

  20. [20]

    R Prabakaran and Yana Bromberg. 2025. Deciphering enzymatic potential in metagenomic reads through dna language models. Nucleic Acids Research, 53(16):gkaf836

  21. [21]

    Simon H Ye, Katherine J Siddle, Daniel J Park, and Pardis C Sabeti. 2019. Benchmarking metagenomics tools for taxonomic classification. Cell, 178(4):779--794

  22. [22]

    Luke R Thompson, Jon G Sanders, Daniel McDonald, Amnon Amir, Joshua Ladau, Kenneth J Locey, Robert J Prill, Anupriya Tripathi, Sean M Gibbons, Gail Ackermann, and 1 others. 2017. A communal catalogue reveals earth’s multiscale microbial diversity. Nature, 551(7681):457--463

  23. [23]

    Jack Valmadre. 2022. Hierarchical classification at multiple operating points. Advances in Neural Information Processing Systems, 35:18034--18045

  24. [24]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

  25. [25]

    Harit Vishwakarma, Yi Chen, Satya Sai Srinath Namburi Gnvv, Sui Jiet Tay, Ramya Korlakai Vinayak, and Frederic Sala. 2025. Rethinking confidence scores and thresholds in pseudolabeling-based ssl. In Forty-second International Conference on Machine Learning

  26. [26]

    Alexander Wichmann, Etienne Buschong, André Müller, Daniel Jünger, Andreas Hildebrandt, Thomas Hankeln, and Bertil Schmidt. 2023. Metatransformer: deep metagenomic sequencing read classification using self-attention models. NAR Genomics and Bioinformatics, 5(3):lqad082

  27. [27]

    Derrick E Wood, Jennifer Lu, and Ben Langmead. 2019. Improved metagenomic analysis with kraken 2. Genome biology, 20(1):257

  28. [28]

    Rongguang Ye, Ming Tang, and Edith CH Ngai. 2025. On-the-fly adaptation to quantization: Configuration-aware lora for efficient fine-tuning of quantized llms. arXiv preprint arXiv:2509.25214

  29. [29]

    Rongye Ye, Lun Li, Ana Tereza Ribeiro de Vasconcelos, and Shuhui Song. 2026. Influ-bert: a domain-adaptive genomic language model for advancing influenza a virus research. Briefings in Bioinformatics, 27(2):bbag171

  30. [30]

    Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2020. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3903--3911

  31. [31]

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530

  32. [32]

    Zhihan Zhou, Robert Riley, Satria Kautsar, Weimin Wu, Rob Egan, Steven Hofmeyr, Shira Goldhaber-Gordon, Mutian Yu, Harrison Ho, Fengchen Liu, and 1 others. 2025. Genomeocean: an efficient genome foundation model trained on large-scale metagenomic assemblies. bioRxiv
