Morphology-Aware Multimodal Representation Learning for Insect Phylogenetic Reconstruction

Chun He; Gongyin Ye; Haishuai Wang; Jiajun Bu; Kaijie Yu; Xiaoxu Cai; Xinhai Ye; Zixuan Liu

arxiv: 2606.22077 · v1 · pith:6JIW63BInew · submitted 2026-06-20 · 💻 cs.CV

Morphology-Aware Multimodal Representation Learning for Insect Phylogenetic Reconstruction

Zixuan Liu , Kaijie Yu , Chun He , Xiaoxu Cai , Xinhai Ye , Haishuai Wang , Gongyin Ye , Jiajun Bu This is my paper

Pith reviewed 2026-06-26 12:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal learningphylogenetic reconstructioninsect morphologyvision transformercontrastive learningBayesian inferenceimage-text alignmentcontinuous traits

0 comments

The pith

Multimodal alignment of insect images with morphological descriptions yields embeddings that improve agreement with reference phylogenies in Bayesian reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that aligning specimen images with curated morphological descriptions in a shared latent space produces visual embeddings that function as continuous traits for Bayesian phylogenetic reconstruction. It adapts a vision transformer via parameter-efficient fine-tuning and supervised contrastive learning to achieve the alignment, then feeds the resulting image embeddings into standard tree-building pipelines. Experiments on the Rove-Tree-11 dataset show higher topological agreement with the reference phylogeny than single-modality visual baselines. A sympathetic reader would care because the method offers a route to automate incorporation of semantic morphological knowledge without exhaustive manual trait coding. This could scale phylogenetic work to more taxa by bridging existing image collections and text descriptions.

Core claim

The central claim is that the morphology-aware multimodal alignment framework, which combines specimen images with morphological descriptions through vision transformer adaptation and image-text alignment, derives image embeddings that, when used as continuous traits, improve topological agreement with the reference phylogeny in Bayesian reconstruction compared to single-modality approaches.

What carries the argument

The morphology-aware multimodal alignment framework that performs supervised contrastive learning for image-text alignment in a shared latent space after parameter-efficient fine-tuning of a vision transformer.

If this is right

Image embeddings from the aligned model serve as continuous traits that capture more phylogenetic signal than visual-only features.
Multimodal alignment produces higher topological agreement metrics than single-modality baselines across tested visual backbones.
The framework enables direct use of existing morphological text alongside images without manual discretization of traits.
Ablation results attribute performance gains specifically to the image-text alignment step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce dependence on fully expert-coded discrete characters if morphological descriptions already exist for imaged specimens.
It may generalize to other organism groups provided paired image-description datasets are available.
Future work could test whether the continuous traits combine productively with molecular sequence data in joint reconstructions.

Load-bearing premise

Curated morphological descriptions accurately encode phylogenetic signal that image-text alignment can capture and transfer into continuous visual traits for Bayesian reconstruction.

What would settle it

Finding equal or lower topological agreement with the reference phylogeny when the multimodal-aligned embeddings replace single-modality image embeddings in Bayesian reconstruction on the Rove-Tree-11 dataset.

read the original abstract

Morphological traits provide important evidence for phylogenetic reconstruction and evolutionary relationship analysis. Recent image-based approaches have introduced deep learning, particularly convolutional models, to derive morphological features from specimen images, but these methods generally rely on single-modality visual representations and do not explicitly incorporate morphological semantics. This study proposes a morphology-aware multimodal alignment framework for insect phylogenetic reconstruction. The framework combines specimen images with curated morphological descriptions by adapting a vision transformer through parameter-efficient fine-tuning and supervised contrastive learning, followed by image-text alignment in a shared latent space. The learned image embeddings are then used as continuous traits for Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across multiple visual backbones and feature adaptation strategies demonstrate that multimodal alignment improves topological agreement with the reference phylogeny. The results indicate that the proposed framework can derive morphology-aware visual traits for computational phylogenetic reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Multimodal alignment lifts topology agreement on Rove-Tree-11 but the embeddings lack any reported phylogenetic signal diagnostics.

read the letter

The main point is that adding supervised contrastive alignment between insect images and curated morphological text produces embeddings that, when treated as continuous traits, improve agreement with a reference phylogeny compared to image-only baselines.

What is new is the explicit step of injecting morphological semantics via image-text alignment before the Bayesian step. Earlier single-modality work stopped at visual features from CNNs or ViTs. The paper adapts a vision transformer with parameter-efficient fine-tuning and contrastive loss, then feeds the resulting vectors directly into the tree reconstruction. Ablations across backbones and adaptation choices are included, which is a basic but useful check.

The experiments use a public dataset, so the setup is at least testable. The reported improvement in topological agreement is the concrete result.

The soft spot is the missing verification that the embeddings actually encode heritable phylogenetic signal rather than dataset artifacts. The abstract gives no Pagel's lambda, Blomberg's K, or Mantel tests on the final traits, no comparison to independent character matrices, and no error bars or significance numbers on the tree metrics. Without those, it is hard to rule out that the gain comes from imaging conditions or text phrasing that happen to correlate with the particular split. Dataset handling and statistical details are also thin.

This is for people already working on image-based trait extraction in computational phylogenetics, especially entomology. A reader who wants a concrete multimodal recipe to adapt and test further will find the method straightforward.

It deserves peer review. The core idea is coherent, the data are public, and the experiments exist even if they need more phylogenetic diagnostics and clearer stats. A referee can ask for those checks without starting from zero.

Referee Report

3 major / 2 minor

Summary. The paper proposes a morphology-aware multimodal framework that fine-tunes a vision transformer with supervised contrastive learning to align insect specimen images and curated morphological text descriptions in a shared latent space; the resulting image embeddings are supplied as continuous traits to Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across visual backbones are reported to show improved topological agreement with a reference phylogeny relative to single-modality baselines.

Significance. If the central claim is substantiated, the work would demonstrate a practical route for injecting semantic morphological information into large-scale phylogenetic pipelines without manual character coding. The reliance on a public dataset and systematic ablations across backbones strengthens the potential for follow-up studies, though the absence of phylogenetic-signal diagnostics leaves open whether the reported gains reflect heritable morphology or dataset-specific correlations.

major comments (3)

[Results section] Results section (and abstract): the reported improvements in topological agreement are presented without statistical significance tests, error bars, or repeated runs with different random seeds or data splits; this directly undermines the claim that multimodal alignment reliably outperforms single-modality baselines.
[Methods (§3) and Results] Methods (§3) and Results: no phylogenetic signal diagnostics (Pagel’s λ, Blomberg’s K, or Mantel tests against independent character matrices) are applied to the final embeddings before they are used as continuous traits; without these, it is impossible to confirm that the contrastive alignment extracts heritable morphological variation rather than imaging or text artifacts.
[Experimental setup] Experimental setup: the manuscript does not specify the exact train/validation/test splits of Rove-Tree-11, the embedding dimensionality reduction (if any) prior to the multivariate Brownian/OU model, or whether the reference phylogeny was held completely out of the contrastive training; each of these choices is load-bearing for the central claim of improved topological agreement.

minor comments (2)

[Methods] Notation for the contrastive loss and the parameter-efficient fine-tuning adapters is introduced without an explicit equation or diagram, making the precise alignment objective difficult to reproduce from the text alone.
[Figures] Figure captions for the ablation plots do not state the number of independent runs or the metric aggregation method (mean, median, etc.).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas to strengthen the statistical rigor, validation of phylogenetic signal, and experimental transparency in our work. We respond to each major comment below.

read point-by-point responses

Referee: [Results section] Results section (and abstract): the reported improvements in topological agreement are presented without statistical significance tests, error bars, or repeated runs with different random seeds or data splits; this directly undermines the claim that multimodal alignment reliably outperforms single-modality baselines.

Authors: We agree that the lack of statistical tests and variability reporting limits the strength of the reliability claims. The original experiments used a single fixed seed for reproducibility. In the revision we will rerun all comparative and ablation experiments across five different random seeds, report mean topological agreement metrics with standard deviations, and include statistical significance tests (e.g., paired Wilcoxon signed-rank tests) between the multimodal approach and single-modality baselines. Updated results and p-values will appear in the Results section and be reflected in the abstract. revision: yes
Referee: [Methods (§3) and Results] Methods (§3) and Results: no phylogenetic signal diagnostics (Pagel’s λ, Blomberg’s K, or Mantel tests against independent character matrices) are applied to the final embeddings before they are used as continuous traits; without these, it is impossible to confirm that the contrastive alignment extracts heritable morphological variation rather than imaging or text artifacts.

Authors: We acknowledge the importance of these diagnostics. We will add Pagel’s λ and Blomberg’s K computed on the learned embeddings using the reference phylogeny to quantify phylogenetic signal. However, Rove-Tree-11 does not supply independent character matrices, precluding Mantel tests; we will therefore rely on the signal diagnostics together with the observed gains in topological agreement as supporting evidence. A new subsection will be added to Methods and the resulting values reported in Results. revision: partial
Referee: [Experimental setup] Experimental setup: the manuscript does not specify the exact train/validation/test splits of Rove-Tree-11, the embedding dimensionality reduction (if any) prior to the multivariate Brownian/OU model, or whether the reference phylogeny was held completely out of the contrastive training; each of these choices is load-bearing for the central claim of improved topological agreement.

Authors: We apologize for these omissions. The Rove-Tree-11 dataset was partitioned 70/15/15 for contrastive training/validation/test with no specimen overlap between splits. The reference phylogeny was held completely out of contrastive training. No dimensionality reduction was performed; the full 768-dimensional ViT embeddings were supplied directly to the multivariate Brownian motion model. These specifications will be added to the Experimental setup section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard methods and public data

full rationale

The paper's chain consists of applying off-the-shelf supervised contrastive learning (parameter-efficient fine-tuning of ViT) to align images with curated text descriptions on the public Rove-Tree-11 dataset, then supplying the resulting embeddings as continuous traits to a standard Bayesian phylogenetic pipeline. No equations, fitted parameters, or predictions are shown to reduce to the inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central empirical claim (improved topological agreement) is evaluated against an external reference phylogeny and ablations, making the derivation self-contained against public benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or invented entities; the framework implicitly rests on domain assumptions about morphological data.

axioms (1)

domain assumption Morphological descriptions contain phylogenetic signal that can be aligned with visual features
Central to the proposed image-text alignment and trait extraction

pith-pipeline@v0.9.1-grok · 5699 in / 1036 out tokens · 25241 ms · 2026-06-26T12:33:43.697360+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 16 canonical work pages

[1]

The role of morphological data in phylogeny reconstruction

J. J. Wiens, “The role of morphological data in phylogeny reconstruction.” Syst. Biol., vol. 53, no. 4, pp. 653–661, Aug. 2004, doi: 10.1080/10635150490472959

work page doi:10.1080/10635150490472959 2004
[2]

Morphological phylogenetics in the genomic age

M. S. Y. Lee and A. Palci, “Morphological phylogenetics in the genomic age.” Curr. Biol., vol. 25, no. 19, pp. R922–R929, Oct. 2015, doi: 10.1016/j.cub.2015.07.009

work page doi:10.1016/j.cub.2015.07.009 2015
[3]

Morphology should not be forgotten in the era of genomics- a phylogenetic perspective

G. Giribet, “Morphology should not be forgotten in the era of genomics- a phylogenetic perspective.” Zool. Anz., vol. 256, pp. 96–103, May 2015, doi: 10.1016/j.jcz.2015.01.003

work page doi:10.1016/j.jcz.2015.01.003 2015
[4]

Felsenstein, Inferring Phylogenies

J. Felsenstein, Inferring Phylogenies. Sunderland, MA, USA: Sinauer Associates, 2004

2004
[5]

Learning transferable visual models from natural language supervision

A. Radford et al., “ Learning transferable visual models from natural language supervision.” in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

2021
[6]

Maximum-likelihood estimation of evolutionary trees from continuous characters

J. Felsenstein, “ Maximum-likelihood estimation of evolutionary trees from continuous characters.” Amer. J. Hum. Genet., vol. 25, no. 5, pp. 471, Sep. 1973

1973
[7]

Phylogenies and quantitative characters

J. Felsenstein, “ Phylogenies and quantitative characters. ” Annu. Rev. Ecol. Syst., pp. 445–471, Jan. 1988

1988
[8]

Rove-Tree-11: The not-so-wild rover, a hierarchically structured image dataset for deep metric learning research

R. Hunt and K. S. Pedersen, “ Rove-Tree-11: The not-so-wild rover, a hierarchically structured image dataset for deep metric learning research.” in Proc. Asian Conf. Comput. Vis. (ACCV), pp. 2967–2983, 2022

2022
[9]

Integrating deep learning-derived morphological traits and molecular data for total-evidence phylogenetics: Lessons from digitized collections

R. Hunt et al., “ Integrating deep learning-derived morphological traits and molecular data for total-evidence phylogenetics: Lessons from digitized collections.” Syst. Biol., vol. 74, no. 3, pp. 453–468, 2025

2025
[10]

Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model

J. F. Hoyal Cuthill et al., “Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model. ” Sci. Adv., vol. 5, no. 8, Aug. 2019, Art. no. eaaw4967, doi: 10.1126/sciadv.aaw4967

work page doi:10.1126/sciadv.aaw4967 2019
[11]

An image is worth 16 × 16 words: Transformers for image recognition at scale

A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers for image recognition at scale. ” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021

2021
[12]

Emerging properties in self-supervised vision transformers

M. Caron et al., “ Emerging properties in self-supervised vision transformers.” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9650–9660

2021
[13]

DINOv2: Learning robust visual features without supervision,

M. Oquab et al., “ DINOv2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res., pp. 1–32, Jan. 2024

2024
[14]

EV A-02: A visual representation for neon genesis.Image and Vision Computing, 149: 105171, 2024

Y. Fang et al., “ EVA-02: A visual representation for neon genesis. ” Image Vis. Comput., vol. 149, Sep. 2024, doi: 10.1016/j.imavis.2024.105171

work page doi:10.1016/j.imavis.2024.105171 2024
[15]

BEiT v2: Masked image modeling with vector-quantized visual tokenizers

Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “BEiT v2: Masked image modeling with vector-quantized visual tokenizers. ” 2022, arXiv:2208.06366

arXiv 2022
[16]

ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders

S. Woo et al., “ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders.” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 16133–16142

2023
[17]

BioCLIP: A vision foundation model for the tree of life

S. Stevens et al., “BioCLIP: A vision foundation model for the tree of life.” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 19412–19424
[18]

Multi-modal molecule structure–text model for text-based retrieval and editing,

S. Liu et al., “Multi-modal molecule structure–text model for text-based retrieval and editing,” Nat. Mach. Intell., vol. 5, no. 12, pp. 1447–1457, Dec. 2023

2023
[19]

A molecular multimodal foundation model associating molecule graphs with natural language

B. Su et al., “ A molecular multimodal foundation model associating molecule graphs with natural language.” 2022, arXiv:2209.05481

arXiv 2022
[20]

Extracting molecular properties from natural language with multimodal contrastive learning

R. Lacombe et al., “ Extracting molecular properties from natural language with multimodal contrastive learning. ” 2023, arXiv:2307.12996

arXiv 2023
[21]

MMCL: A multi-modal contrastive learning framework for molecular property prediction

M. Gao and F. Zhu, “ MMCL: A multi-modal contrastive learning framework for molecular property prediction. ” IEEE/ACM Trans. Comput. Biol. Bioinf., doi: 10.1109/TCBBIO.2026.3663206

work page doi:10.1109/tcbbio.2026.3663206 2026
[22]

PTPPI: A study on protein inhibitor prediction methods using multimodal feature fusion and attention mechanism

Z. Dong et al., “PTPPI: A study on protein inhibitor prediction methods using multimodal feature fusion and attention mechanism.” IEEE/ACM Trans. Comput. Biol. Bioinf., doi: 10.1109/TCBBIO.2026.3657905

work page doi:10.1109/tcbbio.2026.3657905 2026
[23]

LoRA: Low-rank adaptation of large language models

E. J. Hu et al., “LoRA: Low-rank adaptation of large language models.” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022

2022
[24]

Supervised contrastive learning

P. Khosla et al., “Supervised contrastive learning.” in Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 18661–18673

2020
[25]

BioClinical ModernBERT: A state-of-the-art long- context encoder for biomedical and clinical NLP

T. Sounack et al., “ BioClinical ModernBERT: A state-of-the-art long- context encoder for biomedical and clinical NLP. ” 2025, arXiv:2506.10896

arXiv 2025
[26]

RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language

S. Höhna et al., “ RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language.” Syst. Biol., vol. 65, no. 4, pp. 726–736, Jul. 2016, doi: 10.1093/sysbio/syw021

work page doi:10.1093/sysbio/syw021 2016
[27]

Comparison of phylogenetic trees

D. F. Robinson and L. R. Foulds, “Comparison of phylogenetic trees.” Math. Biosci., vol. 53, no. 1-2, pp. 131 – 147, Feb. 1981, doi: 10.1016/0025-5564(81)90043-2

work page doi:10.1016/0025-5564(81)90043-2 1981
[28]

Introducing GPT-5.2

OpenAI, “ Introducing GPT-5.2 ” 2025. [Online]. Available: https://openai.com/index/introducing-gpt-5-2/. Accessed: Jan. 15, 2026

2025
[29]

Siméoni et al., “DINOv3” 2025, arXiv:2508.10104

O. Siméoni et al., “DINOv3” 2025, arXiv:2508.10104

Pith/arXiv arXiv 2025
[30]

Evidence for an ancient adaptive episode of convergent molecular evolution,

T. A. Castoe et al., “ Evidence for an ancient adaptive episode of convergent molecular evolution,” Proc. Natl. Acad. Sci. USA, vol. 106, no. 22, pp. 8986–8991, 2009

2009
[31]

Can phylogenetics identify C4 origins and reversals?

P.-A. Christin, R. P. Freckleton, and C. P. Osborne, “Can phylogenetics identify C4 origins and reversals?” Trends Ecol. Evol., vol. 25, no. 7, pp. 403–409, Jul. 2010

2010
[32]

Convergent adaptation in mitochondria of phylogenetically distant birds: Does it exist?

V. Burskaia et al., “ Convergent adaptation in mitochondria of phylogenetically distant birds: Does it exist?” Genome Biol. Evol., vol. 13, no. 7, Jul 2021, evab113

2021
[33]

Gene tree discordance, phylogenetic inference and the multispecies coalescent

J. H. Degnan and N. A. Rosenberg, “ Gene tree discordance, phylogenetic inference and the multispecies coalescent. ” Trends Ecol. Evol., vol. 24, no. 6, pp. 332 – 340, Jun. 2009, doi: 10.1016/j.tree.2009.01.009

work page doi:10.1016/j.tree.2009.01.009 2009
[34]

Resolving difficult phylogenetic questions: Why more sequences are not enough

H. Philippe et al., “ Resolving difficult phylogenetic questions: Why more sequences are not enough.” PLoS Biol., vol. 9, no. 3, Mar. 2011, doi: 10.1371/journal.pbio.1000602

work page doi:10.1371/journal.pbio.1000602 2011
[35]

OrthoFinder: Phylogenetic orthology inference for comparative genomics

D. M. Emms and S. Kelly, “ OrthoFinder: Phylogenetic orthology inference for comparative genomics.” Genome Biol., vol. 20, no. 1, pp. 238, Nov. 2019, doi: 10.1186/s13059-019-1832-y

work page doi:10.1186/s13059-019-1832-y 2019
[36]

UFBoot2: Improving the ultrafast bootstrap approximation

D. T. Hoang et al., “ UFBoot2: Improving the ultrafast bootstrap approximation.” Mol. Biol. Evol., vol. 35, no. 2, pp. 518–522, Feb. 2018, doi: 10.1093/molbev/msx281

work page doi:10.1093/molbev/msx281 2018
[37]

ModelFinder: Fast model selection for accurate phylogenetic estimates

S. Kalyaanamoorthy et al., “ ModelFinder: Fast model selection for accurate phylogenetic estimates.” Nat. Methods, vol. 14, no. 6, pp. 587– 589, Jun. 2017, doi: 10.1038/nmeth.4285

work page doi:10.1038/nmeth.4285 2017
[38]

doi: 10.1093/molbev/mst010

K. Katoh and D. M. Standley, “ MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. ” Mol. Biol. Evol., vol. 30, no. 4, pp. 772 – 780, Jan. 2013, doi: 10.1093/molbev/mst010

work page doi:10.1093/molbev/mst010 2013
[39]

IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

B. Q. Minh et al., “IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era.” Mol. Biol. Evol., vol. 37, no. 5, pp. 1530–1534, May 2020, doi: 10.1093/molbev/msaa015

work page doi:10.1093/molbev/msaa015 2020
[40]

Yang, Computational Molecular Evolution

Z. Yang, Computational Molecular Evolution. Oxford, U.K.: Oxford Univ. Press, 2006

2006

[1] [1]

The role of morphological data in phylogeny reconstruction

J. J. Wiens, “The role of morphological data in phylogeny reconstruction.” Syst. Biol., vol. 53, no. 4, pp. 653–661, Aug. 2004, doi: 10.1080/10635150490472959

work page doi:10.1080/10635150490472959 2004

[2] [2]

Morphological phylogenetics in the genomic age

M. S. Y. Lee and A. Palci, “Morphological phylogenetics in the genomic age.” Curr. Biol., vol. 25, no. 19, pp. R922–R929, Oct. 2015, doi: 10.1016/j.cub.2015.07.009

work page doi:10.1016/j.cub.2015.07.009 2015

[3] [3]

Morphology should not be forgotten in the era of genomics- a phylogenetic perspective

G. Giribet, “Morphology should not be forgotten in the era of genomics- a phylogenetic perspective.” Zool. Anz., vol. 256, pp. 96–103, May 2015, doi: 10.1016/j.jcz.2015.01.003

work page doi:10.1016/j.jcz.2015.01.003 2015

[4] [4]

Felsenstein, Inferring Phylogenies

J. Felsenstein, Inferring Phylogenies. Sunderland, MA, USA: Sinauer Associates, 2004

2004

[5] [5]

Learning transferable visual models from natural language supervision

A. Radford et al., “ Learning transferable visual models from natural language supervision.” in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

2021

[6] [6]

Maximum-likelihood estimation of evolutionary trees from continuous characters

J. Felsenstein, “ Maximum-likelihood estimation of evolutionary trees from continuous characters.” Amer. J. Hum. Genet., vol. 25, no. 5, pp. 471, Sep. 1973

1973

[7] [7]

Phylogenies and quantitative characters

J. Felsenstein, “ Phylogenies and quantitative characters. ” Annu. Rev. Ecol. Syst., pp. 445–471, Jan. 1988

1988

[8] [8]

Rove-Tree-11: The not-so-wild rover, a hierarchically structured image dataset for deep metric learning research

R. Hunt and K. S. Pedersen, “ Rove-Tree-11: The not-so-wild rover, a hierarchically structured image dataset for deep metric learning research.” in Proc. Asian Conf. Comput. Vis. (ACCV), pp. 2967–2983, 2022

2022

[9] [9]

Integrating deep learning-derived morphological traits and molecular data for total-evidence phylogenetics: Lessons from digitized collections

R. Hunt et al., “ Integrating deep learning-derived morphological traits and molecular data for total-evidence phylogenetics: Lessons from digitized collections.” Syst. Biol., vol. 74, no. 3, pp. 453–468, 2025

2025

[10] [10]

Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model

J. F. Hoyal Cuthill et al., “Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model. ” Sci. Adv., vol. 5, no. 8, Aug. 2019, Art. no. eaaw4967, doi: 10.1126/sciadv.aaw4967

work page doi:10.1126/sciadv.aaw4967 2019

[11] [11]

An image is worth 16 × 16 words: Transformers for image recognition at scale

A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers for image recognition at scale. ” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021

2021

[12] [12]

Emerging properties in self-supervised vision transformers

M. Caron et al., “ Emerging properties in self-supervised vision transformers.” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9650–9660

2021

[13] [13]

DINOv2: Learning robust visual features without supervision,

M. Oquab et al., “ DINOv2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res., pp. 1–32, Jan. 2024

2024

[14] [14]

EV A-02: A visual representation for neon genesis.Image and Vision Computing, 149: 105171, 2024

Y. Fang et al., “ EVA-02: A visual representation for neon genesis. ” Image Vis. Comput., vol. 149, Sep. 2024, doi: 10.1016/j.imavis.2024.105171

work page doi:10.1016/j.imavis.2024.105171 2024

[15] [15]

BEiT v2: Masked image modeling with vector-quantized visual tokenizers

Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “BEiT v2: Masked image modeling with vector-quantized visual tokenizers. ” 2022, arXiv:2208.06366

arXiv 2022

[16] [16]

ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders

S. Woo et al., “ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders.” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 16133–16142

2023

[17] [17]

BioCLIP: A vision foundation model for the tree of life

S. Stevens et al., “BioCLIP: A vision foundation model for the tree of life.” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 19412–19424

[18] [18]

Multi-modal molecule structure–text model for text-based retrieval and editing,

S. Liu et al., “Multi-modal molecule structure–text model for text-based retrieval and editing,” Nat. Mach. Intell., vol. 5, no. 12, pp. 1447–1457, Dec. 2023

2023

[19] [19]

A molecular multimodal foundation model associating molecule graphs with natural language

B. Su et al., “ A molecular multimodal foundation model associating molecule graphs with natural language.” 2022, arXiv:2209.05481

arXiv 2022

[20] [20]

Extracting molecular properties from natural language with multimodal contrastive learning

R. Lacombe et al., “ Extracting molecular properties from natural language with multimodal contrastive learning. ” 2023, arXiv:2307.12996

arXiv 2023

[21] [21]

MMCL: A multi-modal contrastive learning framework for molecular property prediction

M. Gao and F. Zhu, “ MMCL: A multi-modal contrastive learning framework for molecular property prediction. ” IEEE/ACM Trans. Comput. Biol. Bioinf., doi: 10.1109/TCBBIO.2026.3663206

work page doi:10.1109/tcbbio.2026.3663206 2026

[22] [22]

PTPPI: A study on protein inhibitor prediction methods using multimodal feature fusion and attention mechanism

Z. Dong et al., “PTPPI: A study on protein inhibitor prediction methods using multimodal feature fusion and attention mechanism.” IEEE/ACM Trans. Comput. Biol. Bioinf., doi: 10.1109/TCBBIO.2026.3657905

work page doi:10.1109/tcbbio.2026.3657905 2026

[23] [23]

LoRA: Low-rank adaptation of large language models

E. J. Hu et al., “LoRA: Low-rank adaptation of large language models.” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022

2022

[24] [24]

Supervised contrastive learning

P. Khosla et al., “Supervised contrastive learning.” in Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 18661–18673

2020

[25] [25]

BioClinical ModernBERT: A state-of-the-art long- context encoder for biomedical and clinical NLP

T. Sounack et al., “ BioClinical ModernBERT: A state-of-the-art long- context encoder for biomedical and clinical NLP. ” 2025, arXiv:2506.10896

arXiv 2025

[26] [26]

RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language

S. Höhna et al., “ RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language.” Syst. Biol., vol. 65, no. 4, pp. 726–736, Jul. 2016, doi: 10.1093/sysbio/syw021

work page doi:10.1093/sysbio/syw021 2016

[27] [27]

Comparison of phylogenetic trees

D. F. Robinson and L. R. Foulds, “Comparison of phylogenetic trees.” Math. Biosci., vol. 53, no. 1-2, pp. 131 – 147, Feb. 1981, doi: 10.1016/0025-5564(81)90043-2

work page doi:10.1016/0025-5564(81)90043-2 1981

[28] [28]

Introducing GPT-5.2

OpenAI, “ Introducing GPT-5.2 ” 2025. [Online]. Available: https://openai.com/index/introducing-gpt-5-2/. Accessed: Jan. 15, 2026

2025

[29] [29]

Siméoni et al., “DINOv3” 2025, arXiv:2508.10104

O. Siméoni et al., “DINOv3” 2025, arXiv:2508.10104

Pith/arXiv arXiv 2025

[30] [30]

Evidence for an ancient adaptive episode of convergent molecular evolution,

T. A. Castoe et al., “ Evidence for an ancient adaptive episode of convergent molecular evolution,” Proc. Natl. Acad. Sci. USA, vol. 106, no. 22, pp. 8986–8991, 2009

2009

[31] [31]

Can phylogenetics identify C4 origins and reversals?

P.-A. Christin, R. P. Freckleton, and C. P. Osborne, “Can phylogenetics identify C4 origins and reversals?” Trends Ecol. Evol., vol. 25, no. 7, pp. 403–409, Jul. 2010

2010

[32] [32]

Convergent adaptation in mitochondria of phylogenetically distant birds: Does it exist?

V. Burskaia et al., “ Convergent adaptation in mitochondria of phylogenetically distant birds: Does it exist?” Genome Biol. Evol., vol. 13, no. 7, Jul 2021, evab113

2021

[33] [33]

Gene tree discordance, phylogenetic inference and the multispecies coalescent

J. H. Degnan and N. A. Rosenberg, “ Gene tree discordance, phylogenetic inference and the multispecies coalescent. ” Trends Ecol. Evol., vol. 24, no. 6, pp. 332 – 340, Jun. 2009, doi: 10.1016/j.tree.2009.01.009

work page doi:10.1016/j.tree.2009.01.009 2009

[34] [34]

Resolving difficult phylogenetic questions: Why more sequences are not enough

H. Philippe et al., “ Resolving difficult phylogenetic questions: Why more sequences are not enough.” PLoS Biol., vol. 9, no. 3, Mar. 2011, doi: 10.1371/journal.pbio.1000602

work page doi:10.1371/journal.pbio.1000602 2011

[35] [35]

OrthoFinder: Phylogenetic orthology inference for comparative genomics

D. M. Emms and S. Kelly, “ OrthoFinder: Phylogenetic orthology inference for comparative genomics.” Genome Biol., vol. 20, no. 1, pp. 238, Nov. 2019, doi: 10.1186/s13059-019-1832-y

work page doi:10.1186/s13059-019-1832-y 2019

[36] [36]

UFBoot2: Improving the ultrafast bootstrap approximation

D. T. Hoang et al., “ UFBoot2: Improving the ultrafast bootstrap approximation.” Mol. Biol. Evol., vol. 35, no. 2, pp. 518–522, Feb. 2018, doi: 10.1093/molbev/msx281

work page doi:10.1093/molbev/msx281 2018

[37] [37]

ModelFinder: Fast model selection for accurate phylogenetic estimates

S. Kalyaanamoorthy et al., “ ModelFinder: Fast model selection for accurate phylogenetic estimates.” Nat. Methods, vol. 14, no. 6, pp. 587– 589, Jun. 2017, doi: 10.1038/nmeth.4285

work page doi:10.1038/nmeth.4285 2017

[38] [38]

doi: 10.1093/molbev/mst010

K. Katoh and D. M. Standley, “ MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. ” Mol. Biol. Evol., vol. 30, no. 4, pp. 772 – 780, Jan. 2013, doi: 10.1093/molbev/mst010

work page doi:10.1093/molbev/mst010 2013

[39] [39]

IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

B. Q. Minh et al., “IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era.” Mol. Biol. Evol., vol. 37, no. 5, pp. 1530–1534, May 2020, doi: 10.1093/molbev/msaa015

work page doi:10.1093/molbev/msaa015 2020

[40] [40]

Yang, Computational Molecular Evolution

Z. Yang, Computational Molecular Evolution. Oxford, U.K.: Oxford Univ. Press, 2006

2006