Morphology-Aware Multimodal Representation Learning for Insect Phylogenetic Reconstruction
Pith reviewed 2026-06-26 12:33 UTC · model grok-4.3
The pith
Multimodal alignment of insect images with morphological descriptions yields embeddings that improve agreement with reference phylogenies in Bayesian reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a morphology-aware multimodal alignment framework, which combines specimen images with morphological descriptions through vision transformer adaptation and image-text alignment, yields image embeddings that, when used as continuous traits in Bayesian reconstruction, agree more closely with the reference phylogeny than embeddings from single-modality approaches.
What carries the argument
The morphology-aware multimodal alignment framework: a vision transformer is adapted through parameter-efficient fine-tuning and supervised contrastive learning, and its image embeddings are then aligned with text embeddings of the morphological descriptions in a shared latent space.
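The referee report notes that the paper gives no explicit equation for the alignment objective, so the following is only a minimal sketch of one common form: a supervised, temperature-scaled contrastive loss in which each image is pulled toward every description that shares its label, averaged over both directions. The function names, the temperature of 0.07, and the label-based definition of positives are assumptions for illustration, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def directional_loss(anchors, targets, anchor_labels, target_labels, temperature):
    """Average negative log-probability of same-label targets for each anchor."""
    anchors = [l2_normalize(v) for v in anchors]
    targets = [l2_normalize(v) for v in targets]
    total, counted = 0.0, 0
    for vec, label in zip(anchors, anchor_labels):
        logits = [sum(a * b for a, b in zip(vec, t)) / temperature for t in targets]
        # Numerically stable log-softmax over all targets.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        positives = [j for j, lab in enumerate(target_labels) if lab == label]
        if not positives:
            continue  # an anchor without a same-label target contributes nothing
        total += -sum(logits[j] - log_norm for j in positives) / len(positives)
        counted += 1
    if counted == 0:
        raise ValueError("no anchor has a same-label target")
    return total / counted


def symmetric_alignment_loss(img_emb, txt_emb, img_labels, txt_labels, temperature=0.07):
    """Mean of the image-to-text and text-to-image supervised contrastive losses."""
    i2t = directional_loss(img_emb, txt_emb, img_labels, txt_labels, temperature)
    t2i = directional_loss(txt_emb, img_emb, txt_labels, img_labels, temperature)
    return 0.5 * (i2t + t2i)
```

The loss is close to zero when same-label images and descriptions point in the same direction and grows as they are pushed apart, which is the behaviour an alignment step of this kind relies on.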
If this is right
- Image embeddings from the aligned model serve as continuous traits that capture more phylogenetic signal than visual-only features.
- Multimodal alignment produces higher topological agreement metrics than single-modality baselines across tested visual backbones.
- The framework enables direct use of existing morphological text alongside images without manual discretization of traits.
- Ablation results attribute performance gains specifically to the image-text alignment step.
Where Pith is reading between the lines
- The approach could reduce dependence on fully expert-coded discrete characters if morphological descriptions already exist for imaged specimens.
- It may generalize to other organism groups provided paired image-description datasets are available.
- Future work could test whether the continuous traits combine productively with molecular sequence data in joint reconstructions.
Load-bearing premise
Curated morphological descriptions accurately encode phylogenetic signal that image-text alignment can capture and transfer into continuous visual traits for Bayesian reconstruction.
What would settle it
Finding equal or lower topological agreement with the reference phylogeny when the multimodal-aligned embeddings replace single-modality image embeddings in Bayesian reconstruction on the Rove-Tree-11 dataset.
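To make "topological agreement" concrete, here is a minimal sketch of one widely used tree-comparison measure, the Robinson-Foulds distance, which counts the bipartitions present in one unrooted tree but not the other. This summary does not say which metric the paper reports, so treating agreement as Robinson-Foulds distance is an assumption, as are the nested-tuple tree encoding and the five-taxon toy trees.

```python
def leaf_set(tree):
    """Return all leaf names under a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions of a tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed taxon used to pick one side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store the side that does not contain the anchor taxon.
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical trees over five taxa, for illustration only.
reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(reference, inferred)
```

A lower distance means closer agreement with the reference. Under this reading, the test above amounts to checking whether trees built from the aligned embeddings have a smaller distance to the reference than trees built from image-only embeddings.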
Original abstract
Morphological traits provide important evidence for phylogenetic reconstruction and evolutionary relationship analysis. Recent image-based approaches have introduced deep learning, particularly convolutional models, to derive morphological features from specimen images, but these methods generally rely on single-modality visual representations and do not explicitly incorporate morphological semantics. This study proposes a morphology-aware multimodal alignment framework for insect phylogenetic reconstruction. The framework combines specimen images with curated morphological descriptions by adapting a vision transformer through parameter-efficient fine-tuning and supervised contrastive learning, followed by image-text alignment in a shared latent space. The learned image embeddings are then used as continuous traits for Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across multiple visual backbones and feature adaptation strategies demonstrate that multimodal alignment improves topological agreement with the reference phylogeny. The results indicate that the proposed framework can derive morphology-aware visual traits for computational phylogenetic reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a morphology-aware multimodal framework that fine-tunes a vision transformer with supervised contrastive learning to align insect specimen images and curated morphological text descriptions in a shared latent space; the resulting image embeddings are supplied as continuous traits to Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across visual backbones are reported to show improved topological agreement with a reference phylogeny relative to single-modality baselines.
Significance. If the central claim is substantiated, the work would demonstrate a practical route for injecting semantic morphological information into large-scale phylogenetic pipelines without manual character coding. The reliance on a public dataset and systematic ablations across backbones strengthens the potential for follow-up studies, though the absence of phylogenetic-signal diagnostics leaves open whether the reported gains reflect heritable morphology or dataset-specific correlations.
major comments (3)
- [Results section] Results section (and abstract): the reported improvements in topological agreement are presented without statistical significance tests, error bars, or repeated runs with different random seeds or data splits; this directly undermines the claim that multimodal alignment reliably outperforms single-modality baselines.
- [Methods (§3) and Results] Methods (§3) and Results: no phylogenetic signal diagnostics (Pagel’s λ, Blomberg’s K, or Mantel tests against independent character matrices) are applied to the final embeddings before they are used as continuous traits; without these, it is impossible to confirm that the contrastive alignment extracts heritable morphological variation rather than imaging or text artifacts.
- [Experimental setup] Experimental setup: the manuscript does not specify the exact train/validation/test splits of Rove-Tree-11, the embedding dimensionality reduction (if any) prior to the multivariate Brownian/OU model, or whether the reference phylogeny was held completely out of the contrastive training; each of these choices is load-bearing for the central claim of improved topological agreement.
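One of the diagnostics requested in the second major comment, Blomberg's K, can be sketched as follows for a single trait. The sketch assumes the caller supplies `cov`, the Brownian-motion covariance matrix implied by the reference tree (each entry is the branch length two tips share from the root). The helper names, the four-tip matrix, and the trait values are hypothetical. K is a univariate statistic, so a high-dimensional embedding would need it applied per dimension or replaced by a multivariate generalization; the paper's eventual choice is not known here.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def blomberg_k(trait, cov):
    """Blomberg's K: observed MSE0/MSE divided by its Brownian-motion expectation."""
    n = len(trait)
    inv = invert(cov)
    inv_sum = sum(sum(row) for row in inv)
    # Generalized least-squares estimate of the root (phylogenetic mean) state.
    root = sum(inv[i][j] * trait[j] for i in range(n) for j in range(n)) / inv_sum
    resid = [x - root for x in trait]
    mse0 = sum(r * r for r in resid) / (n - 1)
    mse = sum(resid[i] * inv[i][j] * resid[j]
              for i in range(n) for j in range(n)) / (n - 1)
    expected = (sum(cov[i][i] for i in range(n)) - n / inv_sum) / (n - 1)
    return (mse0 / mse) / expected


# Hypothetical tree with two sister pairs: tips 0-1 and tips 2-3 share history.
cov = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
k_clustered = blomberg_k([1.0, 1.1, 5.0, 5.2], cov)  # sisters look alike
k_scrambled = blomberg_k([1.0, 5.0, 1.1, 5.2], cov)  # sisters look different
```

Values of K above 1 indicate more similarity among close relatives than Brownian motion predicts, and values below 1 indicate less, so a trait that only tracks imaging artifacts would be expected to score low.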
minor comments (2)
- [Methods] Notation for the contrastive loss and the parameter-efficient fine-tuning adapters is introduced without an explicit equation or diagram, making the precise alignment objective difficult to reproduce from the text alone.
- [Figures] Figure captions for the ablation plots do not state the number of independent runs or the metric aggregation method (mean, median, etc.).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas to strengthen the statistical rigor, validation of phylogenetic signal, and experimental transparency in our work. We respond to each major comment below.
Point-by-point responses
- Referee: [Results section] Results section (and abstract): the reported improvements in topological agreement are presented without statistical significance tests, error bars, or repeated runs with different random seeds or data splits; this directly undermines the claim that multimodal alignment reliably outperforms single-modality baselines.
  Authors: We agree that the lack of statistical tests and variability reporting limits the strength of the reliability claims. The original experiments used a single fixed seed for reproducibility. In the revision we will rerun all comparative and ablation experiments across five different random seeds, report mean topological agreement metrics with standard deviations, and include statistical significance tests (e.g., paired Wilcoxon signed-rank tests) between the multimodal approach and single-modality baselines. Updated results and p-values will appear in the Results section and be reflected in the abstract.
  revision: yes
- Referee: [Methods (§3) and Results] Methods (§3) and Results: no phylogenetic signal diagnostics (Pagel’s λ, Blomberg’s K, or Mantel tests against independent character matrices) are applied to the final embeddings before they are used as continuous traits; without these, it is impossible to confirm that the contrastive alignment extracts heritable morphological variation rather than imaging or text artifacts.
  Authors: We acknowledge the importance of these diagnostics. We will add Pagel’s λ and Blomberg’s K computed on the learned embeddings using the reference phylogeny to quantify phylogenetic signal. However, Rove-Tree-11 does not supply independent character matrices, precluding Mantel tests; we will therefore rely on the signal diagnostics together with the observed gains in topological agreement as supporting evidence. A new subsection will be added to Methods and the resulting values reported in Results.
  revision: partial
- Referee: [Experimental setup] Experimental setup: the manuscript does not specify the exact train/validation/test splits of Rove-Tree-11, the embedding dimensionality reduction (if any) prior to the multivariate Brownian/OU model, or whether the reference phylogeny was held completely out of the contrastive training; each of these choices is load-bearing for the central claim of improved topological agreement.
  Authors: We apologize for these omissions. The Rove-Tree-11 dataset was partitioned 70/15/15 for contrastive training/validation/test with no specimen overlap between splits. The reference phylogeny was held completely out of contrastive training. No dimensionality reduction was performed; the full 768-dimensional ViT embeddings were supplied directly to the multivariate Brownian motion model. These specifications will be added to the Experimental setup section.
  revision: yes
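The seed-level comparison promised in the first response could look like the sketch below, which reports mean and standard deviation over five runs and computes an exact two-sided paired Wilcoxon signed-rank p-value by enumerating all sign patterns. Every score is hypothetical (written as a tree distance where lower is better), and pairing the two methods by seed is an assumption about how the authors would set up the test.

```python
from itertools import product
from statistics import mean, stdev


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test; returns (W+, p-value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Tied absolute differences share the average of their ranks.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if abs(sum(r for r, s in zip(ranks, signs) if s) - centre) >= observed - 1e-12
    )
    return w_plus, extreme / 2 ** n


# Hypothetical per-seed tree distances (lower is better), five seeds each.
baseline = [0.50, 0.52, 0.47, 0.55, 0.49]
multimodal = [0.44, 0.43, 0.45, 0.41, 0.48]

summary = {
    "baseline": (mean(baseline), stdev(baseline)),
    "multimodal": (mean(multimodal), stdev(multimodal)),
}
w_plus, p_value = wilcoxon_signed_rank_exact(multimodal, baseline)
```

Even when the multimodal score is better on all five seeds, as in this toy data, the exact two-sided p-value is 2/2^5 = 0.0625. With only five paired values this test cannot fall below 0.05, so a significance claim would need more seeds, pairing across backbones as well, or a different test.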
Circularity Check
No significant circularity; derivation relies on standard methods and public data
Full rationale
The paper's chain consists of applying off-the-shelf supervised contrastive learning (parameter-efficient fine-tuning of ViT) to align images with curated text descriptions on the public Rove-Tree-11 dataset, then supplying the resulting embeddings as continuous traits to a standard Bayesian phylogenetic pipeline. No equations, fitted parameters, or predictions are shown to reduce to the inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central empirical claim (improved topological agreement) is evaluated against an external reference phylogeny and ablations, making the derivation self-contained against public benchmarks rather than internally forced.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Morphological descriptions contain phylogenetic signal that can be aligned with visual features