Analysis and Explainability of LLMs Via Evolutionary Methods
Pith reviewed 2026-05-09 20:33 UTC · model grok-4.3
The pith
Phylogenetic trees built from LLM weights and text outputs recover the structure of their training histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Relating LLM weights to genotypes and output text to phenotypes allows construction of evolutionary trees that reliably recover the topology of the ground-truth training tree in controlled experiments. Weight-difference analysis identifies the most important layers, phenotypic experiments show that one training dataset contributes more useful information than the others, and an unsupervised evolutionary tree is generated for black-box foundation models, all supported by visualizations of model relationships.
What carries the argument
Phylogenetic tree estimation from distance matrices computed on weight differences (as genotype distances) and on generated text outputs (as phenotype distances).
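The two distance matrices this describes can be sketched in a few lines. The metrics below (L2 on flattened weight vectors for genotypes, cosine distance on text embeddings for phenotypes) are one plausible instantiation for illustration, not necessarily the paper's exact choices:

```python
import numpy as np

def genotype_distances(weight_vectors):
    """Pairwise L2 distances between models' flattened weight vectors."""
    W = np.asarray(weight_vectors, dtype=float)
    diff = W[:, None, :] - W[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def phenotype_distances(text_embeddings):
    """Pairwise cosine distances between embeddings of generated text."""
    E = np.asarray(text_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return 1.0 - E @ E.T

# Toy genotypes: two sibling models drift slightly from a shared ancestor,
# while a third model drifts much further away.
ancestor = np.zeros(4)
models = [ancestor + 0.10, ancestor + 0.12, ancestor + 1.0]
D = genotype_distances(models)  # D[0, 1] is small; D[0, 2] is large
```

Either matrix can then be handed to a distance-based reconstruction method such as neighbor-joining, which is where the phylogenetic machinery takes over.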
If this is right
- Estimated trees recover the topology of the ground-truth training tree in controlled experiments.
- Weight differences between models highlight the layers that contribute most to distinguishing them.
- Phenotypic experiments identify that one training dataset supplies more useful information than the others.
- Unsupervised trees can be built for black-box foundation models using only output text.
- Visualizations clarify evolutionary relationships and lineage among the models.
Where Pith is reading between the lines
- The same distance-based approach could trace how fine-tuning or continued pretraining shifts a model relative to its base version.
- If the analogy holds, the method might help audit model provenance when training details are withheld.
- Extending the genotype-phenotype framing to other architectures could allow similar lineage analysis for vision or multimodal models.
Load-bearing premise
The mapping of LLM weights to genotypes and generated text to phenotypes is close enough to biological data that standard phylogenetic methods will produce accurate and interpretable results about model lineage and training influences.
What would settle it
In the controlled experiment, if the estimated evolutionary tree topology fails to match the known order and branching of the training tree, the claim of reliable recovery would be disproved.
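One way to make that check quantitative is a Robinson-Foulds-style count of clades present in one tree but not the other, where zero means the topologies agree. A minimal sketch for rooted trees encoded as nested tuples, offered as an illustration rather than the paper's stated metric:

```python
def leaf_sets(tree, out):
    """Return the leaf set of `tree`; record the set of every internal node in `out`."""
    if not isinstance(tree, tuple):            # a leaf, named by a string
        return frozenset([tree])
    s = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(s)
    return s

def rf_distance(t1, t2):
    """Count clades found in exactly one of the two rooted trees."""
    c1, c2 = set(), set()
    leaf_sets(t1, c1)
    leaf_sets(t2, c2)
    return len(c1 ^ c2)

truth = (("A", "B"), ("C", "D"))       # ground-truth training tree
recovered = (("B", "A"), ("D", "C"))   # same topology, different ordering
wrong = (("A", "C"), ("B", "D"))       # siblings swapped
```

A positive `rf_distance` against the known training tree would falsify the recovery claim; reporting the value would also address the referee's request for a quantitative accuracy statement.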
read the original abstract
Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to better analyze and explain relationships among models. We show how relating weights to genotypes and output text to phenotypes can improve our understanding of model lineage, important datasets, the roles of different model layers, and visualization of model relationships. We demonstrate this in a controlled experiment, where our estimated evolutionary trees reliably recover the topology of the ground-truth training tree. We further identify the most important weight layers according to weight differences and show through phenotypic experiments that one training dataset appears to contribute more useful information than the others. Finally, we generate an unsupervised evolutionary tree of black-box foundation models. Throughout, we provide visualizations that support a clearer understanding of evolutionary relationships among LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends evolutionary and phylogenetic methods to LLMs by mapping weights to genotypes and generated text to phenotypes. It claims that in a controlled experiment with models derived along a known training hierarchy, the reconstructed evolutionary trees reliably recover the ground-truth topology. Additional results include identifying important weight layers from differences, phenotypic analysis showing one training dataset contributes more, and an unsupervised tree over black-box foundation models, all supported by visualizations for understanding model relationships and training influences.
Significance. If the topology recovery is demonstrated to be non-trivial rather than an artifact of weight distances on a fine-tuning tree, the work could offer a valuable new lens for analyzing LLM lineages, data influences, and layer roles using established biological tools. The visualizations and real-world application to foundation models provide practical utility for explainability in neural networks, though the overall impact hinges on showing that the evolutionary framing adds explanatory power beyond standard clustering.
major comments (2)
- [Abstract and controlled experiment section] Abstract and controlled experiment: the claim that estimated evolutionary trees 'reliably recover the topology of the ground-truth training tree' risks circularity. In a setup where models are forked and fine-tuned along a known hierarchy, weight-vector distances (genotypes) will be smaller for more closely related models by construction; standard distance-based methods (e.g., UPGMA or neighbor-joining) will therefore reconstruct the tree without any evolutionary model or phenotypic signal. The manuscript must specify the distance metric, reconstruction algorithm, and controls (such as phenotype-only reconstruction or comparison to direct hierarchical clustering) to establish that the result is non-trivial.
- [Methods / experimental setup] Experimental setup: details are missing on the precise distance metrics for weights and text, tree-building parameters (e.g., any evolutionary model assumptions or software used), and controls for confounders such as model size or architecture variations. These omissions make it impossible to evaluate whether the reported recovery depends on the biological analogy or would occur under any reasonable distance-based clustering.
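The construction-bias worry above is easy to demonstrate: if fine-tuning is modeled as random drift in weight space, sibling models end up closer than cousins by design, so any distance-based clustering recovers the tree with no evolutionary assumptions at all. A hypothetical simulation (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def fine_tune(parent, step):
    """Model a fine-tuning run as random drift of the weight vector."""
    return parent + step * rng.standard_normal(parent.shape)

# Known training tree: root -> two parents -> two children each.
root = rng.standard_normal(64)
parents = [fine_tune(root, step=1.0) for _ in range(2)]
leaves = [fine_tune(p, step=0.1) for p in parents for _ in range(2)]

# Pairwise L2 distances between the four leaf models.
D = np.array([[np.linalg.norm(a - b) for b in leaves] for a in leaves])

# Leaves 0,1 share a parent, as do leaves 2,3: siblings are closer than
# cousins by construction, so plain clustering on D yields the tree.
sibling_gap = D[0, 1]
cousin_gap = min(D[0, 2], D[0, 3])
```

This is exactly why the paper needs the requested controls: the interesting question is whether the phenotype-only tree, built without weight access, shows the same recovery.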
minor comments (2)
- [Visualizations and notation] Clarify notation for genotype/phenotype mappings and ensure all visualizations include legends that explicitly link colors or branches to model lineages and training steps.
- [Abstract] The abstract would benefit from a brief quantitative statement of recovery accuracy (e.g., topological similarity metric) rather than the qualitative claim of 'reliable' recovery.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the need for greater methodological transparency. We address each major point below and will revise the manuscript to incorporate the requested specifications and controls.
read point-by-point responses
-
Referee: [Abstract and controlled experiment section] Abstract and controlled experiment: the claim that estimated evolutionary trees 'reliably recover the topology of the ground-truth training tree' risks circularity. In a setup where models are forked and fine-tuned along a known hierarchy, weight-vector distances (genotypes) will be smaller for more closely related models by construction; standard distance-based methods (e.g., UPGMA or neighbor-joining) will therefore reconstruct the tree without any evolutionary model or phenotypic signal. The manuscript must specify the distance metric, reconstruction algorithm, and controls (such as phenotype-only reconstruction or comparison to direct hierarchical clustering) to establish that the result is non-trivial.
Authors: We acknowledge this risk of circularity when using weight distances alone. Our approach integrates phenotypic signals from generated text to provide an independent measure of model relationships. In revision, we will explicitly detail the distance metric (Euclidean on flattened weight vectors for genotypes; cosine similarity on text embeddings for phenotypes), the reconstruction algorithm (neighbor-joining), and add controls: a phenotype-only tree and direct comparison to hierarchical clustering on the same distances. These will demonstrate the non-trivial contribution of the evolutionary framing. Revision: yes
-
Referee: [Methods / experimental setup] Experimental setup: details are missing on the precise distance metrics for weights and text, tree-building parameters (e.g., any evolutionary model assumptions or software used), and controls for confounders such as model size or architecture variations. These omissions make it impossible to evaluate whether the reported recovery depends on the biological analogy or would occur under any reasonable distance-based clustering.
Authors: We agree these details were insufficiently specified. The revised Methods section will include: distance metrics (L2 norm for weights, cosine on sentence embeddings for text), tree-building parameters (neighbor-joining with no parametric evolutionary model, implemented in BioPython), and confounder controls (all controlled-experiment models share identical architecture and size, varying only by training data and steps). This will allow assessment of whether the phylogenetic approach adds value beyond standard clustering. Revision: yes
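Biopython's `Bio.Phylo.TreeConstruction` module does ship a neighbor-joining constructor (`DistanceTreeConstructor` with the `nj` method). For readers without the dependency, the algorithm the authors name reduces to a short loop; a library-free sketch returning the topology as nested tuples:

```python
def neighbor_joining(labels, matrix):
    """Minimal neighbor-joining (Saitou & Nei, 1987): repeatedly join the
    pair minimizing the Q criterion. `matrix` is a full symmetric distance
    matrix as a list of lists, indexed in the same order as `labels`."""
    nodes = list(labels)
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(labels) for j, b in enumerate(labels)}
    while len(nodes) > 2:
        n = len(nodes)
        row_sum = {a: sum(d[(a, b)] for b in nodes if b != a) for a in nodes}
        # Q(i, j) = (n - 2) * d(i, j) - rowsum(i) - rowsum(j)
        i, j = min(
            ((a, b) for a in nodes for b in nodes if a != b),
            key=lambda p: (n - 2) * d[p] - row_sum[p[0]] - row_sum[p[1]],
        )
        joined = (i, j)
        for k in nodes:
            if k != i and k != j:
                dk = 0.5 * (d[(i, k)] + d[(j, k)] - d[(i, j)])
                d[(joined, k)] = d[(k, joined)] = dk
        nodes = [k for k in nodes if k != i and k != j] + [joined]
    return tuple(nodes)

# Additive distances generated by the tree ((A,B),(C,D)): siblings at
# distance 2, cross-pairs at distance 4.
M = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
tree = neighbor_joining(["A", "B", "C", "D"], M)
```

On exactly additive distances, neighbor-joining provably recovers the generating topology; on noisy weight or text distances the output is only an estimate, which is why a topology-agreement metric against the ground-truth training tree is needed.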
Circularity Check
No significant circularity in derivation chain
full rationale
The paper applies established phylogenetic methods (e.g., distance-based tree reconstruction) to LLM weights treated as genotypes and text as phenotypes in a controlled fine-tuning experiment. The reported recovery of the ground-truth training tree topology is a demonstration that the methods produce expected outputs on hierarchically related models, rather than a first-principles derivation or prediction that reduces to a fitted parameter or self-referential definition by the paper's own equations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the central result. The approach is validated against external benchmarks for phylogenetic reconstruction, with the biological analogy serving interpretability rather than mathematical necessity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the mapping of LLM weights to genotypes and generated text to phenotypes is a valid and useful analogy for phylogenetic analysis
Reference graph
Works this paper leans on
- [1] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
- [2] C. William Birky Jr. Uniparental inheritance of mitochondrial and chloroplast genes: mechanisms and evolution. Proceedings of the National Academy of Sciences, 92(25):11331–11338, 1995.
- [3] Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. XAI meets LLMs: A survey of the relation between explainable AI and large language models. arXiv preprint arXiv:2407.15248, 2024.
- [4] Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. DialogSum: A real-life scenario dialogue summarization dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 5062–5074, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.449. URL https://aclanthology.org/2021.findings-acl.449.
- [5] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference, 2024. URL https://arxiv.org/abs/2403.04132.
- [6] Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805, 2025.
- [7] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
- [8] Shannon K. Gallagher, Jasmine Ratchford, Tyler Brooks, Bryan P. Brown, Eric Heim, William R. Nichols, Scott McMillan, Swati Rallapalli, Carol J. Smith, Nathan VanHoudnos, et al. Assessing LLMs for high stakes applications. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pages 103–105, 2024.
- [9] Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://www.aclweb.org/anthology/D19-5409.
- [11] Catherine H. Graham, David Storch, and Antonin Machac. Phylogenetic scale in ecology and evolution. Global Ecology and Biogeography, 27(2):175–187, 2018.
- [12] Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283, 2018.
- [13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300.
- [14] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, 2015.
- [15] Eliahu Horwitz, Asaf Shul, and Yedid Hoshen. Unsupervised model tree heritage recovery. arXiv preprint arXiv:2405.18432, 2024.
- [16] Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization, 2021.
- [17] Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337, 2021.
- [18] Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China, November 2019. Association for Computational Linguistics.
- [19] Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022.
- [20] Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods, 2024. URL https://arxiv.org/abs/2412.05579.
- [21] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, et al. Holistic evaluation of language models, 2023.
- [22] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [23] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157, 2003.
- [24] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, et al. On the biology of a large language model. Transformer Circuits Thread, 2025.
- [25] Bohdan Macukow. Neural networks – state of art, brief history, basic models and architecture. In IFIP International Conference on Computer Information Systems and Industrial Management, pages 3–14. Springer, 2016.
- [26] Seyedali Mirjalili. Evolutionary algorithms and neural networks. Studies in Computational Intelligence, 780(1):43–53, 2019.
- [27] David A. Morrison. Phylogenetic networks: a review of methods to display evolutionary history. Annual Research & Review in Biology, 4(10):1518, 2014.
- [28] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
- [29] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- [30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- [31] David F. Robinson and Leslie R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981.
- [32] Allen Roush and Arvind Balaji. DebateSum: A large-scale argument mining and summarization dataset. In Elena Cabrio and Serena Villata, editors, Proceedings of the 7th Workshop on Argument Mining, pages 1–7, Online, December 2020. Association for Computational Linguistics.
- [33] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987. doi: 10.1093/oxfordjournals.molbev.a040454.
- [34] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. doi: 10.18653/v1/P17-1099.
- [35] Eva Sharma, Chen Li, and Lu Wang. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. CoRR, abs/1906.03741, 2019. URL http://arxiv.org/abs/1906.03741.
- [36] Dora Dayu Rahma Turista, Aesthetica Islamy, Viol Dhea Kharisma, and Arif Nur Muhammad Ansori. Distribution of COVID-19 and phylogenetic tree construction of SARS-CoV-2 in Indonesia. J Pure Appl Microbiol, 14(suppl 1):1035–42, 2020.
- [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [38] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.
- [39] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574.
- [40] Amy Willis. Confidence sets for phylogenetic trees. Journal of the American Statistical Association, 114(525):235–244, 2019.
- [41] Xuansheng Wu, Haiyan Zhao, Yaochen Zhu, Yucheng Shi, Fan Yang, Lijie Hu, Tianming Liu, Xiaoming Zhai, Wenlin Yao, Jundong Li, Mengnan Du, and Ninghao Liu. Usable XAI: 10 strategies towards exploiting explainability in the LLM era, 2025. URL https://arxiv.org/abs/2403.08946.
- [42] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond, 2023.
- [43] Nicolas Yax, Pierre-Yves Oudeyer, and Stefano Palminteri. PhyloLM: Inferring the phylogeny of large language models and predicting their performances in benchmarks. arXiv preprint arXiv:2404.04671, 2024.
- [44] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.