pith. machine review for the scientific record.

arxiv: 2605.02930 · v1 · submitted 2026-04-27 · 💻 cs.NE · cs.LG · stat.ML

Recognition: unknown

Analysis and Explainability of LLMs Via Evolutionary Methods

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:33 UTC · model grok-4.3

classification 💻 cs.NE · cs.LG · stat.ML
keywords large language models · evolutionary methods · phylogenetic trees · model lineage · weight differences · phenotypic analysis · training data influence · model explainability

The pith

Phylogenetic trees built from LLM weights and text outputs recover the structure of their training histories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies evolutionary biology methods to large language models by mapping weights to genotypes and generated text to phenotypes. This produces trees that reconstruct known training relationships in controlled tests and reveal which layers and datasets matter most. The approach works even when models are treated as black boxes, using only output text for analysis. It supplies visualizations that make model lineages and influences easier to inspect. The core goal is to give researchers new tools for tracing how models develop and relate without needing full training records.

Core claim

Relating LLM weights to genotypes and output text to phenotypes allows construction of evolutionary trees that reliably recover the topology of the ground-truth training tree in controlled experiments. Weight-difference analysis identifies the most important layers, phenotypic experiments show that one training dataset contributes more useful information than the others, and an unsupervised evolutionary tree is generated for black-box foundation models, all supported by visualizations of model relationships.

What carries the argument

Phylogenetic tree estimation from distance matrices computed on weight differences (as genotype distances) and on generated text outputs (as phenotype distances).
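
A minimal sketch of that pipeline, assuming Euclidean distance on flattened weight vectors for the genotype side and neighbor-joining via BioPython for tree estimation; the paper's exact metrics, preprocessing, and software are not restated here, so every concrete choice below is an illustrative assumption rather than the authors' documented setup.

```python
# Sketch: build a tree over models from pairwise weight ("genotype") distances.
# Assumptions, not the paper's confirmed setup: Euclidean distance on flattened
# weights, neighbor-joining via BioPython for reconstruction.
import numpy as np
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

def flatten_weights(state_dict):
    """Concatenate every parameter tensor of a model into one flat vector."""
    return np.concatenate([np.asarray(v).ravel() for v in state_dict.values()])

def genotype_tree(models):
    """models: dict of model name -> {layer name: weight array}. Returns a Bio.Phylo tree."""
    names = list(models)
    vecs = {n: flatten_weights(models[n]) for n in names}
    # BioPython's DistanceMatrix expects a lower-triangular matrix with a zero diagonal.
    matrix = [
        [float(np.linalg.norm(vecs[names[i]] - vecs[names[j]])) for j in range(i)] + [0.0]
        for i in range(len(names))
    ]
    return DistanceTreeConstructor().nj(DistanceMatrix(names, matrix))

# Toy usage: two children drift from a shared base by different amounts.
rng = np.random.default_rng(0)
base = {"layer": rng.normal(size=64)}
models = {
    "base": base,
    "child_a": {"layer": base["layer"] + 0.1 * rng.normal(size=64)},
    "child_b": {"layer": base["layer"] + 0.5 * rng.normal(size=64)},
}
print(genotype_tree(models))
```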

If this is right

  • Estimated trees recover the topology of the ground-truth training tree in controlled experiments.
  • Weight differences between models highlight the layers that contribute most to distinguishing them (a sketch of one such ranking follows this list).
  • Phenotypic experiments identify that one training dataset supplies more useful information than the others.
  • Unsupervised trees can be built for black-box foundation models using only output text.
  • Visualizations clarify evolutionary relationships and lineage among the models.
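
The layer-importance bullet above can be made concrete by ranking layers by the average pairwise distance of their weights over all model pairs, in the spirit of Figure 3. The sketch below assumes L2 distance and an unweighted mean over pairs; the paper's exact aggregation is not reproduced here.

```python
# Sketch: rank layers by how strongly their weights separate the models on average.
# Assumes every model exposes the same layer names with matching shapes; L2 distance
# and the unweighted mean over model pairs are illustrative choices.
from itertools import combinations
import numpy as np

def layer_importance(models):
    """models: dict of model name -> {layer name: weight array}. Returns layers sorted by mean distance."""
    names = list(models)
    scores = {}
    for layer in models[names[0]]:
        dists = [np.linalg.norm(models[a][layer] - models[b][layer])
                 for a, b in combinations(names, 2)]
        scores[layer] = float(np.mean(dists))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```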

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same distance-based approach could trace how fine-tuning or continued pretraining shifts a model relative to its base version.
  • If the analogy holds, the method might help audit model provenance when training details are withheld.
  • Extending the genotype-phenotype framing to other architectures could allow similar lineage analysis for vision or multimodal models.

Load-bearing premise

The mapping of LLM weights to genotypes and generated text to phenotypes is close enough to biological data that standard phylogenetic methods will produce accurate and interpretable results about model lineage and training influences.

What would settle it

In the controlled experiment, if the estimated evolutionary tree topology fails to match the known order and branching of the training tree, the claim of reliable recovery would be disproved.
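
Since Robinson and Foulds [31] appears in the reference list, one natural way to operationalize "recovers the topology" is a Robinson-Foulds distance of zero between the estimated and ground-truth trees. Below is a minimal, self-contained sketch that compares unrooted topologies by their leaf bipartitions; the nested-tuple tree encoding is purely illustrative, not the paper's representation.

```python
# Sketch: Robinson-Foulds style topology check between an estimated tree and the
# known training tree. Trees are encoded as nested tuples of leaf-name strings,
# e.g. ("base", (("a1", "a2"), "b")) -- an illustrative encoding, not the paper's.

def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree, all_leaves=None):
    """Collect the non-trivial bipartitions implied by every internal edge."""
    if all_leaves is None:
        all_leaves = leaves(tree)
    out = set()
    if isinstance(tree, str):
        return out
    for child in tree:
        side = leaves(child)
        if 1 < len(side) < len(all_leaves) - 1:
            out.add(frozenset({side, all_leaves - side}))
        out |= splits(child, all_leaves)
    return out

def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in one tree but not the other (0 = same topology)."""
    return len(splits(tree_a) ^ splits(tree_b))

truth = ((("m1", "m2"), "m3"), ("m4", "m5"))
estimate = (("m3", ("m2", "m1")), ("m5", "m4"))
print(robinson_foulds(truth, estimate))  # 0: same unrooted topology
```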

Figures

Figures reproduced from arXiv: 2605.02930 by Chuck Loughin, Michele Sezgin, Ronald Yurko, Shannon K. Gallagher, Swati Rallapalli, Tyler Brooks.

Figure 1. Conceptual framework for evolutionary analysis of large language models.
Figure 2. Original training tree sequence from T5 models fine-tuned on 10 different summarization …
Figure 3. Most important layers based on average distance between all model pairs.
Figure 4. Left: PCA of embeddings for a single prompt from the arXiv summarization dataset …
Figure 5. Trees estimated from distances between output embeddings derived from different datasets.
Figure 6. PCA shown by dataset and prompt for each of the foundation models.
Original abstract

Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to better analyze and explain relationships among models. We show how relating weights to genotypes and output text to phenotypes can improve our understanding of model lineage, important datasets, the roles of different model layers, and visualization of model relationships. We demonstrate this in a controlled experiment, where our estimated evolutionary trees reliably recover the topology of the ground-truth training tree. We further identify the most important weight layers according to weight differences and show through phenotypic experiments that one training dataset appears to contribute more useful information than the others. Finally, we generate an unsupervised evolutionary tree of black-box foundation models. Throughout, we provide visualizations that support a clearer understanding of evolutionary relationships among LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends evolutionary and phylogenetic methods to LLMs by mapping weights to genotypes and generated text to phenotypes. It claims that in a controlled experiment with models derived along a known training hierarchy, the reconstructed evolutionary trees reliably recover the ground-truth topology. Additional results include identifying important weight layers from differences, phenotypic analysis showing one training dataset contributes more, and an unsupervised tree over black-box foundation models, all supported by visualizations for understanding model relationships and training influences.

Significance. If the topology recovery is demonstrated to be non-trivial rather than an artifact of weight distances on a fine-tuning tree, the work could offer a valuable new lens for analyzing LLM lineages, data influences, and layer roles using established biological tools. The visualizations and real-world application to foundation models provide practical utility for explainability in neural networks, though the overall impact hinges on showing that the evolutionary framing adds explanatory power beyond standard clustering.

major comments (2)
  1. [Abstract and controlled experiment section] Abstract and controlled experiment: the claim that estimated evolutionary trees 'reliably recover the topology of the ground-truth training tree' risks circularity. In a setup where models are forked and fine-tuned along a known hierarchy, weight-vector distances (genotypes) will be smaller for more closely related models by construction; standard distance-based methods (e.g., UPGMA or neighbor-joining) will therefore reconstruct the tree without any evolutionary model or phenotypic signal. The manuscript must specify the distance metric, reconstruction algorithm, and controls (such as phenotype-only reconstruction or comparison to direct hierarchical clustering) to establish that the result is non-trivial.
  2. [Methods / experimental setup] Experimental setup: details are missing on the precise distance metrics for weights and text, tree-building parameters (e.g., any evolutionary model assumptions or software used), and controls for confounders such as model size or architecture variations. These omissions make it impossible to evaluate whether the reported recovery depends on the biological analogy or would occur under any reasonable distance-based clustering.
minor comments (2)
  1. [Visualizations and notation] Clarify notation for genotype/phenotype mappings and ensure all visualizations include legends that explicitly link colors or branches to model lineages and training steps.
  2. [Abstract] The abstract would benefit from a brief quantitative statement of recovery accuracy (e.g., topological similarity metric) rather than the qualitative claim of 'reliable' recovery.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the need for greater methodological transparency. We address each major point below and will revise the manuscript to incorporate the requested specifications and controls.

Point-by-point responses
  1. Referee: [Abstract and controlled experiment section] Abstract and controlled experiment: the claim that estimated evolutionary trees 'reliably recover the topology of the ground-truth training tree' risks circularity. In a setup where models are forked and fine-tuned along a known hierarchy, weight-vector distances (genotypes) will be smaller for more closely related models by construction; standard distance-based methods (e.g., UPGMA or neighbor-joining) will therefore reconstruct the tree without any evolutionary model or phenotypic signal. The manuscript must specify the distance metric, reconstruction algorithm, and controls (such as phenotype-only reconstruction or comparison to direct hierarchical clustering) to establish that the result is non-trivial.

    Authors: We acknowledge this risk of circularity when using weight distances alone. Our approach integrates phenotypic signals from generated text to provide an independent measure of model relationships. In revision, we will explicitly detail the distance metric (Euclidean on flattened weight vectors for genotypes; cosine similarity on text embeddings for phenotypes), the reconstruction algorithm (neighbor-joining), and add controls: a phenotype-only tree and direct comparison to hierarchical clustering on the same distances. These will demonstrate the non-trivial contribution of the evolutionary framing. revision: yes

  2. Referee: [Methods / experimental setup] Experimental setup: details are missing on the precise distance metrics for weights and text, tree-building parameters (e.g., any evolutionary model assumptions or software used), and controls for confounders such as model size or architecture variations. These omissions make it impossible to evaluate whether the reported recovery depends on the biological analogy or would occur under any reasonable distance-based clustering.

    Authors: We agree these details were insufficiently specified. The revised Methods section will include: distance metrics (L2 norm for weights, cosine on sentence embeddings for text), tree-building parameters (neighbor-joining with no parametric evolutionary model, implemented in BioPython), and confounder controls (all controlled-experiment models share identical architecture and size, varying only by training data and steps). This will allow assessment of whether the phylogenetic approach adds value beyond standard clustering. revision: yes
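
The two controls promised in these responses, a phenotype-only distance matrix and a plain hierarchical-clustering baseline on the same distances, can be sketched as follows. Mean output embeddings per model, cosine distance, and average linkage are assumptions made for illustration, not the paper's documented configuration.

```python
# Sketch of the promised controls: a phenotype-only distance matrix (cosine distance
# between each model's mean output embedding) and a hierarchical-clustering baseline
# on those same distances. Average linkage and the toy embeddings are assumptions.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def phenotype_distances(embeddings_by_model):
    """embeddings_by_model: dict of model name -> (n_prompts, dim) array of output embeddings."""
    names = list(embeddings_by_model)
    reps = np.stack([embeddings_by_model[n].mean(axis=0) for n in names])
    reps /= np.linalg.norm(reps, axis=1, keepdims=True)
    dist = np.clip(1.0 - reps @ reps.T, 0.0, None)  # cosine distance
    np.fill_diagonal(dist, 0.0)
    return names, dist

def clustering_baseline(names, dist_matrix):
    """Hierarchical clustering on the same distances, for comparison with the NJ tree."""
    Z = linkage(squareform(dist_matrix, checks=False), method="average")
    return dendrogram(Z, labels=names, no_plot=True)["ivl"]  # leaf order

rng = np.random.default_rng(0)
toy = {name: rng.normal(size=(8, 16)) + shift
       for name, shift in [("base", 0.0), ("child_a", 0.1), ("child_b", 0.5)]}
names, D = phenotype_distances(toy)
print(clustering_baseline(names, D))
```

If a baseline like this already groups models by their training branches, the evolutionary framing has to justify itself on interpretability rather than raw recovery, which is exactly the referee's circularity concern.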

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper applies established phylogenetic methods (e.g., distance-based tree reconstruction) to LLM weights treated as genotypes and text as phenotypes in a controlled fine-tuning experiment. The reported recovery of the ground-truth training tree topology is a demonstration that the methods produce expected outputs on hierarchically related models, rather than a first-principles derivation or prediction that reduces to a fitted parameter or self-referential definition by the paper's own equations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the central result. The approach remains self-contained against external benchmarks for phylogenetic reconstruction, with the biological analogy serving interpretability rather than mathematical necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of the genotype-phenotype analogy for LLMs and the assumption that standard phylogenetic distance measures capture meaningful training relationships.

axioms (1)
  • domain assumption: The mapping of LLM weights to genotypes and generated text to phenotypes is a valid and useful analogy for phylogenetic analysis.
    This premise is invoked throughout the abstract to justify applying evolutionary methods to neural networks.

pith-pipeline@v0.9.0 · 5464 in / 1187 out tokens · 32852 ms · 2026-05-09T20:33:24.868614+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 21 canonical work pages · 6 internal anchors

  1. [1]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  2. [2]

    Uniparental inheritance of mitochondrial and chloroplast genes: mechanisms and evolution

    C. William Birky Jr. Uniparental inheritance of mitochondrial and chloroplast genes: mechanisms and evolution. Proceedings of the National Academy of Sciences, 92(25):11331–11338, 1995

  3. [3]

    Xai meets llms: A survey of the relation between explainable ai and large language models

    Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. Xai meets llms: A survey of the relation between explainable ai and large language models.arXiv preprint arXiv:2407.15248, 2024

  4. [4]

    DialogSum: A Real-Life Scenario Dialogue Summarization Dataset

    Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. DialogSum: A real-life scenario dialogue summarization dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 5062–5074, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.449. URL https://aclanthology.org/2021.findings-acl.449

  5. [5]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132

  6. [6]

    Subliminal learning: Language models transmit behavioral traits via hidden signals in data.arXiv preprint arXiv:2507.14805,

    Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data.arXiv preprint arXiv:2507.14805, 2025

  7. [7]

    A discourse-aware attention model for abstractive summarization of long documents

    Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers...

  8. [8]

    Assessing llms for high stakes applications

    Shannon K Gallagher, Jasmine Ratchford, Tyler Brooks, Bryan P Brown, Eric Heim, William R Nichols, Scott Mcmillan, Swati Rallapalli, Carol J Smith, Nathan VanHoudnos, et al. Assessing llms for high stakes applications. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pages 103–105, 2024

  9. [9]

    SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China, November 2019. Association for Computational Linguistics

  10. [10]

    doi: 10.18653/v1/D19-5409

    Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://www.aclweb.org/anthology/D19-5409

  11. [11]

    Phylogenetic scale in ecology and evolution.Global Ecology and Biogeography, 27(2):175–187, 2018

    Catherine H Graham, David Storch, and Antonin Machac. Phylogenetic scale in ecology and evolution.Global Ecology and Biogeography, 27(2):175–187, 2018

  12. [12]

    Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies.arXiv preprint arXiv:1804.11283, 2018

    Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies.arXiv preprint arXiv:1804.11283, 2018

  13. [13]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

  14. [14]

    Teaching machines to read and comprehend

    Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. volume 28, 2015

  15. [15]

    Unsupervised model tree heritage recovery

    Eliahu Horwitz, Asaf Shul, and Yedid Hoshen. Unsupervised model tree heritage recovery. arXiv preprint arXiv:2405.18432, 2024

  16. [16]

    Efficient attentions for long document summarization, 2021

    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization, 2021

  17. [17]

    Dynabench: Rethinking Benchmarking in NLP

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337, 2021

  18. [18]

    BillSum: A corpus for automatic summarization of US legislation

    Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1...

  19. [19]

    Summac: Re-visiting nli-based models for inconsistency detection in summarization.Transactions of the Association for Computational Linguistics, 10:163–177, 2022

    Philippe Laban, Tobias Schnabel, Paul N Bennett, and Marti A Hearst. Summac: Re-visiting nli-based models for inconsistency detection in summarization.Transactions of the Association for Computational Linguistics, 10:163–177, 2022

  20. [20]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods, 2024. URL https://arxiv.org/abs/2412.05579

  21. [21]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal ...

  22. [22]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  23. [23]

    Automatic evaluation of summaries using n-gram co-occurrence statistics

    Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, pages 150–157, 2003

  24. [24]

    Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

  25. [25]

    Neural networks–state of art, brief history, basic models and architecture

    Bohdan Macukow. Neural networks–state of art, brief history, basic models and architecture. In IFIP international conference on computer information systems and industrial management, pages 3–14. Springer, 2016

  26. [26]

    Evolutionary algorithms and neural networks.Studies in computational intelligence, 780(1):43–53, 2019

    Seyedali Mirjalili. Evolutionary algorithms and neural networks.Studies in computational intelligence, 780(1):43–53, 2019

  27. [27]

    Phylogenetic networks: a review of methods to display evolutionary history

    David A Morrison. Phylogenetic networks: a review of methods to display evolutionary history. Annual Research & Review in Biology, 4(10):1518, 2014

  28. [28]

    Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization.ArXiv, abs/1808.08745, 2018

  29. [29]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  30. [30]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html

  31. [31]

    Comparison of phylogenetic trees.Mathematical biosciences, 53(1-2):131–147, 1981

    David F Robinson and Leslie R Foulds. Comparison of phylogenetic trees.Mathematical biosciences, 53(1-2):131–147, 1981

  32. [32]

    DebateSum: A large-scale argument mining and summarization dataset

    Allen Roush and Arvind Balaji. DebateSum: A large-scale argument mining and summarization dataset. In Elena Cabrio and Serena Villata, editors, Proceedings of the 7th Workshop on Argument Mining, pages 1–7, Online, December 2020. Association for Computational Linguistics. URL...

  33. [33]

    The neighbor-joining method: a new method for reconstructing phylogenetic trees

    N Saitou and M Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 07 1987. ISSN 0737-4038. doi: 10.1093/oxfordjournals.molbev.a040454. URL https://doi.org/10.1093/oxfordjournals.molbev.a040454

  34. [34]

    Get to the Point: Summarization with Pointer-Generator Networks

    Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. URL h...

  35. [35]

    BIGPATENT: A large-scale dataset for abstractive and coherent summarization.CoRR, abs/1906.03741, 2019

    Eva Sharma, Chen Li, and Lu Wang. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. CoRR, abs/1906.03741, 2019. URL http://arxiv.org/abs/1906.03741

  36. [36]

    Distribution of covid-19 and phylogenetic tree construction of sars-cov-2 in indonesia

    Dora Dayu Rahma Turista, Aesthetica Islamy, Viol Dhea Kharisma, and Arif Nur Muhammad Ansori. Distribution of covid-19 and phylogenetic tree construction of sars-cov-2 in indonesia. J Pure Appl Microbiol, 14(suppl 1):1035–42, 2020

  37. [37]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  38. [38]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461

  39. [39]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574

  40. [40]

    Confidence sets for phylogenetic trees.Journal of the American Statistical Association, 114(525):235–244, 2019

    Amy Willis. Confidence sets for phylogenetic trees.Journal of the American Statistical Association, 114(525):235–244, 2019

  41. [41]

    Usable xai: 10 strategies towards exploiting explainability in the llm era, 2025

    Xuansheng Wu, Haiyan Zhao, Yaochen Zhu, Yucheng Shi, Fan Yang, Lijie Hu, Tianming Liu, Xiaoming Zhai, Wenlin Yao, Jundong Li, Mengnan Du, and Ninghao Liu. Usable xai: 10 strategies towards exploiting explainability in the llm era, 2025. URL https://arxiv.org/abs/2403.08946

  42. [42]

    Harnessing the power of llms in practice: A survey on chatgpt and beyond

    Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. 2023

  43. [43]

    Phylolm: Inferring the phylogeny of large language models and predicting their performances in benchmarks.arXiv preprint arXiv:2404.04671, 2024

    Nicolas Yax, Pierre-Yves Oudeyer, and Stefano Palminteri. Phylolm: Inferring the phylogeny of large language models and predicting their performances in benchmarks.arXiv preprint arXiv:2404.04671, 2024

  44. [44]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019