Analysis and Explainability of LLMs Via Evolutionary Methods
Pith reviewed 2026-05-09 20:33 UTC · model grok-4.3
The pith
Phylogenetic trees built from LLM weights and text outputs recover the structure of their training histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Relating LLM weights to genotypes and output text to phenotypes allows construction of evolutionary trees that reliably recover the topology of the ground-truth training tree in controlled experiments. Weight-difference analysis identifies the most important layers, phenotypic experiments show that one training dataset contributes more useful information than the others, and an unsupervised evolutionary tree is generated for black-box foundation models, all supported by visualizations of model relationships.
What carries the argument
Phylogenetic tree estimation from distance matrices computed on weight differences (as genotype distances) and on generated text outputs (as phenotype distances).
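The two distance matrices this describes can be sketched in a few lines. The metrics below (L2 on flattened weight vectors for genotypes, cosine distance on text embeddings for phenotypes) are one plausible instantiation for illustration, not necessarily the paper's exact choices:

```python
import numpy as np

def genotype_distances(weight_vectors):
    """Pairwise L2 distances between models' flattened weight vectors."""
    W = np.asarray(weight_vectors, dtype=float)
    diff = W[:, None, :] - W[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def phenotype_distances(text_embeddings):
    """Pairwise cosine distances between embeddings of generated text."""
    E = np.asarray(text_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return 1.0 - E @ E.T

# Toy genotypes: two sibling models drift slightly from a shared ancestor,
# while a third model drifts much further away.
ancestor = np.zeros(4)
models = [ancestor + 0.10, ancestor + 0.12, ancestor + 1.0]
D = genotype_distances(models)  # D[0, 1] is small; D[0, 2] is large
```

Either matrix can then be handed to a distance-based reconstruction method such as neighbor-joining, which is where the phylogenetic machinery takes over.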
If this is right
- Estimated trees recover the topology of the ground-truth training tree in controlled experiments.
- Weight differences between models highlight the layers that contribute most to distinguishing them.
- Phenotypic experiments identify that one training dataset supplies more useful information than the others.
- Unsupervised trees can be built for black-box foundation models using only output text.
- Visualizations clarify evolutionary relationships and lineage among the models.
Where Pith is reading between the lines
- The same distance-based approach could trace how fine-tuning or continued pretraining shifts a model relative to its base version.
- If the analogy holds, the method might help audit model provenance when training details are withheld.
- Extending the genotype-phenotype framing to other architectures could allow similar lineage analysis for vision or multimodal models.
Load-bearing premise
The mapping of LLM weights to genotypes and generated text to phenotypes is close enough to biological data that standard phylogenetic methods will produce accurate and interpretable results about model lineage and training influences.
What would settle it
In the controlled experiment, if the estimated evolutionary tree topology fails to match the known order and branching of the training tree, the claim of reliable recovery would be disproved.
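One way to make that check quantitative is a Robinson-Foulds-style count of clades present in one tree but not the other, where zero means the topologies agree. A minimal sketch for rooted trees encoded as nested tuples, offered as an illustration rather than the paper's stated metric:

```python
def leaf_sets(tree, out):
    """Return the leaf set of `tree`; record the set of every internal node in `out`."""
    if not isinstance(tree, tuple):            # a leaf, named by a string
        return frozenset([tree])
    s = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(s)
    return s

def rf_distance(t1, t2):
    """Count clades found in exactly one of the two rooted trees."""
    c1, c2 = set(), set()
    leaf_sets(t1, c1)
    leaf_sets(t2, c2)
    return len(c1 ^ c2)

truth = (("A", "B"), ("C", "D"))       # ground-truth training tree
recovered = (("B", "A"), ("D", "C"))   # same topology, different ordering
wrong = (("A", "C"), ("B", "D"))       # siblings swapped
```

A positive `rf_distance` against the known training tree would falsify the recovery claim; reporting the value would also address the referee's request for a quantitative accuracy statement.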
read the original abstract
Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to better analyze and explain relationships among models. We show how relating weights to genotypes and output text to phenotypes can improve our understanding of model lineage, important datasets, the roles of different model layers, and visualization of model relationships. We demonstrate this in a controlled experiment, where our estimated evolutionary trees reliably recover the topology of the ground-truth training tree. We further identify the most important weight layers according to weight differences and show through phenotypic experiments that one training dataset appears to contribute more useful information than the others. Finally, we generate an unsupervised evolutionary tree of black-box foundation models. Throughout, we provide visualizations that support a clearer understanding of evolutionary relationships among LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends evolutionary and phylogenetic methods to LLMs by mapping weights to genotypes and generated text to phenotypes. It claims that in a controlled experiment with models derived along a known training hierarchy, the reconstructed evolutionary trees reliably recover the ground-truth topology. Additional results include identifying important weight layers from differences, phenotypic analysis showing one training dataset contributes more, and an unsupervised tree over black-box foundation models, all supported by visualizations for understanding model relationships and training influences.
Significance. If the topology recovery is demonstrated to be non-trivial rather than an artifact of weight distances on a fine-tuning tree, the work could offer a valuable new lens for analyzing LLM lineages, data influences, and layer roles using established biological tools. The visualizations and real-world application to foundation models provide practical utility for explainability in neural networks, though the overall impact hinges on showing that the evolutionary framing adds explanatory power beyond standard clustering.
major comments (2)
- [Abstract and controlled experiment section] Abstract and controlled experiment: the claim that estimated evolutionary trees 'reliably recover the topology of the ground-truth training tree' risks circularity. In a setup where models are forked and fine-tuned along a known hierarchy, weight-vector distances (genotypes) will be smaller for more closely related models by construction; standard distance-based methods (e.g., UPGMA or neighbor-joining) will therefore reconstruct the tree without any evolutionary model or phenotypic signal. The manuscript must specify the distance metric, reconstruction algorithm, and controls (such as phenotype-only reconstruction or comparison to direct hierarchical clustering) to establish that the result is non-trivial.
- [Methods / experimental setup] Experimental setup: details are missing on the precise distance metrics for weights and text, tree-building parameters (e.g., any evolutionary model assumptions or software used), and controls for confounders such as model size or architecture variations. These omissions make it impossible to evaluate whether the reported recovery depends on the biological analogy or would occur under any reasonable distance-based clustering.
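The construction-bias worry above is easy to demonstrate: if fine-tuning is modeled as random drift in weight space, sibling models end up closer than cousins by design, so any distance-based clustering recovers the tree with no evolutionary assumptions at all. A hypothetical simulation (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def fine_tune(parent, step):
    """Model a fine-tuning run as random drift of the weight vector."""
    return parent + step * rng.standard_normal(parent.shape)

# Known training tree: root -> two parents -> two children each.
root = rng.standard_normal(64)
parents = [fine_tune(root, step=1.0) for _ in range(2)]
leaves = [fine_tune(p, step=0.1) for p in parents for _ in range(2)]

# Pairwise L2 distances between the four leaf models.
D = np.array([[np.linalg.norm(a - b) for b in leaves] for a in leaves])

# Leaves 0,1 share a parent, as do leaves 2,3: siblings are closer than
# cousins by construction, so plain clustering on D yields the tree.
sibling_gap = D[0, 1]
cousin_gap = min(D[0, 2], D[0, 3])
```

This is exactly why the paper needs the requested controls: the interesting question is whether the phenotype-only tree, built without weight access, shows the same recovery.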
minor comments (2)
- [Visualizations and notation] Clarify notation for genotype/phenotype mappings and ensure all visualizations include legends that explicitly link colors or branches to model lineages and training steps.
- [Abstract] The abstract would benefit from a brief quantitative statement of recovery accuracy (e.g., topological similarity metric) rather than the qualitative claim of 'reliable' recovery.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the need for greater methodological transparency. We address each major point below and will revise the manuscript to incorporate the requested specifications and controls.
read point-by-point responses
-
Referee: [Abstract and controlled experiment section] Abstract and controlled experiment: the claim that estimated evolutionary trees 'reliably recover the topology of the ground-truth training tree' risks circularity. In a setup where models are forked and fine-tuned along a known hierarchy, weight-vector distances (genotypes) will be smaller for more closely related models by construction; standard distance-based methods (e.g., UPGMA or neighbor-joining) will therefore reconstruct the tree without any evolutionary model or phenotypic signal. The manuscript must specify the distance metric, reconstruction algorithm, and controls (such as phenotype-only reconstruction or comparison to direct hierarchical clustering) to establish that the result is non-trivial.
Authors: We acknowledge this risk of circularity when using weight distances alone. Our approach integrates phenotypic signals from generated text to provide an independent measure of model relationships. In revision, we will explicitly detail the distance metric (Euclidean on flattened weight vectors for genotypes; cosine similarity on text embeddings for phenotypes), the reconstruction algorithm (neighbor-joining), and add controls: a phenotype-only tree and direct comparison to hierarchical clustering on the same distances. These will demonstrate the non-trivial contribution of the evolutionary framing. Revision: yes
-
Referee: [Methods / experimental setup] Experimental setup: details are missing on the precise distance metrics for weights and text, tree-building parameters (e.g., any evolutionary model assumptions or software used), and controls for confounders such as model size or architecture variations. These omissions make it impossible to evaluate whether the reported recovery depends on the biological analogy or would occur under any reasonable distance-based clustering.
Authors: We agree these details were insufficiently specified. The revised Methods section will include: distance metrics (L2 norm for weights, cosine on sentence embeddings for text), tree-building parameters (neighbor-joining with no parametric evolutionary model, implemented in BioPython), and confounder controls (all controlled-experiment models share identical architecture and size, varying only by training data and steps). This will allow assessment of whether the phylogenetic approach adds value beyond standard clustering. Revision: yes
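Biopython's `Bio.Phylo.TreeConstruction` module does ship a neighbor-joining constructor (`DistanceTreeConstructor` with the `nj` method). For readers without the dependency, the algorithm the authors name reduces to a short loop; a library-free sketch returning the topology as nested tuples:

```python
def neighbor_joining(labels, matrix):
    """Minimal neighbor-joining (Saitou & Nei, 1987): repeatedly join the
    pair minimizing the Q criterion. `matrix` is a full symmetric distance
    matrix as a list of lists, indexed in the same order as `labels`."""
    nodes = list(labels)
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(labels) for j, b in enumerate(labels)}
    while len(nodes) > 2:
        n = len(nodes)
        row_sum = {a: sum(d[(a, b)] for b in nodes if b != a) for a in nodes}
        # Q(i, j) = (n - 2) * d(i, j) - rowsum(i) - rowsum(j)
        i, j = min(
            ((a, b) for a in nodes for b in nodes if a != b),
            key=lambda p: (n - 2) * d[p] - row_sum[p[0]] - row_sum[p[1]],
        )
        joined = (i, j)
        for k in nodes:
            if k != i and k != j:
                dk = 0.5 * (d[(i, k)] + d[(j, k)] - d[(i, j)])
                d[(joined, k)] = d[(k, joined)] = dk
        nodes = [k for k in nodes if k != i and k != j] + [joined]
    return tuple(nodes)

# Additive distances generated by the tree ((A,B),(C,D)): siblings at
# distance 2, cross-pairs at distance 4.
M = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
tree = neighbor_joining(["A", "B", "C", "D"], M)
```

On exactly additive distances, neighbor-joining provably recovers the generating topology; on noisy weight or text distances the output is only an estimate, which is why a topology-agreement metric against the ground-truth training tree is needed.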
Circularity Check
No significant circularity in derivation chain
full rationale
The paper applies established phylogenetic methods (e.g., distance-based tree reconstruction) to LLM weights treated as genotypes and text as phenotypes in a controlled fine-tuning experiment. The reported recovery of the ground-truth training tree topology is a demonstration that the methods produce expected outputs on hierarchically related models, rather than a first-principles derivation or prediction that reduces to a fitted parameter or self-referential definition by the paper's own equations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the central result. The approach is validated against external benchmarks for phylogenetic reconstruction, with the biological analogy serving interpretability rather than mathematical necessity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the mapping of LLM weights to genotypes and generated text to phenotypes is a valid and useful analogy for phylogenetic analysis
Reference graph
Works this paper leans on
- [1] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
- [2] C. William Birky Jr. Uniparental inheritance of mitochondrial and chloroplast genes: mechanisms and evolution. Proceedings of the National Academy of Sciences, 92(25):11331–11338, 1995.
- [3] Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, and Andrea Seveso. XAI meets LLMs: A survey of the relation between explainable AI and large language models. arXiv preprint arXiv:2407.15248, 2024.
- [4] Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. DialogSum: A real-life scenario dialogue summarization dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 5062–5074, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.449. URL https://aclanthology.org/2021.findings-acl.449.
- [5] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference, 2024. URL https://arxiv.org/abs/2403.04132.
- [6] Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805, 2025.
- [7] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
- [8] Shannon K. Gallagher, Jasmine Ratchford, Tyler Brooks, Bryan P. Brown, Eric Heim, William R. Nichols, Scott McMillan, Swati Rallapalli, Carol J. Smith, Nathan VanHoudnos, et al. Assessing LLMs for high stakes applications. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pages 103–105, 2024.
- [9] Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://www.aclweb.org/anthology/D19-5409.
- [11] Catherine H. Graham, David Storch, and Antonin Machac. Phylogenetic scale in ecology and evolution. Global Ecology and Biogeography, 27(2):175–187, 2018.
- [12] Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283, 2018.
- [13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300.
- [14] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, 2015.
- [15] Eliahu Horwitz, Asaf Shul, and Yedid Hoshen. Unsupervised model tree heritage recovery. arXiv preprint arXiv:2405.18432, 2024.
- [16] Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization, 2021.
- [17] Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337, 2021.
- [18] Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China, November 2019. Association for Computational Linguistics.
- [19] Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022.
- [20] Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods, 2024. URL https://arxiv.org/abs/2412.05579.
- [21] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, et al. Holistic evaluation of language models, 2023.
- [22] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [23] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157, 2003.
- [24] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, et al. On the biology of a large language model. Transformer Circuits Thread, 2025.
- [25] Bohdan Macukow. Neural networks – state of art, brief history, basic models and architecture. In IFIP International Conference on Computer Information Systems and Industrial Management, pages 3–14. Springer, 2016.
- [26] Seyedali Mirjalili. Evolutionary algorithms and neural networks. Studies in Computational Intelligence, 780(1):43–53, 2019.
- [27] David A. Morrison. Phylogenetic networks: a review of methods to display evolutionary history. Annual Research & Review in Biology, 4(10):1518, 2014.
- [28] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
- [29] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- [30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- [31] David F. Robinson and Leslie R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981.
- [32] Allen Roush and Arvind Balaji. DebateSum: A large-scale argument mining and summarization dataset. In Elena Cabrio and Serena Villata, editors, Proceedings of the 7th Workshop on Argument Mining, pages 1–7, Online, December 2020. Association for Computational Linguistics.
- [33] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987. doi: 10.1093/oxfordjournals.molbev.a040454.
- [34] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. doi: 10.18653/v1/P17-1099.
- [35] Eva Sharma, Chen Li, and Lu Wang. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. CoRR, abs/1906.03741, 2019. URL http://arxiv.org/abs/1906.03741.
- [36] Dora Dayu Rahma Turista, Aesthetica Islamy, Viol Dhea Kharisma, and Arif Nur Muhammad Ansori. Distribution of COVID-19 and phylogenetic tree construction of SARS-CoV-2 in Indonesia. J Pure Appl Microbiol, 14(suppl 1):1035–42, 2020.
- [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [38] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.
- [39] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574.
- [40] Amy Willis. Confidence sets for phylogenetic trees. Journal of the American Statistical Association, 114(525):235–244, 2019.
- [41] Xuansheng Wu, Haiyan Zhao, Yaochen Zhu, Yucheng Shi, Fan Yang, Lijie Hu, Tianming Liu, Xiaoming Zhai, Wenlin Yao, Jundong Li, Mengnan Du, and Ninghao Liu. Usable XAI: 10 strategies towards exploiting explainability in the LLM era, 2025. URL https://arxiv.org/abs/2403.08946.
- [42] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond, 2023.
- [43] Nicolas Yax, Pierre-Yves Oudeyer, and Stefano Palminteri. PhyloLM: Inferring the phylogeny of large language models and predicting their performances in benchmarks. arXiv preprint arXiv:2404.04671, 2024.
- [44] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.