Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-14 21:45 UTC · model grok-4.3
The pith
Layer-wise dynamics in language models reveal performance signals beyond final representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying LRD to 31 models on 30 MTEB tasks reveals architectural and task-level differences that are not apparent from final-layer representations alone. Model-level scores correlate positively with downstream MTEB performance, with end-to-end subspace displacement the strongest predictor. For inference-time pruning, GFMI is the only measurement-guided rule that beats random selection at the 15 percent and 20 percent budgets and shows the best median change at every budget tested.
What carries the argument
Layer-wise Representation Dynamics (LRD), a framework using Frenet measurements for subspace speed and curvature, Neighborhood Retention Score for local neighbor preservation, and Graph Filtration Mutual Information for final-layer alignment.
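To make the Frenet family concrete, here is a minimal sketch, assuming the layer-l subspace is the span of the top-k right singular vectors of the centered token-representation matrix and that speed and end-to-end displacement are Grassmann geodesic distances built from principal angles. The function names, k=16, and these conventions are assumptions; the paper's exact definitions (its Appendix A) may differ.

```python
import numpy as np
from scipy.linalg import subspace_angles

def layer_subspace(H, k=16):
    """Orthonormal basis (d x k) of the top-k right singular subspace of a centered
    (n_tokens x d) hidden-state matrix; requires k <= min(n_tokens, d)."""
    Hc = H - H.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:k].T

def grassmann_distance(U, V):
    """Geodesic distance on the Grassmann manifold: square root of the sum of
    squared principal angles between the two column spaces."""
    theta = subspace_angles(U, V)
    return float(np.sqrt(np.sum(theta ** 2)))

def frenet_speed_and_displacement(hidden_states, k=16):
    """hidden_states: list of (n_tokens x d) matrices for layers 0..L.
    Returns per-step Grassmann speeds d_{l,l+1} and end-to-end displacement d_{0,L}."""
    bases = [layer_subspace(H, k) for H in hidden_states]
    speeds = [grassmann_distance(bases[l], bases[l + 1]) for l in range(len(bases) - 1)]
    d_0L = grassmann_distance(bases[0], bases[-1])
    return speeds, d_0L
```

Curvature in the Frenet sense would then be a second difference of this motion (how the step direction turns from layer to layer); the sketch stops at speed and displacement since those are the quantities the review leans on.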
If this is right
- End-to-end subspace displacement serves as the strongest single predictor of downstream MTEB performance for label-free model ranking.
- GFMI-based layer selection outperforms random pruning at moderate compute budgets, while other LRD scores do not transfer as reliably (see the budgeted-selection sketch after this list).
- Encoder-based and decoder-based models exhibit distinct layer-wise motion patterns that final embeddings obscure.
- Task categories such as retrieval versus classification show different rates of representation change across layers.
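The following is a hypothetical illustration of how a measurement-guided pruning rule could use per-layer GFMI scores at a fixed budget: drop the layers least aligned with the final layer until the budget is met. The function name, the greedy drop-lowest-score heuristic, and the never-drop-the-final-layer guard are assumptions, not the paper's algorithm.

```python
import numpy as np

def budgeted_layer_drop(gfmi_per_layer, budget=0.15):
    """Given per-layer GFMI scores (alignment with the final layer), return the sorted
    indices of layers to drop under a compute budget, removing the least-aligned
    layers first while always keeping the final layer."""
    scores = np.asarray(gfmi_per_layer, dtype=float)
    n_layers = len(scores)
    n_drop = int(round(budget * n_layers))
    final = n_layers - 1
    candidates = [l for l in np.argsort(scores) if l != final]  # lowest GFMI first
    return sorted(candidates[:n_drop])

# Example: 24 layers, 15% budget (about 4 layers dropped).
rng = np.random.default_rng(0)
print(budgeted_layer_drop(rng.uniform(size=24), budget=0.15))
```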
Where Pith is reading between the lines
- If the correlations generalize, LRD scores could be used to design training objectives that encourage or suppress specific layer-wise behaviors.
- The same measurements might help diagnose why certain models perform well on narrow tasks but degrade on others by pinpointing the layers where useful structure is lost.
- Extending the framework to very large models could test whether pruning guided by GFMI scales to reduce inference cost without retraining.
Load-bearing premise
The three proposed measurements capture dynamics that are causally relevant to downstream performance rather than merely correlated on the tested set of models and tasks.
What would settle it
Re-running the model-selection and pruning experiments on a fresh collection of models and tasks where the LRD scores no longer correlate with performance or fail to beat random pruning would falsify the claim that layer-wise structure supplies useful deployment signals.
Original abstract
Hidden states change substantially across the layers of modern language models, but most layer-wise analyses focus on one aspect of that change. We propose Layer-wise Representation Dynamics (LRD), a framework with three layer-wise measurement families: Frenet (Grassmann speed and curvature) for global subspace motion, Neighborhood Retention Score (NRS) for local nearest-neighbor retention, and Graph Filtration Mutual Information (GFMI) for alignment with the final layer. Applying LRD to 31 models (encoder-based and decoder-based embedders, plus base LLMs) on 30 MTEB tasks reveals architectural and task-level differences that are not apparent from final-layer representations alone. We then use LRD for two applications: label-free model selection and inference-time layer pruning. For selection, all three model-level scores correlate positively with downstream MTEB performance, with end-to-end subspace displacement (d_{0,L}) the strongest, and the same direction holds on a smaller base-LLM MMLU panel. For pruning, GFMI is the only measurement-guided rule that beats Random at the 15% and 20% budgets and has the best median change at every budget. Frenet is effective only at the lightest budget, while NRS does not transfer from model selection to pruning. These results show that layer-wise structure provides signal for both interpretation and deployment decisions.
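Since the abstract names NRS and GFMI without formulas, here is a hedged sketch of one way each could be computed from two layers' hidden states. The k-NN construction for NRS, the epsilon-ball filtration at pairwise-distance quantiles for GFMI, and the use of connected-component labels in the mutual-information step are assumptions rather than the paper's definitions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist
from sklearn.metrics import mutual_info_score
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

def neighborhood_retention(H_a, H_b, k=10):
    """Average fraction of each point's k nearest neighbors at layer A that remain
    among its k nearest neighbors at layer B (a plausible reading of NRS)."""
    A = kneighbors_graph(H_a, k, mode="connectivity").toarray().astype(bool)
    B = kneighbors_graph(H_b, k, mode="connectivity").toarray().astype(bool)
    return float((A & B).sum(axis=1).mean() / k)

def graph_filtration_mi(H_layer, H_final, quantiles=(0.05, 0.10, 0.25)):
    """Mutual information between connected-component labelings of epsilon-ball graphs
    built at a layer and at the final layer, averaged over a filtration of radii set
    at pairwise-distance quantiles (a plausible reading of GFMI)."""
    mis = []
    for q in quantiles:
        r_layer = np.quantile(pdist(H_layer), q)
        r_final = np.quantile(pdist(H_final), q)
        _, la = connected_components(radius_neighbors_graph(H_layer, r_layer), directed=False)
        _, lb = connected_components(radius_neighbors_graph(H_final, r_final), directed=False)
        mis.append(mutual_info_score(la, lb))
    return float(np.mean(mis))
```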
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Layer-wise Representation Dynamics (LRD) framework with three measurement families: Frenet for global subspace motion using Grassmann speed and curvature, Neighborhood Retention Score (NRS) for local nearest-neighbor retention, and Graph Filtration Mutual Information (GFMI) for alignment with the final layer. Applying LRD to 31 models on 30 MTEB tasks reveals architectural and task-level differences not apparent from final-layer representations. The framework is applied to label-free model selection, where all three scores correlate positively with MTEB performance (d_{0,L} strongest), and to inference-time layer pruning, where GFMI outperforms random at 15% and 20% budgets.
Significance. If the LRD measurements capture dynamics independent of scale and architecture, the work could supply useful tools for interpreting LLM internals and guiding deployment choices such as label-free selection and pruning. The breadth of the evaluation, spanning 31 models and two applications, is a strength, though the empirical correlations require stronger controls to support the claimed utility.
Major comments (2)
- [Model Selection Application] The positive correlations between the three LRD scores (including d_{0,L}) and MTEB performance are reported without regression controls or matching for model family, parameter count, or embedding dimension. The 31 models comprise three distinct families that differ systematically in both average performance and representation evolution, so the correlations may be driven by these confounders rather than by independent signal from the proposed layer-wise measurements.
- [Pruning Experiments] GFMI is stated to beat random pruning at the 15% and 20% budgets with the best median change at every budget, yet no error bars, number of runs, task-level variance, or statistical significance tests are provided. This omission prevents assessment of whether the reported improvements are robust or merely within noise.
Minor comments (2)
- [Abstract] The term 'end-to-end subspace displacement (d_{0,L})' is used without a definition or equation reference, reducing immediate clarity for readers unfamiliar with the Frenet family.
- [Methodology] The exact algorithmic definitions and hyperparameters for computing Frenet, NRS, and GFMI are not fully detailed in the text; including them would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and describe the revisions that will be incorporated.
Point-by-point responses
- Referee: [Model Selection Application] The positive correlations between the three LRD scores (including d_{0,L}) and MTEB performance are reported without regression controls or matching for model family, parameter count, or embedding dimension. The 31 models comprise three distinct families that differ systematically in both average performance and representation evolution, so the correlations may be driven by these confounders rather than by independent signal from the proposed layer-wise measurements.
Authors: We agree that the reported correlations would be strengthened by explicit controls for model family, parameter count, and embedding dimension. In the revised manuscript we will add a multiple linear regression that includes dummy variables for the three model families, log(parameter count), and embedding dimension as covariates. We will report the partial coefficients and p-values for each LRD score (with particular emphasis on d_{0,L}). In addition, we will include within-family correlation tables to demonstrate that the positive relationship is not solely an artifact of cross-family differences. These controls will clarify the independent contribution of the layer-wise measurements. Revision: yes.
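As a companion to the promised revision, a minimal sketch of the control regression, assuming a per-model table with hypothetical columns mteb_score, d0L, family, n_params, and embed_dim; the file name and column names are placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-model table: one row per model with its MTEB score, LRD score,
# family label, parameter count, and embedding dimension.
df = pd.read_csv("model_level_scores.csv")

# OLS with family dummies, log parameter count, and embedding dimension as covariates,
# so the coefficient on d0L is a partial effect rather than a raw correlation.
fit = smf.ols("mteb_score ~ d0L + C(family) + np.log(n_params) + embed_dim", data=df).fit()
print(fit.summary())
print("partial coefficient for d0L:", fit.params["d0L"], "p =", fit.pvalues["d0L"])
```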
- Referee: [Pruning Experiments] GFMI is stated to beat random pruning at the 15% and 20% budgets with the best median change at every budget, yet no error bars, number of runs, task-level variance, or statistical significance tests are provided. This omission prevents assessment of whether the reported improvements are robust or merely within noise.
Authors: We acknowledge that the pruning results lack the statistical detail needed to evaluate robustness. In the revision we will report: (i) mean and standard error across five independent runs that differ only in the random seed for layer selection; (ii) per-task performance deltas with inter-task variance; and (iii) p-values from paired Wilcoxon signed-rank tests comparing each guided rule against the random baseline at every budget. These additions will allow readers to judge whether the observed median improvements exceed noise. Revision: yes.
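A sketch of the promised significance check, assuming per-task score deltas (pruned minus unpruned) are available as paired arrays for the GFMI rule and the random baseline at one budget; the arrays below are placeholder data, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_tasks = 30
# Placeholder per-task deltas, paired by task; in practice these come from the
# pruning runs at a given budget.
delta_gfmi = rng.normal(loc=-0.01, scale=0.02, size=n_tasks)
delta_random = rng.normal(loc=-0.03, scale=0.02, size=n_tasks)

# Paired two-sided Wilcoxon signed-rank test on the per-task differences.
stat, p_value = wilcoxon(delta_gfmi, delta_random)
print(f"median GFMI delta = {np.median(delta_gfmi):.3f}, "
      f"median Random delta = {np.median(delta_random):.3f}, p = {p_value:.4f}")
```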
Circularity Check
No significant circularity detected
Full rationale
The paper defines three independent measurement families (Frenet subspace dynamics, Neighborhood Retention Score, and Graph Filtration Mutual Information) via explicit geometric and information-theoretic constructions on hidden-state matrices. These are computed directly from layer activations without reference to downstream MTEB or MMLU labels. Observed positive correlations between the resulting model-level aggregates (e.g., end-to-end displacement d_{0,L}) and task performance are reported as empirical findings across 31 models, not as quantities obtained by fitting parameters to the target metrics. No equations reduce the reported scores to fitted inputs by construction, no self-citations supply load-bearing uniqueness theorems, and no ansatzes are smuggled via prior work. The model-selection and pruning applications are post-hoc uses of the pre-defined measurements rather than derivations that collapse to the inputs.
Axiom & Free-Parameter Ledger
Invented entities (3)
- Frenet (Grassmann speed and curvature): no independent evidence
- Neighborhood Retention Score (NRS): no independent evidence
- Graph Filtration Mutual Information (GFMI): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (relevance unclear). Matched text: "We propose Layer-wise Representation Dynamics (LRD), a framework with three layer-wise measurement families: Frenet (Grassmann speed and curvature) for global subspace motion, Neighborhood Retention Score (NRS) for local nearest-neighbor retention, and Graph Filtration Mutual Information (GFMI) for alignment with the final layer."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (relevance unclear). Matched text: "Applying LRD to 31 models on 30 MTEB tasks reveals architectural and task-level differences..."