arxiv: 2605.16608 · v2 · pith:XRQACSIDnew · submitted 2026-05-15 · 💻 cs.LG · cs.CL

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios

Pith reviewed 2026-06-30 19:01 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: text embeddings · Matryoshka Representation Learning · truncation · embedding reduction · downstream tasks · model comparison

The pith

Text embeddings from standard models remain competitive after truncation unless size drops by 80 percent or more, often matching or beating Matryoshka-trained versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Matryoshka Representation Learning trains text encoders so their vectors stay useful when simply cut to smaller sizes. The paper applies the same truncation schedule to encoders trained both with and without MRL and measures performance on downstream tasks. Results show that non-MRL embeddings perform as well as or better than MRL ones at all but the most extreme reductions. Only when vectors are shortened by 80 percent or greater does the MRL approach pull ahead. The extra training cost of MRL therefore appears worthwhile mainly when heavy truncation is the goal.
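As a minimal sketch of what "cutting a vector to a smaller size" means here, the snippet below keeps the leading coordinates of an embedding and rescales the result to unit length. The function names are hypothetical, and the L2 renormalization step is an assumption (common practice when comparing by cosine similarity), not something stated in the paper.

```python
import math


def truncate_embedding(vec, keep_dims):
    """Keep the first `keep_dims` coordinates, then L2-renormalize.

    Renormalization is an assumption of this sketch, not a step
    confirmed by the paper.
    """
    if not 0 < keep_dims <= len(vec):
        raise ValueError("keep_dims must be between 1 and len(vec)")
    head = vec[:keep_dims]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero prefix unchanged to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else list(head)


def dims_after_reduction(full_dims, reduction):
    """Dimensions left after removing the fraction `reduction` of them."""
    return max(1, round(full_dims * (1.0 - reduction)))


# Illustrative 4-dimensional vector (not real model output).
example = [3.0, 4.0, 12.0, 0.0]
print(truncate_embedding(example, 2))
print(dims_after_reduction(768, 0.8))
```

For a 768-dimensional encoder, an 80 percent reduction leaves roughly 154 dimensions, which is the regime where the paper reports that MRL training starts to pay off.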

Core claim

Truncated embeddings of models trained without Matryoshka Representation Learning are competitive with, and often outperform, models trained with MRL unless the embeddings are reduced in size by at least 80 percent.

What carries the argument

Side-by-side application of MRL-style truncation to both MRL-trained and standard text encoders, measuring downstream task performance.
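The side-by-side protocol can be sketched as follows: the same truncation sizes and the same metric are applied to every encoder, and each score is reported relative to that encoder's full-size score. Everything here is a stand-in: `synthetic_pairs` generates random vectors in place of real encoder output, top-1 retrieval accuracy replaces the paper's benchmark metrics, and the encoder names are hypothetical labels that carry no information about actual MRL behavior.

```python
import math
import random


def truncate(vec, k):
    """Prefix truncation followed by L2 normalization (assumed step)."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def top1_accuracy(queries, docs, k):
    """Share of queries whose best-scoring document has the same index."""
    q = [truncate(v, k) for v in queries]
    d = [truncate(v, k) for v in docs]
    hits = 0
    for i, qv in enumerate(q):
        scores = [sum(a * b for a, b in zip(qv, dv)) for dv in d]
        if max(range(len(d)), key=scores.__getitem__) == i:
            hits += 1
    return hits / len(q)


def relative_curve(queries, docs, sizes):
    """Score at each truncated size divided by the full-size score."""
    full = top1_accuracy(queries, docs, len(docs[0]))
    if full == 0:
        raise ValueError("full-size score is zero; ratio is undefined")
    return {k: top1_accuracy(queries, docs, k) / full for k in sizes}


def synthetic_pairs(n, dims, noise, seed):
    """Hypothetical encoder stand-in: each query is a noisy copy of its document."""
    rng = random.Random(seed)
    docs = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(n)]
    queries = [[x + rng.gauss(0, noise) for x in d] for d in docs]
    return queries, docs


SIZES = [64, 32, 16, 8]
encoders = {
    "non_mrl_stand_in": synthetic_pairs(n=50, dims=64, noise=0.5, seed=0),
    "mrl_stand_in": synthetic_pairs(n=50, dims=64, noise=0.5, seed=1),
}
curves = {name: relative_curve(q, d, SIZES) for name, (q, d) in encoders.items()}
for name, curve in curves.items():
    print(name, curve)
```

Reporting ratios to the full-size score keeps encoders with different absolute quality on a comparable scale, which mirrors the "relative to the corresponding full-size embeddings" presentation in the Figure 2 caption.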

Load-bearing premise

The non-MRL and MRL models are trained under sufficiently comparable conditions that performance differences can be attributed to the training method rather than other factors.

What would settle it

A controlled experiment in which MRL models consistently outperform non-MRL models at truncation levels below 80 percent reduction across the same architectures and tasks.
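The settling condition can be written as a simple check over a results table. This is a sketch under stated assumptions: the row layout, the function name `mrl_wins_below`, and every number in `results` are illustrative placeholders, not values from the paper.

```python
def mrl_wins_below(results, threshold=0.80):
    """True if MRL beats non-MRL in every row with reduction < threshold."""
    rows = [r for r in results if r["reduction"] < threshold]
    if not rows:
        raise ValueError("no rows below the reduction threshold")
    return all(r["mrl"] > r["non_mrl"] for r in rows)


# Placeholder scores for illustration only (not measured results).
results = [
    {"arch": "encoder-A", "task": "retrieval", "reduction": 0.50, "mrl": 0.60, "non_mrl": 0.62},
    {"arch": "encoder-A", "task": "retrieval", "reduction": 0.90, "mrl": 0.50, "non_mrl": 0.40},
]
print(mrl_wins_below(results))
```

Rows at or above the threshold are ignored on purpose, since the paper's claim concerns only the moderate-truncation regime.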

Figures

Figures reproduced from arXiv: 2605.16608 by Daniel Ruffinelli, Simone Paolo Ponzetto, Sotaro Takeshita, Yurina Takeshita.

Figure 1: (Top) Robustness of open text encoders as truncation levels increase looks the same whether trained with or without MRL. (Bottom) When models differ only in their use of MRL, truncation on non-MRL models is superior unless heavy truncation is applied.
Figure 2: Performance on NanoBEIR (top) and MTEB (bottom) of text embeddings truncated at various sizes, relative to the performance of the corresponding full-size embeddings.
Figure 4: Standard deviation across embedding dimen…
Figure 5: Validation loss curve for contrastive learning with and without MRL for all model pairs. Our training…
Figure 6: Absolute performance on NanoBEIR (top) and MTEB (bottom) of text embeddings by smaller models
Figure 7: Absolute performance on NanoBEIR (top) and MTEB (bottom) of text embeddings by larger models
Figure 8: Performance on BEIR and MTEB benchmarks of five pairs of encoders trained with and without MRL.
Figure 9: Standard deviations of values taken by each dimension when encoding different texts. We observe that…
Figure 10: Performance of smaller open text encoders in NanoBEIR datasets.
Figure 10: Standard deviations of values taken by each dimension when encoding different texts. We observe that…
Figure 11: Performance of larger open text encoders in NanoBEIR datasets.
Figure 12: Performance of smaller open text encoders in MTEB datasets.
Figure 12: Performance of larger open text encoders in NanoBEIR datasets.
Figure 13: Performance of larger open text encoders in MTEB datasets.
Figure 14: BERT base performance on each of the BEIR datasets.
Figure 14: Performance of larger open text encoders in MTEB datasets.
Figure 15: BERT large performance on each of the NanoBEIR datasets.
Figure 16: RoBERTa base performance on each of the BEIR datasets.
Figure 17: RoBERTa large performance on each of the NanoBEIR datasets.
Figure 17: RoBERTa base performance on each of the BEIR datasets.
Figure 18: T5 base performance on each of the NanoBEIR datasets.
Figure 19: BERT base performance on each of the MTEB datasets.
Figure 20: BERT large performance on each of the MTEB datasets.
Figure 20: BERT base performance on each of the MTEB datasets.
Figure 21: RoBERTa base performance on each of the MTEB datasets.
Figure 22: RoBERTa large performance on each of the MTEB datasets.
Figure 22: RoBERTa base performance on each of the MTEB datasets.
Figure 23: T5 base performance on each of the MTEB datasets.
Figure 24: T5 base performance on each of the MTEB datasets.
Abstract

Matryoshka Representation Learning (MRL) is a widely adopted approach for training text encoders so they provide useful text representations at various sizes, available by simply truncating the resulting vectors at sizes pre-determined at training time. Recent works have shown that randomly truncating text embeddings has minimal impact in downstream performance unless vectors are reduced in size by at least 70%, suggesting that embeddings are already robust to truncation without the use of MRL. However, no prior work has compared random truncation to MRL, so it is unclear how the two methods compare as effective embedding reduction methods. In this paper, we study this by applying the same truncation used by MRL to models trained with and without MRL. Our results across several models and downstream tasks show that, unless heavily truncating embeddings (i.e. reducing their size by at least 80%), truncated embeddings of non-MRL models are competitive with, and often outperform models trained with MRL. This suggests that truncation robustness may not necessarily come from MRL, and that the choice of spending the additional training cost of MRL depends on whether heavy truncation is desired. We make our code available for reproduction.
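For readers unfamiliar with the "additional training cost" mentioned in the abstract, the general shape of an MRL objective can be sketched as below. This follows the form commonly attributed to Kusupati et al. (2022); the symbols ($f_\theta$, $\mathcal{M}$, $c_m$, $\mathcal{L}_{\text{base}}$) are notation chosen for this sketch, and the exact base loss and weights used in this paper are not given on this page.

```latex
% Sketch of a Matryoshka-style objective (notation assumed for illustration).
% f_\theta(x)      : full embedding of input x, of dimension d
% \mathcal{M}      : set of nested sizes fixed at training time, each m <= d
% c_m              : non-negative weight for size m (often all equal to 1)
% \mathcal{L}_{\text{base}} : base training loss, e.g. a contrastive loss
\[
  \mathcal{L}_{\text{MRL}}(\theta)
  \;=\;
  \sum_{m \in \mathcal{M}} c_m \,
  \mathcal{L}_{\text{base}}\!\bigl( f_\theta(x)_{1:m} \bigr)
\]
% A non-MRL model corresponds to \mathcal{M} = \{ d \}: only the full-size term.
```

Under this reading, the extra cost of MRL comes from evaluating the base loss once per size in $\mathcal{M}$, and truncation at inference time simply keeps the prefix $f_\theta(x)_{1:m}$.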

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that text embeddings from non-MRL models remain competitive with (and often outperform) those from MRL-trained models when both are truncated in the same way, except when truncation is heavy (at least 80% size reduction). The claim rests on experiments that apply MRL-style truncation to several models across downstream tasks, and it leads to the conclusion that MRL's extra training cost is justified only for heavy-truncation use cases. Code is released for reproduction.

Significance. If the central empirical comparison holds under matched conditions, the result would indicate that truncation robustness is largely inherent to standard embeddings rather than requiring the MRL objective for moderate reductions, with potential implications for training efficiency. The release of reproduction code strengthens verifiability.

major comments (1)
  1. [Abstract] The central claim that performance differences can be attributed to the presence/absence of MRL requires that non-MRL and MRL models differ only in the training objective. No details are provided on whether base architectures, pre-training corpora, fine-tuning data, batch sizes, learning rates, or training steps were matched; without this, the truncation-robustness comparison cannot isolate the effect of MRL.
minor comments (1)
  1. The abstract would be clearer if it listed the specific model families, downstream tasks, and exact truncation ratios (e.g., 50%, 70%, 80%) used in the comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive feedback. We address the major comment below and will incorporate revisions as noted.

Point-by-point responses
  1. Referee: [Abstract] The central claim that performance differences can be attributed to the presence/absence of MRL requires that non-MRL and MRL models differ only in the training objective. No details are provided on whether base architectures, pre-training corpora, fine-tuning data, batch sizes, learning rates, or training steps were matched; without this, the truncation-robustness comparison cannot isolate the effect of MRL.

    Authors: We agree that a fully controlled experiment isolating only the MRL objective would require retraining matched models from scratch under identical conditions, which is outside the scope of this work. Our study instead evaluates publicly available models (both MRL-trained and standard non-MRL encoders) under identical truncation and evaluation protocols. While this prevents strict isolation of MRL's contribution from other training differences, the results still show that non-MRL models achieve competitive or superior performance under moderate truncation. We will revise the abstract and introduction, and add a limitations paragraph, to clarify that we compare existing models rather than claiming a pure causal effect of the MRL objective alone, and to note the lack of matched training details as a caveat. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

Full rationale

This paper performs an empirical study comparing truncation robustness of text embeddings from models trained with versus without Matryoshka Representation Learning (MRL). The abstract and full text describe training several models, applying truncation, and reporting downstream task performance; no equations, derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear. The central claim rests on experimental results rather than any chain that reduces to its own inputs by construction. The comparability of training conditions is an experimental design assumption (addressable via replication), not a circularity issue under the defined patterns. No steps qualify for the enumerated kinds of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical comparisons across models and tasks; the abstract introduces no free parameters, new axioms beyond standard embedding evaluation practices, or invented entities.

axioms (1)
  • Domain assumption: downstream task performance is a valid proxy for embedding quality.
    Standard assumption in text embedding research.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

    cs.CV 2026-06 unverdicted novelty 6.0

    MM-Matryoshka is a 2D Matryoshka training framework enabling budget-elastic ColPali-style multi-vector visual document retrieval along dimension and layer without separate models per budget.
