
arxiv: 2605.23033 · v1 · pith:6WJKGJBCnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Uncovering the Latent Potential of Deep Intermediate Representations

Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords intermediate representations, layer selection, transfer learning, representation geometry, foundation models, geometric regularization, deep networks, embedding subspaces

The pith

Task-relevant information in deep models is distributed non-monotonically across layers and cannot be recovered by naive aggregation of embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Foundational models produce a hierarchy of embeddings whose semantic content and geometry change with depth. The paper establishes that the information useful for downstream tasks does not increase monotonically toward the final layer and is missed by conventional practices of taking only the last representation or averaging layers. A geometric analysis shows that effective transfer requires locating the specific layers whose embeddings best encode task-discriminative structure under orthogonality and isotropy constraints. The authors supply an explicit selection procedure together with a regularization term that aligns fine-tuning to this structure, yielding accuracy gains that widen with greater model depth.

Core claim

The central claim is that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naïve aggregation. Effective transfer requires identifying which layers encode task-discriminative structure based on their geometric organization. The authors introduce LOES, a spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. They also propose GeoReg to enforce simplicial structure on class manifolds during fine-tuning. This yields consistent outperformance across architectures and modalities, with gains increasing as depth grows, while exposing layer-wise semantic distributions.

What carries the argument

Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints.
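The idea of ranking layers by a residual error can be sketched in a few lines. The provided text does not define the actual LOES objective or its orthogonality and isotropy constraints, so the following is only a stand-in under stated assumptions: each layer is scored by the fraction of embedding variance left unexplained by class means (the residual of a least-squares fit of embeddings on one-hot labels), and the `k` lowest-scoring layers are kept. The names `residual_ratio` and `select_layers`, and the toy data, are hypothetical.

```python
from collections import defaultdict


def residual_ratio(embeddings, labels):
    """Fraction of embedding variance not explained by class means (lower is better).

    Assumption: this is a simple proxy, NOT the LOES objective from the paper.
    """
    n, dim = len(embeddings), len(embeddings[0])
    global_mean = [sum(x[j] for x in embeddings) / n for j in range(dim)]

    by_class = defaultdict(list)
    for x, y in zip(embeddings, labels):
        by_class[y].append(x)

    # Within-class scatter: squared distance of each sample to its class mean.
    within = 0.0
    for rows in by_class.values():
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        within += sum((r[j] - mean[j]) ** 2 for r in rows for j in range(dim))

    # Total scatter: squared distance of each sample to the global mean.
    total = sum((x[j] - global_mean[j]) ** 2 for x in embeddings for j in range(dim))
    return within / total if total > 0 else 1.0


def select_layers(layer_embeddings, labels, k):
    """Return the k layer indices with the lowest residual ratio, plus all scores."""
    scores = {layer: residual_ratio(emb, labels) for layer, emb in layer_embeddings.items()}
    selected = sorted(sorted(scores, key=scores.get)[:k])
    return selected, scores


# Hypothetical toy data: four samples, two classes, three "layers".
labels = [0, 0, 1, 1]
layer_embeddings = {
    0: [[0.0, 1.0], [1.0, 0.0], [0.0, 0.9], [1.0, 0.1]],  # classes overlap
    1: [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]],  # classes well separated
    2: [[0.0, 0.0], [0.4, 0.0], [1.0, 0.0], [0.6, 0.0]],  # partially separated
}
selected, scores = select_layers(layer_embeddings, labels, k=2)
print(selected, scores)
```

In this toy setup the middle layer scores best and the first layer worst, which is one way a non-monotonic profile across depth could show up. A faithful implementation would need the paper's objective and constraints.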

If this is right

  • LOES outperforms standard baselines across architectures, depths, modalities, and data regimes.
  • Performance gains from the method increase as model depth grows.
  • The selection reveals how semantic factors are distributed across layers, supporting cross-lingual and cross-modal interpretability.
  • Enforcing simplicial class-manifold structure during fine-tuning stabilizes representation geometry.
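The last point, enforcing simplicial structure on class manifolds, can be illustrated with a small penalty on class centroids. The GeoReg loss itself is not given in the provided text, so this sketch rests on an assumption borrowed from the neural-collapse literature: for C classes, centered and normalized class centroids forming a regular simplex have pairwise cosine similarity -1/(C-1), and the penalty measures the mean squared deviation from that target. The function names are hypothetical, and this pure-Python version only evaluates the penalty; using it during fine-tuning would require an autodiff framework.

```python
import math
from collections import defaultdict


def class_centroids(features, labels):
    """Mean feature vector per class, ordered by label."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return [[sum(col) / len(rows) for col in zip(*rows)]
            for _, rows in sorted(groups.items())]


def simplex_penalty(centroids, eps=1e-12):
    """Mean squared gap between pairwise centroid cosines and -1/(C-1).

    Assumption: a regular-simplex target, not the paper's GeoReg definition.
    Requires at least two classes.
    """
    c, dim = len(centroids), len(centroids[0])
    mean = [sum(v[j] for v in centroids) / c for j in range(dim)]

    # Center each centroid on the global centroid mean and normalize to unit length.
    unit = []
    for v in centroids:
        centered = [v[j] - mean[j] for j in range(dim)]
        norm = math.sqrt(sum(a * a for a in centered))
        unit.append([a / (norm + eps) for a in centered])

    target = -1.0 / (c - 1)
    pairs = [(i, j) for i in range(c) for j in range(i + 1, c)]
    gaps = [(sum(a * b for a, b in zip(unit[i], unit[j])) - target) ** 2
            for i, j in pairs]
    return sum(gaps) / len(pairs)


# Hypothetical example: one sample per class, so centroids equal the features.
s = math.sqrt(3) / 2
triangle = class_centroids([[1.0, 0.0], [-0.5, s], [-0.5, -s]], [0, 1, 2])
collinear = class_centroids([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]], [0, 1, 2])
print(simplex_penalty(triangle), simplex_penalty(collinear))
```

Centroids at the corners of an equilateral triangle give a penalty near zero, while collinear centroids, a degenerate arrangement, give a large one.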

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures could be modified to expose or preserve selected intermediate outputs rather than routing everything through the final layer.
  • The same geometric selection criterion might be used to decide which layers to prune or distill without retraining the entire network.
  • The non-monotonic pattern may differ systematically between vision and language models, offering a diagnostic for modality-specific layer roles.

Load-bearing premise

Identifying the layers that encode task-discriminative structure by minimizing residual error under orthogonality and isotropy constraints is what makes transfer effective.

What would settle it

An experiment in which LOES-selected intermediate-layer embeddings yield no accuracy improvement over the final layer on a held-out transfer task.

Figures

Figures reproduced from arXiv: 2605.23033 by Aniket Khandelwal, Anubha Gupta, Arnesh Batra, Arush Gumber, Jashn Khemani.

Figure 1
Figure 1: Standard vs. LOES-based transfer learning: (a) conventional transfer learning uses a single encoder layer, typically the final layer, for downstream prediction, whereas (b) LOES selects and fuses multiple task-relevant layers from the encoder hierarchy using target supervision, enabling transfer that exploits complementary information across layers. 1. Introduction While foundational models (Baevski et a…
Figure 2
Figure 2: GeoReg prevents representation collapse during fine-tuning. Validation accuracy with trainable BERT Base (Devlin et al., 2019) on TweetEval - Emoji (Barbieri et al., 2020; 2018) Dataset. Without GeoReg (green), accuracy degrades after ∼10k steps despite initial gains. GeoReg (magenta) maintains stable performance. The last-layer baseline (blue) exhibits similar collapse. Dots indicate best checkpoint. sel…
Figure 3
Figure 3: Layer-wise representation geometry for CLIP-B/32 on Stanford Cars. Effective rank (top; higher means more dimensions contribute) and isotropy score (bottom; higher means a flatter covariance eigenspectrum) peak in mid layers. Stars mark LOES-selected layers, which align with high-rank, near-isotropic representations. tions where centroids span low volume, indicating collapse toward a degenerate subspace.…
Figure 4
Figure 4: LOES score distribution across encoder depth (lower is better). Models pretrained exclusively on ImageNet (ViT-IN21k, MAE, DeiT) exhibit monotonically decreasing scores toward final layers, indicating task-discriminative information concentrates at depth. CLIP, pretrained on 400M diverse image-text pairs, shows comparatively flatter profiles with competitive scores in mid-depth layers, consistent with the …
Figure 5
Figure 5: LOES boosts performance and leads to faster convergence across multiple downstream tasks like classification, segmentation and regression using popular foundation models like DINOv2, ModernBERT and CLIP. pendix Table A13), LOES consistently selects early layers alongside the final layer ([0, 9, 11] for DINOv2; [0, 3, 11] for BEiT), with BEiT showing the largest gain (+3.14 mIoU on Cityscapes). We additiona…
Figure 6
Figure 6: Cross-lingual evaluation on Amazon Massive Scenario (mBERT-base). Left: LOES (k=4) outperforms baselines, with larger gains on underrepresented languages (Hindi +6.5%, Arabic +7.2%, Urdu +10.9% over last-3). Right: LOES consistently selects mid-depth layer 6 alongside the final layer across languages, indicating cross-lingually transferable structure at intermediate depths. Cross-lingual results on Amaz…
Figure 7
Figure 7: 2D sensitivity sweep over α and γ on MTOP (ModernBERT-base, K=4). The accuracy surface is a broad plateau: the default (α=1.0, γ=0.5) reaches 95.90%, within 0.25 percentage points of the grid optimum (96.15%), and even the weakest configuration in the grid (94.80%) outperforms the last-layer baseline (81.37%) by more than 13 points. LOES is therefore insensitive to fine hyperparameter tuning within a wide …
Figure 8
Figure 8: t-SNE visualizations comparing standard last-layer representations with LOES-selected layer fusion on CIFAR-100 (top, using DINOv2) and ASVspoof 2019 (bottom, using Wav2Vec 2.0). On CIFAR-100, simple concatenation of the last three layers exhibits moderate class mixing, whereas LOES (k = 3; layers 6, 7, and last) produces tighter and better-separated clusters, demonstrating the advantage of selective layer…
Figure 9
Figure 9: Epoch-wise validation accuracy comparing LOES (k=3) and last-layer baselines across datasets and models.
Figure 10
Figure 10: Normalized eigenspectrum across encoder layers on Stanford Cars. Each row shows the top-50 eigenvalues (log-scale) of the layer-wise covariance matrix; brighter colors indicate higher eigenvalues. Green borders mark LOES-selected layers. CLIP selects mid-depth layers (4–6) with flatter spectra, while DINOv2 selects later layers (7, 10, 11), reflecting their distinct pretraining paradigms. A.6.2. EIGENSPEC…
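The captions for Figures 3 and 10 describe layers by effective rank and by how flat the covariance eigenspectrum is. A minimal sketch of both quantities follows, under explicit assumptions: effective rank is taken as the exponential of the Shannon entropy of the normalized eigenvalues (a common definition), and the isotropy score is taken as that entropy divided by its maximum, log(d). The paper's exact definitions are not in the provided text, and the function names are hypothetical.

```python
import math


def effective_rank(eigenvalues):
    """exp(entropy) of the normalized eigenvalue distribution; ranges from 1 to d."""
    vals = [max(v, 0.0) for v in eigenvalues]  # clamp tiny negative numerical noise
    total = sum(vals)
    probs = [v / total for v in vals if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def isotropy_score(eigenvalues):
    """Normalized spectral entropy in [0, 1]; 1 means a perfectly flat spectrum.

    Assumption: one plausible flatness measure, not the paper's definition.
    """
    d = len(eigenvalues)
    if d < 2:
        return 1.0
    return math.log(effective_rank(eigenvalues)) / math.log(d)


flat = [1.0, 1.0, 1.0, 1.0]      # all directions contribute equally
peaked = [1.0, 0.0, 0.0, 0.0]    # collapsed onto a single direction
mixed = [4.0, 2.0, 1.0, 1.0]     # somewhere in between

for name, spectrum in [("flat", flat), ("peaked", peaked), ("mixed", mixed)]:
    print(name, effective_rank(spectrum), isotropy_score(spectrum))
```

In practice the eigenvalues would come from the covariance matrix of one layer's embeddings, for example via `numpy.linalg.eigvalsh`; they are passed in directly here to keep the sketch dependency-free.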
read the original abstract

Foundational Models pretrained on huge amounts of data learn representations that evolve across depth, forming a hierarchy of embeddings with distinct semantic content and geometric structure. Contrary to the widespread practice of using only the final layer or shallow mixtures, we show that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naïve aggregation. Through a geometric and empirical study across multiple modalities, we show that effective transfer depends on identifying which layers encode task-discriminative structure and how their embeddings are geometrically organized. We introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. To align fine-tuning with this selection principle, we further propose Geometric Regularization Loss (GeoReg), which enforces a simplicial structure on class manifolds and stabilizes representation geometry during fine-tuning. Across a wide range of architectures, depths, modalities, and data regimes, LOES consistently outperforms standard baselines, with gains that grow as model depth increases. Beyond accuracy, our method reveals how semantic factors are distributed across layers, thereby enabling cross-lingual and cross-modal interpretability analyses. Together, our results provide strong evidence that layerwise embedding geometry is not incidental but central to how deep models represent and transfer knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that task-relevant information in pretrained foundational models is distributed non-monotonically across layers and cannot be recovered by naïve aggregation of embeddings. It introduces LOES, a constructive spectral method that selects task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints, along with GeoReg, a regularization loss that enforces simplicial structure on class manifolds during fine-tuning. Empirical results across architectures, depths, modalities, and data regimes are said to show LOES outperforming standard baselines (with gains increasing with depth) while also enabling cross-lingual and cross-modal interpretability analyses.

Significance. If the central claims hold with rigorous validation, the work would offer a principled geometric approach to exploiting intermediate representations, challenging the default use of final-layer embeddings in transfer learning and providing tools for both performance gains and interpretability. The constructive, parameter-light character of LOES and the reported depth-scaling behavior would be notable strengths.

minor comments (2)
  1. The abstract asserts consistent outperformance and non-monotonicity but supplies no dataset names, model architectures, validation protocols, error bars, or statistical tests; these details are required to evaluate the empirical claims.
  2. Notation for the spectral method (e.g., the precise residual-error objective and the orthogonality/isotropy constraints) is not defined in the provided text, making it impossible to verify whether LOES is parameter-free or reduces to a known procedure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for acknowledging the potential significance of a geometric approach to layer-wise embeddings. No specific major comments were provided in the report, so we have no point-by-point responses to offer at this stage. We remain available to address any additional questions or clarifications the referee may raise.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and available description introduce LOES as a spectral method that minimizes residual error under orthogonality and isotropy constraints, with empirical outperformance reported across depths and modalities. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations are present in the supplied text. The central claims rest on geometric analysis and experimental results rather than reducing to input definitions or prior author work by construction. This matches the expected non-circular case for a method-proposal paper whose derivation chain is not shown to collapse internally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5773 in / 1131 out tokens · 23946 ms · 2026-05-25T05:34:38.188064+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 37 canonical work pages · 12 internal anchors

  1. [1] ImageBind: One embedding space to bind them all. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

  2. [2] BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers).

  3. [4] The geometry of categorical and hierarchical concepts in large language models. arXiv preprint arXiv:2406.01506.

  4. [5] Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477.

  5. [6] Tracing the representation geometry of language models from pretraining to post-training. arXiv preprint arXiv:2509.23024.

  6. [7] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.

  7. [8] SegDINO: An efficient design for medical and natural image segmentation with DINO-v3. arXiv preprint arXiv:2509.00833.

  8. [9] What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models. arXiv preprint arXiv:2004.06499.

  9. [10] Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10...

  10. [11] Learnable Layer Selection and Model Fusion for Speech Self-Supervised Learning Models. Proc. Interspeech 2024.

  11. [12] Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.

  12. [13] SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems.

  13. [14] Similarity of neural network representations revisited. International conference on machine learning, 2019.

  14. [15] On exact computation with an infinitely wide neural net. Advances in neural information processing systems.

  15. [16] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

  16. [17] The prevalence of neural collapse in neural multivariate regression. Advances in Neural Information Processing Systems.

  17. [18] Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.

  18. [19] Task-Driven Kernel Flows: Label Rank Compression and Laplacian Spectral Filtering. arXiv preprint arXiv:2601.00276.

  19. [20] Anisotropy is inherent to self-attention in transformers. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).

  20. [21] The Low-Rank Simplicity Bias in Deep Networks, March 2023.

  21. [22] Ridge regularization: An essential concept in data science. Technometrics, 2020.

  22. [23] Updating the inverse of a matrix. SIAM review, 1989.

  23. [24] Low anisotropy sense retrofitting (laser): Towards isotropic and sense enriched representations. Proceedings of deep learning inside out (DeeLIO): The 2nd workshop on knowledge extraction and integration for deep learning architectures.

  24. [25] Shrink the longest: improving latent space isotropy with simplicial geometry. International Conference on Analysis of Images, Social Networks and Texts, 2024.

  25. [26] VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. arXiv preprint arXiv:2105.04906.

  26. [27] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. arXiv preprint arXiv:2511.08544.

  27. [28] Liu, Nelson F. and Gardner, Matt and Belinkov, Yonatan and Peters, Matthew E. and Smith, Noah A. Linguistic Knowledge and Transferability of Contextual Representations. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi...

  28. [29] Tenney, Ian and Das, Dipanjan and Pavlick, Ellie. BERT Rediscovers the Classical NLP Pipeline. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1452

  29. [30] Voita, Elena and Sennrich, Rico and Titov, Ivan. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2...

  30. [31] de Vries, Wietse and van Cranenburgh, Andreas and Nissim, Malvina. What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.389

  31. [32] Efficient Streaming Language Models with Attention Sinks. ArXiv.

  32. [33] On Identifiability in Transformers. arXiv: Computation and Language.

  33. [34] When Attention Sink Emerges in Language Models: An Empirical View. ArXiv.

  34. [35] Why do LLMs attend to the first token? ArXiv.

  35. [36] The Geometry of Categorical and Hierarchical Concepts in Large Language Models. ArXiv.

  36. [37] Language Modeling Is Compression. ArXiv.

  37. [38] Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning. Trans. Mach. Learn. Res.

  38. [39] Bolya, Daniel and Huang, Po-Yao and Sun, Peize and Cho, Jang Hyun and Madotto, Andrea and Wei, Chen and Ma, Tengyu and Zhi, Jiale and Rajasegaran, Jathushan and Rasheed, Hanoona and Wang, Junke and Monteiro, Marco and Xu, Hu and Dong, Shiyu and Ravi, Nikhila and Li, Daniel and Dollár, Piotr and Feichtenhofer, Christoph. Perception Encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504...

  39. [40] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv. Layer by Layer: Uncovering Hidden Representations in Language Models. arXiv preprint arXiv:2502.02013.

  40. [41] Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers? 2025.

  41. [42] The remarkable robustness of LLMs: Stages of inference? arXiv preprint arXiv:2406.19384.

  42. [43] Not all layers of LLMs are necessary during inference. arXiv preprint arXiv:2403.02181.

  43. [44] Colorful image colorization. European conference on computer vision, 2016.

  44. [45] Advancing chart question answering with robust chart component recognition. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

  45. [46] Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541.

  46. [47] Generative pretraining from pixels. International conference on machine learning, 2020.

  47. [48] An empirical study of autoregressive pre-training from videos. arXiv preprint arXiv:2501.05453.

  48. [49] The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training. arXiv preprint arXiv:2502.10927.

  49. [50] RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. International conference on machine learning, 2023.

  50. [51] LevyScore: A Fast Sample-Wise Confidence Score of Pretrained Joint Embedding Model.

  51. [52] Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136.

  52. [53] Emergence of a high-dimensional abstraction phase in language transformers. arXiv preprint arXiv:2405.15471.

  53. [54] The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. Findings of the Association for Computational Linguistics: EACL 2024.

  54. [55] SUN database: Large-scale scene recognition from abbey to zoo. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2010.5539970

  55. [56] SUN Database: Exploring a Large Collection of Scene Categories. International Journal of Computer Vision, volume 119.

  56. [57] Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and Alexander C. Berg and Li Fei-Fei. 2015. doi:10.1007/s11263-015-0816-y

  57. [58] Collecting a large-scale dataset of fine-grained cars.

  58. [59] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. 2024.

  59. [60] Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.

  60. [61] DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.

  61. [62] DINOv3. arXiv preprint arXiv:2508.10104.

  62. [63] Training data-efficient image transformers & distillation through attention. International conference on machine learning, 2021.

  63. [64] Learning transferable visual models from natural language supervision. International conference on machine learning, 2021.

  64. [65] wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems.

  65. [66] The Caltech-UCSD Birds-200-2011 dataset. 2011.

  66. [67] Learning multiple layers of features from tiny images. 2009.

  67. [68] Describing textures in the wild. Proceedings of the IEEE conference on computer vision and pattern recognition.

  68. [69] MTEB: Massive text embedding benchmark. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.

  69. [70] MMTEB: Massive multilingual text embedding benchmark. arXiv preprint arXiv:2502.13595.

  70. [71] Saravia, Elvis and Liu, Hsien-Chi Toby and Huang, Yen-Hao and Wu, Junlin and Chen, Yi-Shin. doi:10.18653/v1/D18-1404

  71. [72] MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. arXiv:2204.08582.

  72. [73] O. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

  73. [74] Li, Haoran and Arora, Abhinav and Chen, Shuohui and Gupta, Anchit and Gupta, Sonal and Mehdad, Yashar. doi:10.18653/v1/2021.eacl-main.257

  74. [75] Casanueva, I. Efficient Intent Detection with Dual Sentence Encoders. Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI.

  75. [76] Maggie, Phil Culliton, Wei Chen. Tweet Sentiment Extraction.

  76. [77] cjadams and Daniel Borkan and inversion and Jeffrey Sorensen and Lucas Dixon and Lucy Vasserman and nithum. Jigsaw Unintended Bias in Toxicity Classification.

  77. [78] ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 2020.

  78. [79] CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 2014.

  79. [80] Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. ArXiv e-prints.

  80. [81] Asaniczka. Amazon Products Dataset 2023 (1.4M Products). doi:10.34740/KAGGLE/DS/3798081

Showing first 80 references.