
arxiv: 2605.09352 · v1 · submitted 2026-05-10 · 💻 cs.AI

The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal convergence · directional analysis · representation learning · information bottleneck · language attractor · neural alignment · feature density

The pith

Language's semantic structure serves as the attractor for convergence of representations from other modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why neural networks trained separately on images, point clouds, and text develop aligned internal representations. By introducing an asymmetric measure called cycle-kNN, it reveals that non-language modalities consistently shift their neighborhood structures toward those of language models, while the reverse movement is weaker. This directional pattern appears across model families and scales, remaining hidden under standard symmetric similarity checks. The authors trace the asymmetry to language's more compact feature regions and the effects of information compression during optimization. If the pattern holds, language provides the natural endpoint toward which multimodal representations evolve.

Core claim

Directional analysis with cycle-kNN across dozens of independently trained unimodal models shows non-language modalities move toward the neighborhood structure of language significantly more than the reverse. Mechanistic traces link this to feature density asymmetry, where language occupies the most compact regions of space. The Information Bottleneck framework interprets the directionality as the result of compression favoring discrete, compositional forms. This leads to the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.

What carries the argument

The argument rests on cycle-kNN, an asymmetric alignment measure built on cycle-consistent nearest neighbors, which exposes directional convergence that symmetric metrics cannot see.
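The summary does not give the paper's exact definition of cycle-kNN, so the following is only a minimal sketch of one plausible cycle-consistency score. It assumes paired samples (row i of both arrays describes the same item), cosine similarity, and a hypothetical reading of the direction "src → tgt"; the names `knn_indices`, `cycle_knn`, and `directional_gap` are illustrative and do not come from the paper.

```python
import numpy as np

def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbours by cosine similarity, self excluded."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a point is never its own neighbour
    return np.argsort(-sim, axis=1)[:, :k]

def cycle_knn(src: np.ndarray, tgt: np.ndarray, k: int = 10) -> float:
    """Assumed score: share of samples i that are recovered by the cycle
    i -> neighbours of i in src space -> neighbours of those in tgt space."""
    nn_src = knn_indices(src, k)            # shape (n, k)
    nn_tgt = knn_indices(tgt, k)            # shape (n, k)
    two_hop = nn_tgt[nn_src]                # shape (n, k, k)
    own_index = np.arange(src.shape[0])[:, None, None]
    return float((two_hop == own_index).any(axis=(1, 2)).mean())

def directional_gap(a: np.ndarray, b: np.ndarray, k: int = 10) -> float:
    """Gap S(A -> B) - S(B -> A); zero whenever the two directions agree."""
    return cycle_knn(a, b, k) - cycle_knn(b, a, k)
```

Under this reading the score is asymmetric because the two directions count different things: one counts the samples that start a consistent cycle, the other the samples that end one. Whether this matches the paper's measure would need to be checked against its methods section.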

If this is right

  • Multimodal training will favor language-like discrete and compositional structures under continued optimization.
  • The asymmetry persists uniformly across scales and architectures.
  • Symmetric similarity measures will continue to miss the underlying direction of convergence.
  • Information compression objectives inherently bias representations toward language forms.
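The third point follows directly from the definition of the directional gap used in the figure captions. As a short worked statement, writing S for any alignment score and S_sym for a symmetric one such as CKA or mutual kNN (the subscript is notation introduced here for illustration):

```latex
\Delta(A, B) = S(A \to B) - S(B \to A),
\qquad
S_{\mathrm{sym}}(A \to B) = S_{\mathrm{sym}}(B \to A)
\;\Longrightarrow\;
\Delta_{\mathrm{sym}}(A, B) = 0 .
```

A symmetric score therefore cannot register direction no matter how strong the underlying asymmetry is; only a measure with S(A → B) ≠ S(B → A) in general can yield a nonzero gap.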

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Anchoring new multimodal systems to pretrained language representations could speed up alignment.
  • The same directional logic may appear in human cross-modal learning when language is involved.
  • Architectures that deliberately increase feature density in non-language streams could reduce the observed pull.

Load-bearing premise

The observed directional asymmetry arises from language's feature density and compression dynamics rather than from training artifacts, data distributions, or properties of the cycle-kNN measure.
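The density quantity behind this premise is described in the Figure 6 caption as the pairwise mean distance D on ℓ2-normalised features. A minimal sketch of such a statistic, assuming Euclidean distance between unit vectors (the caption does not name the metric) and using the hypothetical function name `pairwise_mean_distance`:

```python
import numpy as np

def pairwise_mean_distance(feats: np.ndarray) -> float:
    """Mean Euclidean distance over all distinct pairs of l2-normalised rows.

    Lower values mean a more compact (denser) representation.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    gram = normed @ normed.T
    # For unit vectors, ||u - v||^2 = 2 - 2 * (u . v); clip guards against rounding.
    sq_dist = np.clip(2.0 - 2.0 * gram, 0.0, None)
    upper = np.triu_indices(feats.shape[0], k=1)  # each unordered pair once
    return float(np.sqrt(sq_dist[upper]).mean())
```

Because the rows are normalised first, D ignores embedding norms and lies between 0 (all vectors identical in direction) and 2 (antipodal vectors), which makes values comparable across models with different feature scales.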

What would settle it

A collection of models or a controlled experiment in which the directional preference disappears once representational density is matched or when a different asymmetric measure is applied.

Figures

Figures reproduced from arXiv: 2605.09352 by Chao Tao, Dongyue Wu, Haifeng Li, Jiajie Teng, Jingdong Chen, Run Shao, Zhaoyang Zhang.

Figure 1: Overview of directional convergence analysis. (a) Symmetric measures (e.g., CKA) detect convergence but cannot reveal its direction. (b) CYCLE-KNN is inherently asymmetric; analyzing both directions reveals a consistent directional bias: vision representations approach language more than the reverse (∆ = +0.010, p < 0.05, across all 22 model pairs). (c) Three modalities span an abstraction hierarchy; repre…
Figure 2: Directional asymmetry across modality pairs. Bars show the directional gap ∆ = CYCLE-KNN(A → B) − CYCLE-KNN(B → A) for each cross-modality combination (k = 10). All ∆ > 0, confirming convergence toward the more abstract modality. Gray markers indicate symmetric measures (CKA, mutual kNN), which yield ∆ ≡ 0 by construction on the same model pairs. ** p < 0.01, * p < 0.05 (permutation test, n = 1000). We extrac…
Figure 3: Systematic directional asymmetry across all model pairs. (a) CYCLE-KNN (Vision → Language) and (b) CYCLE-KNN (Language → Vision) score matrices for all 22 vision × 29 language model pairs (k = 10, WiT-1024 dataset). Both panels share the same color scale. Panel (a) is systematically brighter than (b), confirming that Language neighborhoods are more coherent when probed from Vision. (c) Element-wise differen…
Figure 4: Intra-modality representational consensus. Top: pairwise CKA heatmaps (shared color scale). Bottom: violin plots confirm ordering Language > Vision > Point Cloud (p < 0.001, Mann–Whitney U). Consistent with language being the convergence attractor
Figure 5: Scale-invariant directionality. Per-model ∆m vs. parameter count for ten model families. (a) Vision↔Language. (b) PC↔Language. 60/61 combinations (98.4%) have ∆ > 0, confirming scale-invariance.
[Plot text recovered after the caption: x-axis Normalized Layer Depth; y-axis Pairwise Mean Distance D; annotations "dense → sparse", "inverted-U", "low D (compact)"; legend Vision (n=4), Language (n=4), Point Cloud (n=3).]
Figure 6: Layer-wise representational density. Pairwise mean distance D (on ℓ2-normalised features) across normalised layer depth for representative models from each modality. Vision models (blue) show monotonically increasing D (dense→sparse); language models (red) follow an inverted-U pattern, reaching maximum compactness at later layers; point cloud models (green) show variable density patterns. Bold lines: per-m…
Figure 7: Synthetic validation: ∆ increases monotonically with density ratio ρ. Each panel shows a different manifold generator (8 types spanning 1D–3D intrinsic dimensionality). X is a compact reference (σ_base noise) and Y is dispersed with noise scaled by ρ ∈ [1, 5]. All curves confirm that ∆ = S(Y→X) − S(X→Y) > 0 once ρ > 1, and increases monotonically, validating that CYCLE-KNN correctly detects asymmetric nei…
Figure 8: Layer-pair CYCLE-KNN heatmaps for representative model pairs from each cross-modality combination. Each panel shows the CYCLE-KNN score (color) for all layer combinations between a source model (y-axis) and a target model (x-axis). Top row: Language→Vision (Qwen2-0.5B → ViT-base), Vision→Language (ViT-base → Qwen2-0.5B), and 3D→Language (PointGPT → Qwen2). Bottom row: Language→3D, 3D→Vision, and Vision→3D…
Figure 9: k-Sensitivity analysis of directional asymmetry. (a) The directional gap ∆ = S(A→B) − S(B→A) remains positive and stable across k ∈ {1, 3, 5, 10, 20, 50} for all three direction pairs. The sign of ∆ never flips, confirming that the observed directionality is not an artifact of the specific neighborhood size. (b) Permutation-test p-values remain below 0.05 for all conditions, indicating statistical signif…
Figure 10
Figure 11: Layer-wise pairwise mean distance D curves for all models, grouped by modality: Language (29 models, left), Vision (22 models, center), and Point Cloud (7 models, right). Individual model curves are shown in light color; the bold curve indicates the modality mean. Language models exhibit an inverted-U profile (compression in final layers), vision models show monotonic increase, and point cloud models main…
Original abstract

Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales--yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces cycle-kNN, an asymmetric alignment measure, to analyze directional convergence in representations from independently trained unimodal models across point clouds, vision, and language. It reports a consistent asymmetry in which non-language modalities align toward language neighborhood structure more than the reverse, invisible to symmetric measures, and attributes this to language's greater feature density and compactness. Using the Information Bottleneck framework, it interprets the pattern as evidence that language semantic structure acts as an asymptotic attractor, formalizing this as the Wittgensteinian Representation Hypothesis.

Significance. If the directional asymmetry is robustly established and causally tied to intrinsic properties of language representations under compression, the work would offer a new lens on multimodal representation learning and a methodological tool (cycle-kNN) for detecting directionality that symmetric metrics miss. The cross-scale and cross-family consistency is a positive empirical observation, though its interpretation requires stronger controls.

major comments (3)
  1. [Mechanistic analysis] The mechanistic analysis tracing directionality to feature density asymmetry does not include controlled ablations that hold architecture, training objective, and data statistics fixed while varying only modality; without these, confounds from model families, optimization trajectories, or data distributions cannot be ruled out as the source of the observed cycle-kNN asymmetry.
  2. [Results on directional convergence] The results reporting consistent directional asymmetry across dozens of models supply no quantitative metrics, error bars, statistical tests, or controls for cycle-kNN sensitivity to embedding norms or tokenization granularity, leaving the strength of the central claim unclear.
  3. [Discussion and hypothesis formalization] The Wittgensteinian Representation Hypothesis is constructed directly from the directional observations and interpreted via the Information Bottleneck without an independent derivation, out-of-sample prediction, or falsifiable test that would distinguish the attractor claim from alternative explanations.
minor comments (2)
  1. [Methods] Clarify the precise definition and implementation details of cycle-kNN (e.g., choice of k, handling of ties) in the methods section to allow reproducibility.
  2. [Introduction] Add explicit references to prior work on the Information Bottleneck in representation learning and on asymmetric similarity measures to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the strengths and areas for improvement in our work. We address each major comment point by point below, providing our response and indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: The mechanistic analysis tracing directionality to feature density asymmetry does not include controlled ablations that hold architecture, training objective, and data statistics fixed while varying only modality; without these, confounds from model families, optimization trajectories, or data distributions cannot be ruled out as the source of the observed cycle-kNN asymmetry.

    Authors: We acknowledge the value of fully isolated ablations, but note that such controls are inherently limited by the distinct nature of modalities (e.g., point clouds vs. images vs. text require different data collection, preprocessing, and model architectures). Our empirical design instead leverages diversity: the directional asymmetry is observed consistently across multiple independent model families and scales per modality, which reduces the likelihood of family-specific or trajectory-specific artifacts. We will revise the mechanistic section to explicitly discuss these potential confounds, provide additional details on feature density measurement, and include sensitivity checks where feasible. However, we cannot perform the exact controlled experiments requested without new data collection outside the current scope. revision: partial

  2. Referee: The results reporting consistent directional asymmetry across dozens of models supply no quantitative metrics, error bars, statistical tests, or controls for cycle-kNN sensitivity to embedding norms or tokenization granularity, leaving the strength of the central claim unclear.

    Authors: We agree that adding quantitative rigor will strengthen the claims. In the revised manuscript, we will report average cycle-kNN asymmetry values with standard deviations and error bars across the model sets, include statistical tests (e.g., paired Wilcoxon signed-rank tests) to assess significance of the directional effect, and add controls by analyzing normalized embeddings and varying tokenization granularity for language models. These updates will be incorporated into the results and methods sections. revision: yes

  3. Referee: The Wittgensteinian Representation Hypothesis is constructed directly from the directional observations and interpreted via the Information Bottleneck without an independent derivation, out-of-sample prediction, or falsifiable test that would distinguish the attractor claim from alternative explanations.

    Authors: The hypothesis is motivated by the observed pattern and the Information Bottleneck as an interpretive lens rather than a standalone derivation. To address this, we will add a more formal mathematical statement of the hypothesis, outline specific falsifiable predictions (such as convergence patterns for modalities with controlled feature densities or under varying compression), and discuss how to distinguish the attractor account from alternatives like data-distribution effects. These elements will be added to the discussion section. revision: yes
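The significance testing promised in response 2 could take several forms. The figure captions mention a permutation test with n = 1000 but give no procedure, so the following is only a sketch of one common choice: a one-sided sign-flip permutation test on per-pair directional gaps. The function name `sign_flip_pvalue` and its parameters are hypothetical, and the input is assumed to be one gap ∆ per model pair.

```python
import numpy as np

def sign_flip_pvalue(gaps, n_perm: int = 1000, seed: int = 0) -> float:
    """One-sided p-value for mean(gaps) > 0 under random sign flips.

    Null hypothesis (an assumption of this sketch): each gap is as likely
    to be negative as positive, i.e. there is no preferred direction.
    """
    gaps = np.asarray(gaps, dtype=float)
    rng = np.random.default_rng(seed)
    observed = gaps.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, gaps.size))
    null_means = (signs * gaps).mean(axis=1)
    # Add-one correction keeps the estimate strictly positive.
    return float((1 + np.sum(null_means >= observed)) / (n_perm + 1))
```

A sign-flip test makes no distributional assumption beyond symmetry under the null, which suits small collections of model pairs; the paired Wilcoxon signed-rank test named by the authors is a rank-based alternative answering the same question.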

Standing simulated objections (not resolved)
  • Fully controlled ablations that hold architecture, training objective, and data statistics fixed while varying only modality are not feasible in this study due to fundamental differences in how data and models are constructed for each modality.

Circularity Check

1 step flagged

Hypothesis formalization restates observed asymmetry without independent derivation

specific steps
  1. renaming known result [Abstract]
    "We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse... Mechanistic analysis traces the directionality to feature density asymmetry... The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence."

    The hypothesis is presented as a formalization of the attractor property, but it directly renames and elevates the observed directional convergence (non-language to language) and its feature-density explanation into a named principle. No separate derivation or predictive test is shown; the 'asymptotic attractor' status is equivalent to the empirical pattern by interpretive construction.

full rationale

The paper's central claim is constructed by observing directional asymmetry via cycle-kNN, attributing it to feature density, invoking the Information Bottleneck for interpretation, and then naming the pattern as the Wittgensteinian Representation Hypothesis. This is interpretive organization of empirical results rather than a mathematical reduction or out-of-sample prediction. No equations, fitted parameters called predictions, or self-citation chains are present in the provided text that would force the result by construction. The analysis remains self-contained as an observational study with post-hoc framing, warranting only moderate circularity concern.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that cycle-kNN validly measures directional alignment and that the Information Bottleneck supplies the correct causal mechanism; the hypothesis itself is the primary novel construct without external falsification.

axioms (2)
  • domain assumption: Cycle-kNN accurately captures directional neighborhood alignment between representation spaces of different modalities
    Invoked to detect the asymmetry invisible to symmetric measures.
  • domain assumption: Optimization under the Information Bottleneck drives representations toward discrete, compositional structures characteristic of language
    Used to interpret why language is the attractor.
invented entities (1)
  • Wittgensteinian Representation Hypothesis (no independent evidence)
    purpose: Formal name and statement that language semantic structure is the asymptotic attractor
    Newly proposed construct based on the directional findings.

pith-pipeline@v0.9.0 · 5485 in / 1284 out tokens · 77373 ms · 2026-05-12T03:34:29.695265+00:00 · methodology


