The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

Chao Tao; Dongyue Wu; Haifeng Li; Jiajie Teng; Jingdong Chen; Run Shao; Zhaoyang Zhang

arxiv: 2605.09352 · v1 · submitted 2026-05-10 · 💻 cs.AI

Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal convergence · directional analysis · representation learning · information bottleneck · language attractor · neural alignment · feature density

The pith

Language's semantic structure serves as the attractor for convergence of representations from other modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why neural networks trained separately on images, point clouds, and text develop aligned internal representations. By introducing an asymmetric measure called cycle-kNN, it reveals that non-language modalities consistently shift their neighborhood structures toward those of language models, while the reverse movement is weaker. This directional pattern appears across model families and scales, remaining hidden under standard symmetric similarity checks. The authors trace the asymmetry to language's more compact feature regions and the effects of information compression during optimization. If the pattern holds, language provides the natural endpoint toward which multimodal representations evolve.

Core claim

Directional analysis with cycle-kNN across dozens of independently trained unimodal models shows non-language modalities move toward the neighborhood structure of language significantly more than the reverse. Mechanistic traces link this to feature density asymmetry, where language occupies the most compact regions of space. The Information Bottleneck framework interprets the directionality as the result of compression favoring discrete, compositional forms. This leads to the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.

What carries the argument

cycle-kNN, an asymmetric alignment measure using cycle-consistent nearest neighbors that exposes directional convergence invisible to symmetric metrics.
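This page does not reproduce the paper's exact definition of cycle-kNN, so the sketch below is not that measure. It is a hypothetical stand-in meant only to illustrate why plain neighborhood overlap is symmetric when both spaces use the same k, and how a score becomes direction-sensitive once source and target neighborhoods are treated differently. The function names, the `k_wide` parameter, and the toy data are all assumptions introduced for illustration.

```python
import math
import random


def knn_indices(points, k):
    """Return, for each point, the indices of its k nearest neighbors (self excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(others[:k])
    return neighbors


def mutual_knn(a, b, k):
    """Symmetric baseline: average overlap of the k-neighborhoods in the two spaces."""
    na, nb = knn_indices(a, k), knn_indices(b, k)
    shared = sum(len(set(x) & set(y)) for x, y in zip(na, nb))
    return shared / (k * len(a))


def directed_overlap(src, dst, k, k_wide):
    """Hypothetical directed score (NOT the paper's cycle-kNN): the share of each
    point's k source neighbors that fall inside its wider k_wide target neighborhood."""
    ns, nd = knn_indices(src, k), knn_indices(dst, k_wide)
    kept = sum(len(set(x) & set(y)) for x, y in zip(ns, nd))
    return kept / (k * len(src))


# Toy paired data: row i of `a` and row i of `b` describe the same item.
rng = random.Random(0)
a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
b = [(x + rng.gauss(0, 0.5), y + rng.gauss(0, 0.5)) for x, y in a]

sym_gap = mutual_knn(a, b, 5) - mutual_knn(b, a, 5)  # always 0.0
dir_gap = directed_overlap(a, b, 5, 10) - directed_overlap(b, a, 5, 10)
print(sym_gap, dir_gap)
```

With equal neighborhood sizes the directed score reduces to mutual kNN, which is why the symmetric gap is identically zero. The directed gap depends on the data and is not guaranteed to have any particular sign.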

If this is right

Multimodal training will favor language-like discrete and compositional structures under continued optimization.
The asymmetry persists uniformly across scales and architectures.
Symmetric similarity measures will continue to miss the underlying direction of convergence.
Information compression objectives inherently bias representations toward language forms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Anchoring new multimodal systems to pretrained language representations could speed up alignment.
The same directional logic may appear in human cross-modal learning when language is involved.
Architectures that deliberately increase feature density in non-language streams could reduce the observed pull.

Load-bearing premise

The observed directional asymmetry arises from language's feature density and compression dynamics rather than from training artifacts, data distributions, or properties of the cycle-kNN measure.

What would settle it

A collection of models or a controlled experiment in which the directional preference disappears once representational density is matched or when a different asymmetric measure is applied.
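Matching representational density presupposes a way to measure it. The Figure 6 caption describes a pairwise mean distance D computed on ℓ2-normalised features. The following is a minimal sketch of that quantity, assuming Euclidean distance between rows (the distance function is not specified on this page) and using hypothetical function names and toy vectors.

```python
import math
from itertools import combinations


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def pairwise_mean_distance(features):
    """Mean Euclidean distance over all unordered pairs of l2-normalized rows.

    Lower values mean the rows sit in a more compact region of the unit sphere.
    Requires at least two rows.
    """
    unit = [l2_normalize(row) for row in features]
    distances = [math.dist(p, q) for p, q in combinations(unit, 2)]
    return sum(distances) / len(distances)


# Toy illustration: three nearly parallel vectors versus three spread-out ones.
compact = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(pairwise_mean_distance(compact), pairwise_mean_distance(spread))
```

Because the rows are normalized first, D is insensitive to embedding norm and reflects only angular spread, which is one reason it can be compared across models with different feature scales.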

Figures

Figures reproduced from arXiv: 2605.09352 by Chao Tao, Dongyue Wu, Haifeng Li, Jiajie Teng, Jingdong Chen, Run Shao, Zhaoyang Zhang.

**Figure 1.** Overview of directional convergence analysis. (a) Symmetric measures (e.g., CKA) detect convergence but cannot reveal its direction. (b) Cycle-kNN is inherently asymmetric; analyzing both directions reveals a consistent directional bias: vision representations approach language more than the reverse (∆ = +0.010, p < 0.05, across all 22 model pairs). (c) Three modalities span an abstraction hierarchy; repre…

**Figure 2.** Directional asymmetry across modality pairs. Bars show the directional gap ∆ = cycle-kNN(A → B) − cycle-kNN(B → A) for each cross-modality combination (k = 10). All ∆ > 0, confirming convergence toward the more abstract modality. Gray markers indicate symmetric measures (CKA, mutual kNN), which yield ∆ ≡ 0 by construction on the same model pairs. ∗∗ p < 0.01, ∗ p < 0.05 (permutation test, n = 1000). We extrac…

**Figure 3.** Systematic directional asymmetry across all model pairs. (a) Cycle-kNN (Vision → Language) and (b) cycle-kNN (Language → Vision) score matrices for all 22 vision × 29 language model pairs (k = 10, WiT-1024 dataset). Both panels share the same color scale. Panel (a) is systematically brighter than (b), confirming that Language neighborhoods are more coherent when probed from Vision. (c) Element-wise differen…

**Figure 4.** Intra-modality representational consensus. Top: pairwise CKA heatmaps (shared color scale). Bottom: violin plots confirm ordering Language > Vision > Point Cloud (p < 0.001, Mann–Whitney U). Consistent with language being the convergence attractor.

**Figure 5.** Scale-invariant directionality. Per-model ∆m vs. parameter count for ten model families. (a) Vision↔Language. (b) PC↔Language. 60/61 combinations (98.4%) have ∆ > 0, confirming scale-invariance.
[Plot axes: Normalized Layer Depth (x), Pairwise Mean Distance D (y); series: Vision (n=4), Language (n=4), Point Cloud (n=3).]

**Figure 6.** Layer-wise representational density. Pairwise mean distance D (on ℓ2-normalised features) across normalised layer depth for representative models from each modality. Vision models (blue) show monotonically increasing D (dense→sparse); language models (red) follow an inverted-U pattern, reaching maximum compactness at later layers; point cloud models (green) show variable density patterns. Bold lines: per-m…

**Figure 7.** Synthetic validation: ∆ increases monotonically with density ratio ρ. Each panel shows a different manifold generator (8 types spanning 1D–3D intrinsic dimensionality). X is a compact reference (σbase noise) and Y is dispersed with noise scaled by ρ ∈ [1, 5]. All curves confirm that ∆ = S(Y→X) − S(X→Y) > 0 once ρ > 1, and increases monotonically, validating that cycle-kNN correctly detects asymmetric nei…

**Figure 8.** Layer-pair cycle-kNN heatmaps for representative model pairs from each cross-modality combination. Each panel shows the cycle-kNN score (color) for all layer combinations between a source model (y-axis) and a target model (x-axis). Top row: Language→Vision (Qwen2-0.5B → ViT-base), Vision→Language (ViT-base → Qwen2-0.5B), and 3D→Language (PointGPT → Qwen2). Bottom row: Language→3D, 3D→Vision, and Vision→3D…

**Figure 9.** k-Sensitivity analysis of directional asymmetry. (a) The directional gap ∆ = S(A→B) − S(B→A) remains positive and stable across k ∈ {1, 3, 5, 10, 20, 50} for all three direction pairs. The sign of ∆ never flips, confirming that the observed directionality is not an artifact of the specific neighborhood size. (b) Permutation-test p-values remain below 0.05 for all conditions, indicating statistical signif…

**Figure 10.**

**Figure 11.** Layer-wise pairwise mean distance D curves for all models, grouped by modality: Language (29 models, left), Vision (22 models, center), and Point Cloud (7 models, right). Individual model curves are shown in light color; the bold curve indicates the modality mean. Language models exhibit an inverted-U profile (compression in final layers), vision models show monotonic increase, and point cloud models main…
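The captions for Figures 2 and 9 report permutation-test p-values for the directional gap ∆ = S(A→B) − S(B→A), but this page does not say which permutation scheme was used. The sketch below assumes a one-sided sign-flip test on per-pair gaps, which is one common choice for paired data. The function names and the toy scores are hypothetical and are not values from the paper.

```python
import random


def directional_gap(scores_ab, scores_ba):
    """Per-pair gap: score in the A->B direction minus score in the B->A direction."""
    return [x - y for x, y in zip(scores_ab, scores_ba)]


def sign_flip_p_value(gaps, n_perm=1000, seed=0):
    """One-sided p-value for mean(gaps) > 0 under random sign flips.

    Under the null hypothesis of no directional preference, each gap is
    equally likely to be positive or negative, so its sign can be flipped.
    """
    rng = random.Random(seed)
    observed = sum(gaps) / len(gaps)
    hits = 0
    for _ in range(n_perm):
        flipped = [g if rng.random() < 0.5 else -g for g in gaps]
        if sum(flipped) / len(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy scores for illustration only (made up, not taken from the paper).
toy_ab = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.34, 0.32]
toy_ba = [0.30, 0.27, 0.33, 0.29, 0.31, 0.28, 0.33, 0.30]
gaps = directional_gap(toy_ab, toy_ba)
print(sum(gaps) / len(gaps), sign_flip_p_value(gaps))
```

A sign-flip test treats model pairs as exchangeable units. If pairs share models, as in a full vision × language grid, the gaps are not independent and a scheme that permutes at the model level may be more appropriate.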

Original abstract

Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales, yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces cycle-kNN, an asymmetric alignment measure, to analyze directional convergence in representations from independently trained unimodal models across point clouds, vision, and language. It reports a consistent asymmetry in which non-language modalities align toward language neighborhood structure more than the reverse, invisible to symmetric measures, and attributes this to language's greater feature density and compactness. Using the Information Bottleneck framework, it interprets the pattern as evidence that language semantic structure acts as an asymptotic attractor, formalizing this as the Wittgensteinian Representation Hypothesis.

Significance. If the directional asymmetry is robustly established and causally tied to intrinsic properties of language representations under compression, the work would offer a new lens on multimodal representation learning and a methodological tool (cycle-kNN) for detecting directionality that symmetric metrics miss. The cross-scale and cross-family consistency is a positive empirical observation, though its interpretation requires stronger controls.

major comments (3)

[Mechanistic analysis] The mechanistic analysis tracing directionality to feature density asymmetry does not include controlled ablations that hold architecture, training objective, and data statistics fixed while varying only modality; without these, confounds from model families, optimization trajectories, or data distributions cannot be ruled out as the source of the observed cycle-kNN asymmetry.
[Results on directional convergence] The results reporting consistent directional asymmetry across dozens of models supply no quantitative metrics, error bars, statistical tests, or controls for cycle-kNN sensitivity to embedding norms or tokenization granularity, leaving the strength of the central claim unclear.
[Discussion and hypothesis formalization] The Wittgensteinian Representation Hypothesis is constructed directly from the directional observations and interpreted via the Information Bottleneck without an independent derivation, out-of-sample prediction, or falsifiable test that would distinguish the attractor claim from alternative explanations.

minor comments (2)

[Methods] Clarify the precise definition and implementation details of cycle-kNN (e.g., choice of k, handling of ties) in the methods section to allow reproducibility.
[Introduction] Add explicit references to prior work on the Information Bottleneck in representation learning and on asymmetric similarity measures to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the strengths and areas for improvement in our work. We address each major comment point by point below, providing our response and indicating planned revisions where appropriate.

Point-by-point responses

Referee: The mechanistic analysis tracing directionality to feature density asymmetry does not include controlled ablations that hold architecture, training objective, and data statistics fixed while varying only modality; without these, confounds from model families, optimization trajectories, or data distributions cannot be ruled out as the source of the observed cycle-kNN asymmetry.

Authors: We acknowledge the value of fully isolated ablations, but note that such controls are inherently limited by the distinct nature of modalities (e.g., point clouds vs. images vs. text require different data collection, preprocessing, and model architectures). Our empirical design instead leverages diversity: the directional asymmetry is observed consistently across multiple independent model families and scales per modality, which reduces the likelihood of family-specific or trajectory-specific artifacts. We will revise the mechanistic section to explicitly discuss these potential confounds, provide additional details on feature density measurement, and include sensitivity checks where feasible. However, we cannot perform the exact controlled experiments requested without new data collection outside the current scope.
Revision: partial

Referee: The results reporting consistent directional asymmetry across dozens of models supply no quantitative metrics, error bars, statistical tests, or controls for cycle-kNN sensitivity to embedding norms or tokenization granularity, leaving the strength of the central claim unclear.

Authors: We agree that adding quantitative rigor will strengthen the claims. In the revised manuscript, we will report average cycle-kNN asymmetry values with standard deviations and error bars across the model sets, include statistical tests (e.g., paired Wilcoxon signed-rank tests) to assess significance of the directional effect, and add controls by analyzing normalized embeddings and varying tokenization granularity for language models. These updates will be incorporated into the results and methods sections.
Revision: yes

Referee: The Wittgensteinian Representation Hypothesis is constructed directly from the directional observations and interpreted via the Information Bottleneck without an independent derivation, out-of-sample prediction, or falsifiable test that would distinguish the attractor claim from alternative explanations.

Authors: The hypothesis is motivated by the observed pattern and the Information Bottleneck as an interpretive lens rather than a standalone derivation. To address this, we will add a more formal mathematical statement of the hypothesis, outline specific falsifiable predictions (such as convergence patterns for modalities with controlled feature densities or under varying compression), and discuss how to distinguish the attractor account from alternatives like data-distribution effects. These elements will be added to the discussion section.
Revision: yes

standing simulated objections not resolved

Fully controlled ablations that hold architecture, training objective, and data statistics fixed while varying only modality are not feasible in this study due to fundamental differences in how data and models are constructed for each modality.

Circularity Check

1 step flagged

Hypothesis formalization restates observed asymmetry without independent derivation

specific steps

renaming known result [Abstract]
"We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse... Mechanistic analysis traces the directionality to feature density asymmetry... The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence."

The hypothesis is presented as a formalization of the attractor property, but it directly renames and elevates the observed directional convergence (non-language to language) and its feature-density explanation into a named principle. No separate derivation or predictive test is shown; the 'asymptotic attractor' status is equivalent to the empirical pattern by interpretive construction.

full rationale

The paper's central claim is constructed by observing directional asymmetry via cycle-kNN, attributing it to feature density, invoking the Information Bottleneck for interpretation, and then naming the pattern as the Wittgensteinian Representation Hypothesis. This is interpretive organization of empirical results rather than a mathematical reduction or out-of-sample prediction. No equations, fitted parameters called predictions, or self-citation chains are present in the provided text that would force the result by construction. The analysis remains self-contained as an observational study with post-hoc framing, warranting only moderate circularity concern.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that cycle-kNN validly measures directional alignment and that the Information Bottleneck supplies the correct causal mechanism; the hypothesis itself is the primary novel construct without external falsification.

axioms (2)

domain assumption: Cycle-kNN accurately captures directional neighborhood alignment between representation spaces of different modalities.
Invoked to detect the asymmetry invisible to symmetric measures.
domain assumption: Optimization under the Information Bottleneck drives representations toward discrete, compositional structures characteristic of language.
Used to interpret why language is the attractor.
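For orientation, the Information Bottleneck objective in its standard form (Tishby, Pereira, and Bialek) is given below. This page does not state which variant the paper uses, so the symbols are the textbook ones rather than the paper's notation: $X$ is the input, $Y$ the prediction target, $T$ the learned representation, and $\beta > 0$ trades compression against prediction.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

The first term penalizes information that $T$ retains about $X$, and the second rewards information that $T$ carries about $Y$. The step from this compression pressure to discrete, compositional structure is the paper's interpretive claim and does not follow from the formula alone.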

invented entities (1)

Wittgensteinian Representation Hypothesis (no independent evidence)
purpose: Formal name and statement that language semantic structure is the asymptotic attractor.
Newly proposed construct based on the directional findings.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt echoes

language representations occupy the most compact regions of representational space
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction echoes

WRH: the semantic structure of language is the asymptotic attractor of multimodal representation convergence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

[1]

International Conference on Machine Learning (ICML) , year=

The Platonic Representation Hypothesis , author=. International Conference on Machine Learning (ICML) , year=

work page
[2]

arXiv preprint arXiv:2501.15652 , year=

Indra's Net: the Interplay Between Perception and Reasoning Representations in Multimodal Models , author=. arXiv preprint arXiv:2501.15652 , year=

work page arXiv
[3]

The semantic hub hypothesis: Lan- guage models share semantic representations across languages and modalities.arXiv preprint arXiv:2411.04986, 2024

The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities , author=. arXiv preprint arXiv:2411.04986 , year=

work page arXiv
[4]

International Conference on Machine Learning (ICML) , year=

Similarity of Neural Network Representations Revisited , author=. International Conference on Machine Learning (ICML) , year=

work page
[5]

Frontiers in Systems Neuroscience , volume=

Representational Similarity Analysis -- Connecting the Branches of Systems Neuroscience , author=. Frontiers in Systems Neuroscience , volume=

work page
[6]

The information bottleneck method

The Information Bottleneck Method , author=. arXiv preprint physics/0004057 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Neural Computation , volume=

The Deterministic Information Bottleneck , author=. Neural Computation , volume=

work page
[8]

International Conference on Learning Representations (ICLR) , year=

Similarity of Neural Network Models: A Survey of Functional and Representational Measures , author=. International Conference on Learning Representations (ICLR) , year=

work page
[9]

PNAS Nexus , volume=

Ranking the Information Content of Distance Measures , author=. PNAS Nexus , volume=

work page
[10]

arXiv preprint arXiv:2505.17101 , year=

Connecting the Dots: Representation Convergence in Unimodal Models , author=. arXiv preprint arXiv:2505.17101 , year=

work page internal anchor Pith review arXiv
[11]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[12]

1921 , publisher=

Tractatus Logico-Philosophicus , author=. 1921 , publisher=

work page 1921
[13]

International Conference on Machine Learning (ICML) , year=

Learning Transferable Visual Models From Natural Language Supervision , author=. International Conference on Machine Learning (ICML) , year=

work page
[14]

Transactions on Machine Learning Research , year=

DINOv2: Learning Robust Visual Features without Supervision , author=. Transactions on Machine Learning Research , year=

work page
[15]

LLaMA: Open and Efficient Foundation Language Models

LLaMA: Open and Efficient Foundation Language Models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Crosslingual generalization through multitask ﬁnetuning

Crosslingual Generalization through Multitask Finetuning , author=. arXiv preprint arXiv:2211.01786 , year=

work page arXiv
[17]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

International Conference on Learning Representations (ICLR) , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations (ICLR) , year=

work page
[19]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Masked Autoencoders Are Scalable Vision Learners , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[20]

Advances in Neural Information Processing Systems (NeurIPS) , year=

PointGPT: Auto-regressively Generative Pre-training from Point Clouds , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[21]

European Conference on Computer Vision (ECCV) , year=

Masked Autoencoders for Point Cloud Self-supervised Learning , author=. European Conference on Computer Vision (ECCV) , year=

work page
[22]

Opening the Black Box of Deep Neural Networks via Information

Opening the Black Box of Deep Neural Networks via Information , author=. arXiv preprint arXiv:1703.00810 , year=

work page Pith review arXiv
[23]

Journal of Statistical Mechanics: Theory and Experiment , year=

On the Information Bottleneck Theory of Deep Learning , author=. Journal of Statistical Mechanics: Theory and Experiment , year=

work page
[24]

Advances in Neural Information Processing Systems (NeurIPS) , year=

The Indra Representation Hypothesis for Multimodal Alignment , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[25]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

The Universal Normal Embedding , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[26]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Closeness in Distribution Does Not Imply Representation Similarity , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[27]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Blind Match: Visual-Language Correspondence Without Parallel Data , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[28]

Advances in Neural Information Processing Systems (NeurIPS) , year=

STRUCTURE: Aligning Representations with Limited Paired Data , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[29]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Harnessing Frozen Unimodal Encoders for Multimodal Alignment , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[30]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Learning Shared Representations from Unpaired Multimodal Data , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[31]

International Conference on Learning Representations (ICLR) , year=

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models , author=. International Conference on Learning Representations (ICLR) , year=

work page
[32]

International Conference on Machine Learning (ICML) , year=

Aligning Multimodal Representations via Information Bottleneck , author=. International Conference on Machine Learning (ICML) , year=

work page
[33]

International Conference on Machine Learning (ICML) , year=

Understanding the Emergence of Multimodal Representation Alignment , author=. International Conference on Machine Learning (ICML) , year=

work page
[34]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

ConFu: Higher-Order Contrastive Fusion for Multimodal Alignment , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[35]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

An Omnivorous Vision Encoder , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[36]

International Conference on Machine Learning (ICML) , year=

Functional Alignment Can Mislead , author=. International Conference on Machine Learning (ICML) , year=

work page
[37]

International Conference on Machine Learning (ICML) , year=

Universal Statistical Structure of Natural Datasets , author=. International Conference on Machine Learning (ICML) , year=

work page
[38]

International Conference on Learning Representations (ICLR) , year=

Representational Alignment Between Supervised and Self-Supervised Contrastive Learning , author=. International Conference on Learning Representations (ICLR) , year=

work page
[39]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[40]

European Conference on Computer Vision (ECCV) , year=

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction , author=. European Conference on Computer Vision (ECCV) , year=

work page
[41]

DINOv3

DINOv3 , author=. arXiv preprint arXiv:2508.10104 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

arXiv preprint , year=

Qwen3 Technical Report , author=. arXiv preprint , year=

work page
[43]

InternLM2 Technical Report

InternLM2 Technical Report , author=. arXiv preprint arXiv:2403.17297 , year=

work page internal anchor Pith review arXiv
[44]

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , year=

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning , author=. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , year=

work page
[45]

ShapeNet: An Information-Rich 3D Model Repository

ShapeNet: An Information-Rich 3D Model Repository , author=. arXiv preprint arXiv:1512.03012 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

International Conference on Learning Representations (ICLR) , year=

Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth , author=. International Conference on Learning Representations (ICLR) , year=

work page
[47]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[48]

International Conference on Machine Learning (ICML) , year=

Training Objective Drives Representation Similarity Consistency Across Datasets , author=. International Conference on Machine Learning (ICML) , year=

work page
[49]

International Conference on Machine Learning (ICML) , year=

The Butterfly Effect in Model Training , author=. International Conference on Machine Learning (ICML) , year=

work page
[50]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Multi-modal Contrastive Learning: Intrinsic Dimension and Temperature Selection , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[51] Scaling Language-centric Omnimodal Representation Learning. Advances in Neural Information Processing Systems (NeurIPS).
[52] Towards a Learning Theory of Representation Alignment. International Conference on Learning Representations (ICLR).
[53] Deep Learning and the Information Bottleneck Principle. IEEE Information Theory Workshop (ITW).
