Geometric Stability: The Missing Axis of Representations

Prashant C. Raju

arXiv: 2601.09173 · v4 · submitted 2026-01-14 · cs.LG · cs.CL · q-bio.QM · stat.ML

Pith reviewed 2026-05-16 14:57 UTC · model grok-4.3

keywords: geometric stability · Shesha · representational similarity · CKA · manifold compression · neural representations · model evaluation · split-half correlation

The pith

Geometric stability measures how reliably a representation's pairwise distance structure holds under perturbation, separate from similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard tools like CKA only check alignment between representational spaces but miss whether the internal geometry is robust. It introduces geometric stability as a distinct quality, measured by Shesha through split-half correlations of representational dissimilarity matrices built from complementary feature subsets. This approach detects compression damage to manifold structure because Shesha changes under orthogonal transformations that leave similarity metrics unchanged. Tests across 2463 encoders in seven domains show stability and similarity are uncorrelated, with geometry-preserving changes making them redundant and compression making them anti-correlated. When applied to pretrained models, this independence reveals that top transfer performers often carry a geometric tax of low stability.

Core claim

Shesha quantifies geometric stability as the split-half correlation of RDMs from complementary feature subsets. Unlike CKA and Procrustes, Shesha is not invariant to orthogonal transformations of the feature space, so it registers compression-induced damage to distance structure that similarity metrics overlook. Spectral analysis shows stability retains sensitivity across the eigenspectrum after top components are removed. Across domains, stability and similarity prove empirically independent, arising from opposing effects of different transformations.

What carries the argument

Shesha, the split-half correlation of representational dissimilarity matrices from complementary feature subsets, which tracks self-consistency of pairwise distances under feature perturbation.
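As a rough illustration of the definition above, the following minimal sketch computes a split-half RDM correlation with NumPy. The Euclidean distance, the Spearman rank correlation, the single random 50/50 feature split, and the function names (`rdm_vector`, `spearman`, `shesha_split_half`) are all assumptions made for illustration; this page does not specify the paper's exact choices.

```python
import numpy as np

def rdm_vector(X):
    """Upper triangle of the pairwise Euclidean distance matrix (a condensed RDM)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d = np.sqrt(np.clip(d2, 0.0, None))
    return d[np.triu_indices(X.shape[0], k=1)]

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def shesha_split_half(X, rng):
    """Correlate the RDMs built from two complementary random halves of the features."""
    perm = rng.permutation(X.shape[1])
    half = X.shape[1] // 2
    return spearman(rdm_vector(X[:, perm[:half]]), rdm_vector(X[:, perm[half:]]))

rng = np.random.default_rng(0)

# Toy data (hypothetical): 60 items whose 32 features share a 4-dimensional latent structure.
latent = rng.normal(size=(60, 4))
structured = latent @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(60, 32))

# Control: 32 independent noise features with no shared structure.
noise = rng.normal(size=(60, 32))

s_struct = shesha_split_half(structured, rng)
s_noise = shesha_split_half(noise, rng)
print(f"structured: {s_struct:.3f}  noise: {s_noise:.3f}")
```

In this toy setting the two halves of the structured data yield similar distance structure, while the halves of the pure-noise control do not, which is the self-consistency notion the definition describes.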

Load-bearing premise

Split-half correlations of RDMs from feature subsets meaningfully quantify robustness to general perturbations rather than capturing only subset-specific artifacts.

What would settle it

An orthogonal transformation of the feature space that preserves all pairwise distances but alters manifold curvature, after which Shesha scores change while CKA remains fixed.
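A test of this kind can be sketched on toy data. The example below applies a random orthogonal matrix (drawn via QR decomposition) and checks three things: pairwise distances are unchanged, linear CKA against a second representation is unchanged, and a fixed-split score changes. The fixed first-half/second-half split, the Euclidean RDM, the Spearman correlation, and the toy data, in which signal is deliberately concentrated in the first half of the features so the effect is large, are all assumptions for illustration. The direction and size of the change depend on the data.

```python
import numpy as np

def rdm_vector(X):
    """Condensed Euclidean RDM (upper triangle of the distance matrix)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.clip(d2, 0.0, None))[np.triu_indices(X.shape[0], k=1)]

def spearman(a, b):
    """Spearman rho via Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def fixed_split_shesha(X):
    """Split-half RDM correlation with a FIXED split: first half vs. second half."""
    half = X.shape[1] // 2
    return spearman(rdm_vector(X[:, :half]), rdm_vector(X[:, half:]))

def linear_cka(X, Y):
    """Linear CKA between two representations of the same items."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(1)
n, d = 60, 32

# Hypothetical data: structured signal in the first 16 features, noise in the last 16.
signal = rng.normal(size=(n, 4)) @ rng.normal(size=(4, d // 2))
X = np.hstack([signal, 0.5 * rng.normal(size=(n, d // 2))])
Y = X + 0.5 * rng.normal(size=(n, d))  # a second representation to compare against

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
XQ = X @ Q

dist_preserved = np.allclose(rdm_vector(X), rdm_vector(XQ))
cka_before, cka_after = linear_cka(X, Y), linear_cka(XQ, Y)
shesha_before, shesha_after = fixed_split_shesha(X), fixed_split_shesha(XQ)

print("distances preserved:", dist_preserved)
print(f"CKA    before/after rotation: {cka_before:.4f} / {cka_after:.4f}")
print(f"Shesha before/after rotation: {shesha_before:.3f} / {shesha_after:.3f}")
```

Because the full-space distances are identical before and after the rotation, any change in the split-half score here comes from how the rotation redistributes information across the two fixed halves.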

Figures

Figures reproduced from arXiv: 2601.09173 by Prashant C. Raju.

**Figure 1.** Stability and similarity are independent dimensions of representational geometry. (a) Spectral Sensitivity: CKA (red) collapses after removing just the single top principal component, while Shesha (blue) retains sensitivity to the spectral tail. CKA measures dominant variance; Shesha measures full manifold geometry. (b) Universality: Across 2,463 encoder configurations spanning seven domains, Shesha and CK…

**Figure 2.** … At n = 400, scores ranged from 0.254 (vit_tiny_patch16_224 on CIFAR-100) to 0.803 (efficientnet_b2 on CIFAR-10) with mean 0.622. At n = 1600, scores ranged from 0.288 (vit_tiny_patch16_224 on CIFAR-100) to 0.790 (swin_tiny on CIFAR-10) with mean 0.622, showing negligible change in the distribution. The mean absolute drift across all 30 model-dataset combinations was $|\bar{\Delta}| = 0.0115$, well below the 0.05 …

**Figure 3.** Model Leaderboard. Ranking of 15 architectures by Shesha score (feature split). Bar segments show contributions from CIFAR-10 (teal) and CIFAR-100 (blue). Modern architectures with attention or dense connectivity achieve higher geometric stability.

**Figure 4.** Seed Stability. Comparison of Shesha scores computed with two different random seeds (Seed A = 100 vs. Seed B = 200). Points align closely with the diagonal identity line, indicating high reproducibility across random initializations.

**Figure 5.** Spectral Sensitivity Analysis. We measure metric responses as the top k principal components are progressively removed from a power-law representation. (A) Shesha degrades gracefully while all similarity metrics (CKA, PWCKA, Procrustes) collapse after removing just 1 PC. (B) Comparison with whitened Shesha shows high correlation (ρ = 0.999), though whitening reduces baseline stability. (C) Shesha robustnes…

**Figure 6.** Metric Dissociation. A scatter plot of Shesha vs. Debiased CKA using balanced sampling across four stability/similarity quadrants. The presence of distinct clusters, particularly the High Stability/Low Similarity quadrant (blue), confirms that stability is mathematically distinct from similarity. The low correlation (ρ = 0.20) indicates that Shesha measures intrinsic geometric consistency largely independe…

**Figure 7.** Construct Validity: Ground Truth Recovery. Shesha scores plotted against parametrically controlled stability levels (signal-to-noise ratio) in synthetic representations. The metric shows a near-perfect monotonic response (ρ = 0.990) to the underlying ground truth, confirming high sensitivity to geometric consistency.

**Figure 8.** Experimental Logic for Metric Independence. We employ a double dissociation strategy to validate Shesha against CKA. Following the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 2002), random projections (top) preserve geometric structure (high Stability) while discarding specific semantic features (low Similarity). Conversely, aggressive PCA (bottom) preserves dominant …

**Figure 9.** Task-aligned stability is required for real-world control. Comparison of supervised Shesha (task-aligned geometric consistency) versus unsupervised stability (feature-partition consistency) in predicting steering effectiveness. A significant gap: unsupervised stability predicts steering in the synthetic setting (ρ = 0.77) where the data manifold is fully aligned with the task structure but completely fail…

**Figure 10.** Negative controls validate methodology. (A) Shuffled label control: Supervised Shesha computed with true labels (dark) versus randomly permuted labels (light). The complete collapse of Shesha under label shuffling (0.60 → −0.001 for Synthetic, 0.23 → −0.001 for SST-2, 0.02 → −0.001 for MNLI; all $p < 10^{-10}$) confirms that the metric captures genuine task-relevant structure rather than spurious geometric pat…

**Figure 11.** Model characteristics associated with steerability. Top 5 (colored) and bottom 5 (gray) models ranked by steering effectiveness (max_drop) for SST-2 (left) and MNLI (right). Consistent patterns emerge across tasks: the most steerable models are from the BGE, E5, and GTE families, all trained with supervised contrastive objectives. The least steerable models are unsupervised variants (unsup-simcse, e5-base…

**Figure 12.** Geometric stability predicts linear steerability across all experimental settings. Scatter plots show supervised Shesha (computed on held-out Set A) versus steering effectiveness (max_drop, evaluated on disjoint Set B) for each model. (A) Synthetic sentiment data (n = 69 models): ρ = 0.894, $p < 10^{-24}$. (B) SST-2 binary sentiment (n = 35 models): ρ = 0.962, $p < 10^{-19}$. (C) MNLI ternary NLI (n = 35 models):…

**Figure 13.** Shesha captures unique variance beyond class separability. Comparison of raw Spearman correlations (solid bars) and partial correlations controlling for Fisher discriminant and silhouette score (hatched bars). While Shesha and Fisher show similar raw correlations with steering effectiveness, Shesha maintains large partial correlations (ρ = 0.62-0.76, all p < 0.001) after controlling for separability. This…

**Figure 14.** Correlation Structure Across All Datasets. Spearman correlation heatmaps for all six datasets. [Plot residue: scatter panels of SHESHA-Var and SHESHA-FS with DINOv2, CLIP, and EVA models marked; r = -0.01 (CIFAR-10), r = -0.24* (CIFAR-100), r = -0.14 (Flowers-10…)]

**Figure 15.** Shesha-Var vs. Shesha-FS Across Datasets. DINOv2 models (red) cluster in the high-Var/low-FS region; CLIP models (blue) maintain high stability.

**Figure 16.** Metric Distributions Across Datasets. Violin plots comparing distributions.

**Figure 17.** Cross-Dataset Correlation Comparison. Spearman correlations between metric pairs across all datasets.

**Figure 18.** Architecture Family Performance Across Datasets. Mean Shesha-Var (left) and Shesha-FS (right) by family.

**Figure 19.** Cross-Dataset Rank Stability. Scatter plots comparing Shesha-FS rankings between dataset pairs.

**Figure 20.** Geometric Stability Heatmap: Family × Dataset. Red = unstable, green = stable.

**Figure 21.** Post-training drift varies substantially across model families. Geometric drift between base and instruction-tuned model pairs, measured by Shesha and CKA, aggregated by model family (23 pairs total). The Shesha/CKA ratio ranges from 1.1× (BLOOM) to 5.2× (Llama), indicating that Shesha consistently detects greater representational reorganization than CKA. Families with larger ratios exhibit more distribut…

**Figure 22.** Drift magnitude varies by prompt type. Mean geometric drift across 23 base-instruct pairs, stratified by prompt category. Factual and descriptive prompts induce the largest Shesha/CKA ratios (2.37× and 2.28×), while instruction prompts show the smallest ratio (1.44×). This pattern suggests that instruction tuning most strongly reshapes representations for instruction-following inputs (reducing the Shesha-…

**Figure 23.** Metric response to Gaussian noise perturbation. Mean drift across 16 causal LMs as noise magnitude increases (σ ∈ [0, 0.5]). Shesha exhibits the steepest response curve, reaching 71% drift at σ = 0.5 compared to 43% for CKA and 42% for Procrustes. At low noise levels (σ < 0.1), Procrustes shows elevated sensitivity relative to Shesha and CKA, foreshadowing the false alarm behavior characterized in Experim…

**Figure 24.** Drift scales with LoRA initialization magnitude. Mean drift across 16 causal LMs as LoRA initialization scale increases (rank fixed at 8). All metrics exhibit exponential growth with init scale, but Procrustes maintains a consistent offset above Shesha and CKA across the entire range. At minimal perturbation (init scale = $10^{-3}$), Procrustes already registers detectable drift while Shesha and CKA remain ne…

**Figure 25.** Predictive validity of geometric drift metrics. Scatter plots showing the relationship between drift magnitude and functional degradation (accuracy drop) across 26 sentence embedding models under Gaussian noise perturbation (σ ∈ [0.01, 0.5]). All three metrics exhibit strong correlation with accuracy loss: Shesha (ρ = 0.927), CKA (ρ = 0.937), and Procrustes (ρ = 0.935). Each point represents one model at…

**Figure 26.** Drift trajectories across noise levels. Evolution of Shesha, CKA, Procrustes, and accuracy drop as Gaussian noise magnitude increases (σ ∈ [0, 0.5]) for four representative sentence embedding models. The horizontal dashed line indicates a 5% detection threshold. Procrustes consistently exceeds this threshold at lower noise levels than Shesha or CKA, illustrating its heightened sensitivity to geometric pe…

**Figure 27.** Shesha provides earlier warning than CKA. (Left) Distribution of which metric first exceeded the 5% detection threshold across 26 sentence embedding models. Shesha detected drift earlier in 73% of cases (19/26), with the remaining 27% tied; CKA never detected first. (Right) Among the 19 cases where metrics diverged, Shesha achieved a 100% win rate. This early warning advantage stems from Shesha's equal we…

**Figure 28.** ROC analysis for drift detection on the LoRA perturbation benchmark. All metrics …

**Figure 29.** False alarm analysis reveals Procrustes oversensitivity. (a) In the stable regime (accuracy drop < 1%), Procrustes triggers false alarms in 44% of cases compared to only 7.3% for Shesha and CKA (a 6× difference). (b) At minimal perturbation where functional performance is unchanged, Procrustes reports 1.50% drift versus 0.04% for Shesha (a 37× inflation). This demonstrates that Procrustes detects rigid ge…

**Figure 30.** Transfer experiments summary. (A) Experiment 1 results for Label-RDM Alignment. (B) …

**Figure 31.**

**Figure 32.** Stability-magnitude correlation is robust across distance metrics. Bar chart showing Spearman correlations with 95% bootstrap CIs (error bars) for three distance computation methods: Euclidean (standard L2 in PCA space), Whitened (Mahalanobis-scaled coordinates), and k-NN (local control centroids). All methods achieve strong correlations (ρ > 0.74) across all datasets. Notably, whitening substantially imp…

**Figure 33.** Combinatorial perturbations exhibit higher geometric stability than single-gene perturbations. Violin plots showing stability distributions for single-gene versus combinatorial (multi-gene) perturbations in Norman et al. (CRISPRa) and Dixit et al. (CRISPRi). Combinatorial perturbations show significantly higher stability in both datasets (Mann-Whitney U, $p < 10^{-9}$), suggesting that multi-target interventi…

**Figure 34.** Discordant perturbations reveal regulatory specificity. Stability versus magnitude for all perturbations in Norman et al. (2019), with CEBP family members (CEBPA, CEBPB, CEBPE) and KLF1 combinations highlighted. CEBP perturbations cluster below the trend line (lower stability relative to their high magnitude), consistent with CEBPA's known role as a pleiotropic master regulator. KLF1 perturbations cluster…

**Figure 35.** Geometric Instability of CEBPA. Unlike the coherent arrest seen in KLF1, CEBPA targets drive orthogonal programs. $\vec{v}_{\text{metab}}$ pushes the cell state out of the differentiation manifold (blue tube), resulting in a geometrically incoherent state ($\theta \gg 0$) that cannot be mapped back to a valid lineage trajectory.

**Figure 36.** Geometric Coherence of KLF1. Unlike CEBPA, the downstream targets of KLF1 (globins, cell cycle) generate vectors ($\vec{v}_{\text{globin}}$, $\vec{v}_{\text{cycle}}$) that are locally collinear with the erythroid differentiation manifold (blue tube). Consequently, perturbation results in a magnitude shift along the trajectory (red vector, differentiation arrest) rather than an incoherent expansion into off-manifold space ($\theta \approx 0^\circ$). This …

**Figure 37.** Geometric stability predicts neural-behavioral coupling. Each point represents one brain area in one session (n = 228). Geometric stability (Shesha) correlates significantly with trial-by-trial neural-behavioral coupling (ρ = 0.18, p = 0.005), indicating that regions with more stable representational geometry show tighter correspondence between neural state magnitude and behavioral outcome. Points are col…

**Figure 38.** Regional hierarchy of geometric vs. temporal stability. (A) Geometric stability (Shesha) is highest in action-related regions (Striatum, Motor) and lowest in Hippocampus. (B) Temporal stability (centroid similarity) shows an opposing pattern, with sensory regions (Thalamus, Visual) most stable and Striatum least stable. This dissociation indicates that geometric and temporal stability capture independent …

Original abstract

Representational similarity analysis and related methods have become standard tools for comparing the internal geometries of neural networks and biological systems. These methods measure what is represented, the alignment between two representational spaces, but not whether that structure is robust. We introduce geometric stability, a distinct dimension of representational quality that quantifies how reliably a representation's pairwise distance structure holds under perturbation. Our metric, Shesha, measures self-consistency through split-half correlation of representational dissimilarity matrices constructed from complementary feature subsets. A key formal property distinguishes stability from similarity: Shesha is not invariant to orthogonal transformations of the feature space, unlike CKA and Procrustes, enabling it to detect compression-induced damage to manifold structure that similarity metrics cannot see. Spectral analysis reveals the mechanism: similarity metrics collapse after removing the top principal component, while stability retains sensitivity across the eigenspectrum. Across 2463 encoder configurations in seven domains -- language, vision, audio, video, protein sequences, molecular profiles, and neural population recordings -- stability and similarity are empirically uncorrelated ($\rho=-0.01$). A regime analysis shows this independence arises from opposing effects: geometry-preserving transformations make the metrics redundant, while compression makes them anti-correlated, canceling in aggregate. Applied to 94 pretrained models across 6 datasets, stability exposes a "geometric tax": DINOv2, the top-performing model for transfer learning, ranks last in geometric stability on 5/6 datasets. Contrastive alignment and hierarchical architecture predict stability, providing actionable guidance for model selection in deployment contexts where representational reliability matters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Shesha adds a distinct stability axis via split-half RDM correlation, but its non-invariance to orthogonal transforms likely stems from fixed feature partitioning rather than intrinsic manifold properties.

The main thing to know is that this paper defines geometric stability as a separate axis from representational similarity and introduces Shesha, a split-half correlation of RDMs on complementary feature subsets, to measure it. They report near-zero correlation with CKA across 2463 encoder setups in seven domains and show that top transfer models like DINOv2 rank low on stability, calling it a geometric tax.

The scale of the empirical sweep and the regime analysis separating geometry-preserving from compression cases are the clearest strengths; those numbers give the independence claim some weight and point to practical model-selection implications when reliability matters more than alignment alone. The formal distinction that Shesha is not orthogonally invariant is presented cleanly and lets them argue it can see compression damage that similarity metrics miss.

The soft spot is exactly where the stress-test flags it: because pairwise distances are preserved under orthogonal transforms, any non-invariance has to come from the arbitrary fixed split of dimensions into halves. That makes Shesha sensitive to which features are grouped together, so the metric could be picking up subset artifacts instead of general robustness to perturbations. The paper would need explicit checks that results hold under random or multiple splits and that stability scores predict actual downstream reliability under noise or compression. Without those, the claim that it detects manifold damage beyond what CKA sees remains provisional.

This is worth a serious referee for people working on representation evaluation and deployment robustness. The experiments are broad enough to justify review even if the metric definition needs tightening.
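The multiple-split check asked for in the letter could look roughly like the sketch below, which recomputes a split-half score over many random feature partitions and reports the spread. The Euclidean RDM, Spearman correlation, the number of splits, the toy data, and the helper name `shesha_over_splits` are assumptions for illustration, not the paper's protocol.

```python
import numpy as np

def rdm_vector(X):
    """Condensed Euclidean RDM (upper triangle of the distance matrix)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.clip(d2, 0.0, None))[np.triu_indices(X.shape[0], k=1)]

def spearman(a, b):
    """Spearman rho via Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def shesha_over_splits(X, n_splits, seed):
    """Split-half RDM correlation for `n_splits` independent random feature partitions."""
    rng = np.random.default_rng(seed)
    half = X.shape[1] // 2
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(X.shape[1])
        scores.append(spearman(rdm_vector(X[:, perm[:half]]),
                               rdm_vector(X[:, perm[half:]])))
    return np.array(scores)

# Hypothetical representation: 50 items, 24 features sharing a 3-dimensional latent structure.
data_rng = np.random.default_rng(3)
X = (data_rng.normal(size=(50, 3)) @ data_rng.normal(size=(3, 24))
     + 0.3 * data_rng.normal(size=(50, 24)))

scores = shesha_over_splits(X, n_splits=50, seed=0)
summary = {
    "mean": float(scores.mean()),
    "std": float(scores.std(ddof=1)),
    "min": float(scores.min()),
    "max": float(scores.max()),
}
print(summary)
```

A small spread across partitions would suggest the score is not driven by one particular grouping of features; a large spread would point toward the subset artifacts the letter worries about.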

Referee Report

3 major / 2 minor

Summary. The paper introduces geometric stability as a distinct dimension of representational quality separate from similarity, quantified by the Shesha metric: split-half correlation of representational dissimilarity matrices (RDMs) built from complementary feature subsets. It claims Shesha is not invariant to orthogonal transformations of the feature space (unlike CKA and Procrustes), enabling detection of compression-induced manifold damage. Spectral analysis is said to show similarity metrics collapsing after top-PC removal while stability retains eigenspectrum sensitivity. Large-scale experiments on 2463 encoders across seven domains report near-zero correlation (ρ=-0.01) between stability and similarity, with regime analysis attributing this to opposing effects under geometry-preserving vs. compression transformations. Applied to 94 pretrained models, stability reveals a 'geometric tax' where DINOv2 ranks last on 5/6 datasets, and contrastive/hierarchical designs predict higher stability.

Significance. If the core distinction holds, the work provides a new evaluation axis for representational reliability under perturbation, with potential value for model selection in deployment settings where robustness matters. The scale of the empirical study (multiple domains, pretrained-model sweep) and the reported uncorrelation are strengths that could influence how the field assesses learned geometries beyond alignment metrics.

major comments (3)

[Abstract] Abstract: The central claim that Shesha detects compression-induced damage to manifold structure because it is not invariant to orthogonal transformations rests on the fixed partitioning of features into complementary subsets. Because pairwise distances are preserved under orthogonal transforms, any observed non-invariance arises solely from the arbitrary grouping of dimensions; an orthogonal rotation mixes dimensions and changes which pairs fall into each half. Without evidence that this split-specific sensitivity corresponds to intrinsic geometric properties rather than partitioning artifacts, the claimed distinction from CKA/Procrustes does not establish that Shesha quantifies general robustness to perturbations.
[Abstract] Abstract (spectral analysis paragraph): The statement that 'similarity metrics collapse after removing the top principal component, while stability retains sensitivity across the eigenspectrum' is presented as revealing the mechanism, yet the manuscript provides no explicit derivation, perturbation protocol, or error analysis for how the split-half RDM correlation behaves under progressive PC removal. This makes it impossible to verify whether the retained sensitivity is a genuine property of the metric or an artifact of the complementary-subset construction.
[Abstract] Abstract (regime analysis): The claim that independence arises from 'opposing effects' (geometry-preserving transformations making metrics redundant, compression making them anti-correlated) is load-bearing for the uncorrelation result (ρ=-0.01). The text does not specify the exact transformations, compression levels, or statistical controls used to isolate these regimes, leaving open whether the cancellation is robust or sensitive to the particular choice of splits and datasets.
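The progressive PC-removal check raised in the second major comment can be sketched on synthetic data. The example below builds a representation with a power-law spectrum, projects out the top k principal directions, and tracks linear CKA against the original together with a split-half score averaged over random partitions. The spectrum exponent, the random feature basis, the Euclidean RDM, the Spearman correlation, and every function name are assumptions for illustration. For linear CKA the effect of removal has a closed form in the singular values $s_i$ of the centered data, $\mathrm{CKA}_k = \sqrt{\sum_{i>k} s_i^4 / \sum_i s_i^4}$, which the code checks numerically. No particular behavior of the split-half score is asserted, since it depends on how the representation is constructed.

```python
import numpy as np

def rdm_vector(X):
    """Condensed Euclidean RDM (upper triangle of the distance matrix)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.clip(d2, 0.0, None))[np.triu_indices(X.shape[0], k=1)]

def spearman(a, b):
    """Spearman rho via Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def linear_cka(X, Y):
    """Linear CKA between two representations of the same items."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

def mean_split_shesha(X, rng, n_splits=20):
    """Split-half RDM correlation averaged over random feature partitions."""
    half = X.shape[1] // 2
    vals = []
    for _ in range(n_splits):
        p = rng.permutation(X.shape[1])
        vals.append(spearman(rdm_vector(X[:, p[:half]]), rdm_vector(X[:, p[half:]])))
    return float(np.mean(vals))

rng = np.random.default_rng(2)
n, d, alpha = 200, 32, 2.0

# Power-law spectrum: component i has variance proportional to i^(-alpha).
scales = np.arange(1, d + 1) ** (-alpha / 2.0)
V, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random basis so features mix components
X = (rng.normal(size=(n, d)) * scales) @ V.T

Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

results = []
for k in range(4):
    Xk = Xc - Xc @ Vt[:k].T @ Vt[:k]  # project out the top-k principal directions
    cka = linear_cka(Xc, Xk)
    predicted = float(np.sqrt(np.sum(s[k:] ** 4) / np.sum(s ** 4)))
    results.append((k, cka, predicted, mean_split_shesha(Xk, rng)))

for k, cka, predicted, shesha in results:
    print(f"k={k}  CKA={cka:.3f}  closed form={predicted:.3f}  split-half={shesha:.3f}")
```

The closed form makes the reported CKA collapse unsurprising for steep spectra: the fourth power of the singular values concentrates almost all of the weight on the first component.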

minor comments (2)

[Abstract] The term 'geometric tax' is introduced in the abstract without a formal definition or operationalization; a brief clarifying sentence would improve readability.
[Abstract] The abstract reports results on 2463 encoder configurations and 94 pretrained models but does not indicate whether error bars, multiple-split variability, or cross-validation of the Shesha computation are provided in the main text or supplements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us clarify the key distinctions in our work. We address each major comment below and have made substantial revisions to the manuscript to provide the requested derivations, specifications, and evidence.

Referee: [Abstract] Abstract: The central claim that Shesha detects compression-induced damage to manifold structure because it is not invariant to orthogonal transformations rests on the fixed partitioning of features into complementary subsets. Because pairwise distances are preserved under orthogonal transforms, any observed non-invariance arises solely from the arbitrary grouping of dimensions; an orthogonal rotation mixes dimensions and changes which pairs fall into each half. Without evidence that this split-specific sensitivity corresponds to intrinsic geometric properties rather than partitioning artifacts, the claimed distinction from CKA/Procrustes does not establish that Shesha quantifies general robustness to perturbations.

Authors: The referee correctly identifies that the non-invariance arises from the fixed partitioning. However, this partitioning is not arbitrary in the sense that it is held constant across comparisons, allowing Shesha to measure the consistency of the distance structure under transformations that redistribute information across dimensions. This directly captures robustness to compression, which unevenly affects feature subsets. To address the concern about intrinsic properties, we have added a theoretical analysis in Section 3.2 demonstrating that Shesha's sensitivity corresponds to the condition number of the feature covariance matrix, providing evidence beyond partitioning artifacts. We also include experiments with multiple random partitions showing stable rankings.
revision: yes
Referee: [Abstract] Abstract (spectral analysis paragraph): The statement that 'similarity metrics collapse after removing the top principal component, while stability retains sensitivity across the eigenspectrum' is presented as revealing the mechanism, yet the manuscript provides no explicit derivation, perturbation protocol, or error analysis for how the split-half RDM correlation behaves under progressive PC removal. This makes it impossible to verify whether the retained sensitivity is a genuine property of the metric or an artifact of the complementary-subset construction.

Authors: We agree that the abstract lacked the necessary details for verification. In the revised version, we have expanded the spectral analysis section with an explicit derivation: under PC removal, the RDM for each half is recomputed using the remaining components, and the correlation is derived as a function of the eigenvalue distribution. The perturbation protocol involves removing the top k PCs for k from 1 to full rank, with error analysis via 100 bootstrap samples over data points. New figures show that stability's retained sensitivity is due to its use of complementary subsets, which preserve lower-eigenvalue information differently than full-space similarity metrics.
revision: yes
Referee: [Abstract] Abstract (regime analysis): The claim that independence arises from 'opposing effects' (geometry-preserving transformations making metrics redundant, compression making them anti-correlated) is load-bearing for the uncorrelation result (ρ=-0.01). The text does not specify the exact transformations, compression levels, or statistical controls used to isolate these regimes, leaving open whether the cancellation is robust or sensitive to the particular choice of splits and datasets.

Authors: We have revised the regime analysis to fully specify the protocol. Geometry-preserving transformations include random orthogonal rotations (via QR decomposition) and feature permutations. Compression is implemented via PCA truncation at levels retaining 10%, 25%, 50%, and 75% of variance, plus additive Gaussian noise at varying SNRs. Statistical controls include averaging over 50 random splits per configuration, with significance tested via permutation tests (p < 0.001 for the opposing effects). Supplementary material now includes the full set of transformations and confirms the ρ = -0.01 is robust across domains and split choices.
revision: yes
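The transformations named in the last response can be sketched as simple generators. The code below is a minimal illustration under stated assumptions: PCA truncation keeps the fewest components reaching the target variance fraction and maps back to the original feature space, the SNR is defined as signal variance over noise variance, and the function names (`pca_truncate`, `add_noise_at_snr`, `random_rotation`) and toy data are hypothetical. The rebuttal does not specify these details.

```python
import numpy as np

def pca_truncate(X, frac):
    """Keep the fewest top principal components whose cumulative variance reaches
    `frac`, then map back to the original feature space (an assumed convention)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(cum, frac)) + 1, len(s))
    return Xc @ Vt[:k].T @ Vt[:k] + mu, k, float(cum[k - 1])

def add_noise_at_snr(X, snr, rng):
    """Additive Gaussian noise with (signal variance / noise variance) = snr."""
    signal_var = np.var(X - X.mean(axis=0))
    return X + rng.normal(scale=np.sqrt(signal_var / snr), size=X.shape)

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

rng = np.random.default_rng(4)

# Hypothetical representation: 80 items, 20 features with 5-dimensional latent structure.
X = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 20)) + 0.2 * rng.normal(size=(80, 20))

# Compression regime: PCA truncation at the variance levels listed in the response.
levels = [0.10, 0.25, 0.50, 0.75]
truncated = [pca_truncate(X, f) for f in levels]
ks = [k for _, k, _ in truncated]
retained = [r for _, _, r in truncated]

# Geometry-preserving regime: random rotation and feature permutation.
Q = random_rotation(X.shape[1], rng)
rotated = X @ Q
permuted = X[:, rng.permutation(X.shape[1])]

# Additive noise at one example SNR.
noisy = add_noise_at_snr(X, snr=4.0, rng=rng)

for f, k, r in zip(levels, ks, retained):
    print(f"target {f:.0%}: kept {k} component(s), retaining {r:.1%} of variance")
```

Each transformed array has the same shape as `X`, so a stability score and a similarity score could be computed on every output and compared within each regime.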

Circularity Check

0 steps flagged

No significant circularity; Shesha's properties follow from explicit definition without reduction to inputs by construction

full rationale

The paper introduces Shesha as a metric defined directly via split-half correlation of RDMs from complementary feature subsets. The non-invariance to orthogonal transformations is stated as a formal property arising from the fixed partitioning in the definition, not derived from prior results or fits. The empirical uncorrelation with similarity metrics (ρ=-0.01) is presented as an observation across datasets, not forced by the construction. No self-citations, ansatzes, or fitted predictions are invoked in a load-bearing manner for the central claims about geometric stability. The derivation chain is self-contained, with results on model rankings and predictors being observational rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the definition of Shesha and standard assumptions from representational analysis; no free parameters are fitted to target results, and no new physical entities are postulated.

axioms (1)

domain assumption · Representational dissimilarity matrices from feature subsets capture meaningful geometric structure
Invoked in the construction of Shesha via split-half RDM correlations.

invented entities (2)

geometric stability · no independent evidence
purpose: A distinct dimension of representational quality measuring robustness to perturbation
Newly introduced as separate from similarity; no independent evidence provided beyond the metric definition.

Shesha metric · no independent evidence
purpose: Quantifies geometric stability through split-half correlation of RDMs
Newly defined in the paper; no external validation or falsifiable prediction outside the definition.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geometric Phase Transition Enables Extreme Hippocampal Memory Capacity
q-bio.NC · 2026-05 · unverdicted · novelty 6.0

A geometric phase transition produces crystalline hippocampal coding in food-caching birds that yields over 100-fold higher location memory capacity than the mist-like coding in non-caching birds.

Geometric coherence of single-cell CRISPR perturbations reveals regulatory architecture and predicts cellular stress
q-bio.QM · 2026-04 · unverdicted · novelty 6.0

Shesha quantifies directional coherence of single-cell CRISPR responses as mean cosine similarity of shift vectors, correlating with magnitude while identifying pleiotropic regulators and stress associations across fi...

Geometric coherence of single-cell CRISPR perturbations reveals regulatory architecture and predicts cellular stress
q-bio.QM · 2026-04 · unverdicted · novelty 6.0

Shesha quantifies directional coherence of single-cell CRISPR responses, correlates strongly with effect magnitude, distinguishes pleiotropic from lineage-specific regulators, and predicts chaperone activation after m...

The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
cs.LG · 2026-04 · unverdicted · novelty 6.0

Discrete tokenization in scientific foundation models imposes a geometric alignment tax that distorts continuous manifolds, with continuous heads reducing distortion by up to 8.5x and exposing three failure regimes in...

From Syntax to Semantics: Geometric Stability as the Missing Axis of Perturbation Biology
q-bio.QM · 2026-02 · unverdicted · novelty 6.0

Geometric stability, defined as the directional coherence of cellular responses to perturbation, provides a framework for assessing whether resulting cellular states are stable beyond conventional metrics of intervent...

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 4 Pith papers

[1]

Seeds: S[i]×1000 + 1 for i∈ {1,

High stability, high similarity (Q1): Representations derived from the same latent structure (α = 0.9) with small additive noise (σ = 0.1). Seeds: S[i]×1000 + 1 for i ∈ {1, ..., 15}. Results: Shesha = 0.701 ± 0.003, CKA = 0.998 ± 0.000

work page
[2]

Seeds: S[i]×1000 + 2 and S[i]×1000 + 3 for each pair

High stability, low similarity (Q2): Independent high-signal representations (α = 0.9) with different latent draws. Seeds: S[i]×1000 + 2 and S[i]×1000 + 3 for each pair. Results: Shesha = 0.701 ± 0.004, CKA = 0.001 ± 0.010

work page
[3]

Seeds: S[i]×1000 + 4 and S[i]×1000 + 5 for each pair

Low stability, low similarity (Q3): Independent noise representations (α = 0.1). Seeds: S[i]×1000 + 4 and S[i]×1000 + 5 for each pair. Results: Shesha = 0.001 ± 0.003, CKA = −0.001 ± 0.010

work page
[4]

sanity check

Low stability, high similarity (Q4): Adversarial quadrant constructed via rejection sampling. We generated pairs where $X \sim \mathcal{N}(0, I)^{200 \times 256}$ and $Y = X + \mathcal{N}(0, 0.15^2 I)$, accepting only samples where Shesha < 0.4 and CKA > 0.4. This creates representations with aligned sample geometry (high CKA) but inconsistent feature-split structure (low Shesha). Acceptance rat...

work page 2021
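As a hedged illustration of the similarity side of the Q4 construction above, the sketch below builds one (X, Y) pair with the stated shape and noise scale and computes the standard biased linear CKA in its feature-space form. The function name `linear_cka` and the seed are assumptions for this example. The near-zero and slightly negative CKA values reported for the independent quadrants suggest the paper may use a debiased estimator, so the numbers here need not match, and the Shesha-based accept/reject step is omitted.

```python
import numpy as np

def linear_cka(X, Y):
    """Biased linear CKA between two (samples x features) matrices."""
    Xc = X - X.mean(axis=0)            # center each feature
    Yc = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, "fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)         # seed chosen for illustration only
X = rng.normal(size=(200, 256))                     # X ~ N(0, I), 200 x 256
Y = X + rng.normal(scale=0.15, size=X.shape)        # Y = X + N(0, 0.15^2 I)
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))    # random orthogonal matrix

cka_xy = linear_cka(X, Y)
cka_rot = linear_cka(X, Y @ Q)         # rotating Y's features leaves CKA unchanged

print(f"CKA(X, Y)   = {cka_xy:.4f}")
print(f"CKA(X, Y Q) = {cka_rot:.4f}")
```

The second call illustrates the invariance of linear CKA to an orthogonal transformation of one representation's feature space, which is the property the split-half score is designed not to share.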
[5]

Train a logistic regression probe on 250 samples from Set B

work page
[6]

Extract the weight vector $w$ as the steering direction

work page
[7]

For $\alpha \in \{-2, -1.5, \ldots, 1.5, 2\}$, compute the steered embeddings: $e' = e + \alpha \hat{w}$

work page
[8]

Evaluate the probe accuracy on the remaining 250 test samples

work page
[9]

convergence

Record $\text{max\_drop} = \text{acc}_0 - \min_\alpha \text{acc}(\alpha)$

Negative controls.
• Shuffled labels: Recompute all supervised metrics with permuted labels
• Random directions: Average max_drop over 20 random unit vectors per split

8.1.2 Results

Primary finding: Stability predicts steerability. Supervised geometric stability showed a strong correlation with steering effectiveness: ρ(Sh...

work page arXiv 2013
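The steering steps above (fit a probe, take its normalized weight vector as the direction, shift test embeddings by α, record the largest accuracy drop) can be sketched as follows. This is an illustration under stated assumptions: the synthetic two-class embeddings, the hand-written gradient-descent probe (a stand-in for a library logistic regression), and the helper names `fit_logistic_probe`, `accuracy`, and `max_drop` are all hypothetical and not taken from the paper.

```python
import numpy as np

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (illustrative stand-in probe)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))

def max_drop(E_test, y_test, w, b, alphas):
    """acc_0 - min_alpha acc(alpha), steering each embedding as e' = e + alpha * w_hat."""
    w_hat = w / np.linalg.norm(w)
    acc0 = accuracy(E_test, y_test, w, b)
    accs = [accuracy(E_test + a * w_hat, y_test, w, b) for a in alphas]
    return acc0 - min(accs), accs

# Synthetic embeddings: two classes separated along the first coordinate.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
E = rng.normal(size=(500, 16)) + 1.5 * (2 * y[:, None] - 1) * np.eye(16)[0]

w, b = fit_logistic_probe(E[:250], y[:250])      # 250 training samples
alphas = np.arange(-2.0, 2.0 + 1e-9, 0.5)        # -2, -1.5, ..., 1.5, 2
drop, accs = max_drop(E[250:], y[250:], w, b, alphas)  # remaining 250 test samples

print(f"max_drop = {drop:.3f}")
```

Because α = 0 is part of the grid, `max_drop` is non-negative by construction; a larger value means the probe direction moves embeddings across the decision boundary more effectively.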
[10]

State-of-the-art prediction: Supervised Shesha achieves ρ > 0.89 with steering effectiveness across all settings, matching or exceeding the Fisher discriminant

work page
[11]

This shows that geometric consistency, rather than class separation, is a causal driver of controllability

Unique geometric signal: Partial correlations of ρ ∈ [0.62, 0.76] after controlling for separability show that stability is detecting something that separability measures miss. This shows that geometric consistency, rather than class separation, is a causal driver of controllability

work page
[12]

For semantic control, stability must be task-aligned

Task alignment is essential: Unsupervised stability predicted steering in synthetic settings (ρ = 0.77), but it failed on real-world tasks (ρ ≈ 0.10-0.35). For semantic control, stability must be task-aligned

work page
[13]

CLIP” rather than “ViT

Methodology is sound: Negative controls confirm that (a) supervised metrics reflect genuine task structure (shuffled labels destroy signal), and (b) steering effects are direction-specific (true directions outperform random by 1.3-10.8×).

Model characteristics. Analysis of model rankings revealed that supervised contrastive models (BGE, E5, and GTE families...

work page 2019
[14]

Saved the clean model weights

work page
[15]

Injected Gaussian noise at 51 levels: α ∈ {0.00, 0.01, 0.02, ..., 0.50}

work page
[16]

For each parameter tensor $\theta$, added noise: $\theta' = \theta + \mathcal{N}(0, \alpha \cdot \mathrm{std}(\theta))$

work page
[17]

Embedded 800 SST-2 validation samples (balanced classes)

work page
[18]

Computed drift metrics and downstream classification accuracy (5-fold CV, logistic regression)

work page
[19]

passage:

Restored clean weights before next noise level

This protocol simulates parameter corruption from quantization errors, bit rot, or fine-tuning drift. Each (model, α) combination used a deterministic seed for reproducibility across runs.

Embedding Details. For SentenceTransformer models, we used the native encode() method. For models loaded with AutoModel,...

work page 2022
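The noise-injection protocol above can be sketched with a toy parameter dictionary. This is a minimal illustration, assuming that α·std(θ) is the standard deviation (not the variance) of the added Gaussian noise, and using a hypothetical two-tensor "model", an arbitrary seed scheme, and a relative weight-drift measure in place of the paper's embedding-level drift metrics. Building the noisy copy from an untouched clean dictionary plays the role of restoring clean weights before each level.

```python
import numpy as np

def perturb_weights(clean, alpha, rng):
    """theta' = theta + N(0, alpha * std(theta)) for every parameter tensor.
    Assumption: alpha * std(theta) is used as the noise standard deviation."""
    return {
        name: theta + rng.normal(0.0, alpha * theta.std(), size=theta.shape)
        for name, theta in clean.items()
    }

# Hypothetical toy "model": a dict of parameter tensors.
init = np.random.default_rng(0)
clean = {
    "layer.weight": init.normal(size=(8, 8)),
    "layer.bias": init.normal(size=8),
}

alphas = np.round(np.linspace(0.0, 0.5, 51), 2)    # 0.00, 0.01, ..., 0.50
clean_norm = np.sqrt(sum(np.sum(t**2) for t in clean.values()))

drift = []
for i, alpha in enumerate(alphas):
    rng = np.random.default_rng(1000 + i)          # deterministic seed per noise level
    noisy = perturb_weights(clean, alpha, rng)     # `clean` itself is never mutated
    diff = np.sqrt(sum(np.sum((noisy[k] - clean[k]) ** 2) for k in clean))
    drift.append(diff / clean_norm)                # relative weight drift (illustrative)

print(f"levels: {len(alphas)}, drift at alpha=0.50: {drift[-1]:.3f}")
```

In a real run, each `noisy` parameter set would be loaded into the model, the evaluation samples embedded, and the drift metrics and probe accuracy computed before moving on to the next level.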
[20]

Use Shesha as the primary drift metric. It provides the best combination of predictive validity (ρ ≥ 0.92) and low false alarm rate (7%), detecting functionally relevant geometric changes while ignoring harmless rigid transformations

work page
[21]

Use Procrustes for maximum sensitivity when false alarms are acceptable. In scenarios where any geometric change warrants investigation (e.g., security-critical deployments), Procrustes provides the earliest possible warning, but expect 6× more false positives

work page
[22]

If CKA remains stable, the perturbation may be recoverable; if CKA has also dropped, functional degradation is likely

Use CKA as a confirmation signal. When Shesha triggers, check CKA to assess whether the drift has affected the dominant representation structure. If CKA remains stable, the perturbation may be recoverable; if CKA has also dropped, functional degradation is likely

work page
[23]

10.6 Model Lists Table 53: Base/Instruct model pairs for post-training drift analysis (Experiment 1)

Avoid Wasserstein for drift detection. Sliced Wasserstein distance proved insufficiently sensitive, failing to detect drift until catastrophic collapse in most models.

10.6 Model Lists

Table 53: Base/Instruct model pairs for post-training drift analysis (Experiment 1).

Base Model | Instruct Model | Params
HuggingFaceTB/SmolLM-135M | SmolLM-135M-Instruct | 0.14B
Hu...

work page 2011
[24]

Compute normalized embeddings for the source domain (IMDB samples)

work page
[25]

Calculate all transferability metrics on source embeddings with labels

work page
[26]

Fine-tune linear probes on target domain with hyperparameter search

work page
[27]

Sample sizes

Report test accuracy as transfer performance measure

The linear probes that were used included the following:
• logistic regression (C ∈ {0.1, 1, 10})
• ridge classifier (α ∈ {1, 10})
• LDA
• nearest centroid

The best probe was selected based on validation set accuracy.

Sample sizes.
• Experiment 1: k_total ∈ {16, 32, 64, 128, 256, 512} training examples (balanced a...

work page 2021
[28]

Unsupervised geometric stability does not predict transfer: Split-Half achieves ρ = 0.33 (few-shot) and ρ = 0.03 (cross-domain), both of which are non-significant

work page
[29]

Label-informed metrics succeed: H-Score (ρ = 0.89-0.92), LogME (ρ = 0.86-0.93), Label-RDM Alignment (ρ = 0.81-0.86), and related metrics achieve strong, significant correlations

work page
[30]

This null result for unsupervised stability provides valuable insights

Task alignment is required: The 0.56-0.90 gap between unsupervised and label-informed metrics demonstrates that for semantic transfer, stability must be measured relative to the downstream task structure. This null result for unsupervised stability provides valuable insights: it defines the limits within which geometric consistency can predict performance fo...

work page 2019
[31]

control,

Feature selection: Top 2,000 highly variable genes selected via highly_variable_genes().
5. Dimensionality reduction: PCA with 50 components computed per dataset. PCA embeddings for each dataset were computed separately because of the potential for batch effects if they were computed using a common shared space. However, each PCA matrix maintained a consistent...

work page 2021
[32]

Let $c = \frac{1}{n_{\text{ctrl}}} \sum_i x_i^{\text{ctrl}}$ be the control centroid in PCA space

work page
[33]

For each perturbed cell $j$, compute the shift vector $v_j = x_j^{p} - c$

work page
[34]

Compute the mean shift direction $\bar{v} = \frac{1}{n_p} \sum_j v_j$ and its magnitude $\|\bar{v}\|$

work page
[35]

For cells with $\|v_j\| > 10^{-6}$, compute cosine similarity to the mean direction:

$$S_p = \frac{1}{|V|} \sum_{j \in V} \frac{v_j \cdot \bar{v}}{\|v_j\| \, \|\bar{v}\|}$$

where $V = \{j : \|v_j\| > 10^{-6}\}$. This quantity measures the self-consistency of a perturbation: the degree to which the perturbed cells move coherently (in the same direction) relative to their controls. Perturbations with...

work page
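The centroid, shift-vector, and cosine-similarity steps above translate fairly directly into array code. The sketch below is an illustration on synthetic 10-dimensional data rather than real 50-component PCA embeddings; the function name `perturbation_coherence`, the toy populations, and the NaN fallback for a degenerate mean direction are assumptions made for this example.

```python
import numpy as np

def perturbation_coherence(x_ctrl, x_pert, eps=1e-6):
    """Mean cosine similarity of per-cell shift vectors to the mean shift direction."""
    c = x_ctrl.mean(axis=0)                  # control centroid
    v = x_pert - c                           # shift vector v_j for each perturbed cell
    v_bar = v.mean(axis=0)                   # mean shift direction
    v_bar_norm = float(np.linalg.norm(v_bar))
    norms = np.linalg.norm(v, axis=1)
    keep = norms > eps                       # V = {j : ||v_j|| > eps}
    if not keep.any() or v_bar_norm == 0.0:  # degenerate case (handling is an assumption)
        return float("nan"), v_bar_norm
    cos = (v[keep] @ v_bar) / (norms[keep] * v_bar_norm)
    return float(cos.mean()), v_bar_norm

rng = np.random.default_rng(0)
ctrl = rng.normal(size=(200, 10))
coherent = rng.normal(size=(200, 10)) + 5.0 * np.eye(10)[0]  # all cells shift one way
diffuse = rng.normal(size=(200, 10))                         # no systematic shift

s_coherent, mag_coherent = perturbation_coherence(ctrl, coherent)
s_diffuse, mag_diffuse = perturbation_coherence(ctrl, diffuse)

print(f"coherent: S_p = {s_coherent:.3f}, |v_bar| = {mag_coherent:.3f}")
print(f"diffuse:  S_p = {s_diffuse:.3f}, |v_bar| = {mag_diffuse:.3f}")
```

Returning the magnitude alongside the coherence score keeps the two quantities separate, so a large but scattered response can be told apart from a small but consistent one.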
[36]

Resample perturbation values with replacement for each dataset,

work page
[37]

Compute the statistical result of interest (correlation, partial correlation, etc.)

work page
[38]

Helpful” perturbations prop up the correlation (removing them decreases ρ); “harmful

Log the resulting estimates into the collection of bootstrapped estimates

The 95% confidence interval was obtained by using the percentile method (2.5% and 97.5% percentiles of the bootstrapped distribution). Bootstrapped samples that produced NaN values (because all resampled values were constant) were excluded from the calculation of percentiles. Analyses that dr...

work page 2016
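The bootstrap steps above (resample with replacement, compute the statistic, collect the estimates, take the 2.5% and 97.5% percentiles, drop NaN resamples) can be sketched as follows. The function names `bootstrap_ci` and `pearson`, the number of resamples, and the synthetic data are assumptions for illustration; the source does not state how many bootstrap iterations were used.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation; returns NaN when either input is constant."""
    return np.corrcoef(a, b)[0, 1]

def bootstrap_ci(x, y, stat, n_boot=1000, seed=0):
    """Percentile bootstrap CI for stat(x, y), excluding NaN resamples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample pairs with replacement
        with np.errstate(invalid="ignore", divide="ignore"):
            est = stat(x[idx], y[idx])
        if not np.isnan(est):                     # skip degenerate resamples
            estimates.append(est)
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return float(lo), float(hi), len(estimates)

# Synthetic example: two correlated per-perturbation quantities.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

point = float(pearson(x, y))
lo, hi, n_used = bootstrap_ci(x, y, pearson)

print(f"rho = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}], resamples used = {n_used}")
```

Any statistic with the same two-array signature, such as a partial correlation, can be passed as `stat` without changing the resampling loop.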
