Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models
Pith reviewed 2026-05-12 02:46 UTC · model grok-4.3
The pith
Dimensional Coactivation checks whether the same feature dimensions coactivate across semantic regions in a single frozen model input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dimensional Coactivation (DCA) measures representational consistency by comparing whether the same feature dimensions coactivate across semantic subregions of one input, deliberately avoiding centering, L2 normalization, and full Gram coupling, since the coordinate system is fixed and raw magnitudes carry signal. With frozen DINOv3 features, an eyes-mouth-nose fingerprint reaches 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. Ablations confirm the design: centering drops CelebDF-v2 AUC to 0.459, L2 normalization to 0.862, cross-dimension coupling to 0.478, and replacing DINOv3 with FaRL drops it to 0.582.
What carries the argument
Dimensional Coactivation (DCA): a per-dimension instrument that checks coactivation of the same feature dimensions across semantic subregions without centering or normalization.
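The measurement described above can be sketched in code. The pith does not fully specify the construction, so this is a minimal illustration under two assumptions: each semantic region is summarized by mean-pooling frozen patch tokens, and the fingerprint concatenates per-dimension elementwise products over region pairs (no centering, no L2 normalization, no cross-dimension Gram terms).

```python
import numpy as np

def region_descriptor(patch_feats, mask):
    """Mean-pool frozen patch features over one semantic region.

    patch_feats: (N, D) array of frozen backbone patch tokens.
    mask: (N,) boolean array selecting the region's patches.
    No centering or L2 normalization: raw magnitudes are kept.
    """
    return patch_feats[mask].mean(axis=0)

def dca_fingerprint(patch_feats, masks):
    """Per-dimension coactivation fingerprint (a sketch, not the paper's code).

    For each region pair (e.g. eyes-mouth, eyes-nose, mouth-nose) the
    elementwise product r_a * r_b records, dimension by dimension, which
    feature dimensions coactivate across the two regions. There is no
    cross-dimension (Gram) coupling: dimension k of one region is only
    ever compared with dimension k of another.
    """
    regions = [region_descriptor(patch_feats, m) for m in masks]
    pairs = []
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            pairs.append(regions[i] * regions[j])  # per-dimension only
    return np.concatenate(pairs)
```

A downstream classifier (the paper's choice is not stated in the pith) would then score this fingerprint as real or fake.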
If this is right
- Deepfake detectors can extract eyes-mouth-nose fingerprints from frozen DINOv3 features for high cross-dataset AUC without retraining the backbone.
- Standard centering and normalization operations erase the intra-sample signal that DCA is designed to capture.
- DCA performance depends on backbones like DINOv3 that maintain a stable per-dimension coordinate system, not on region extraction alone.
- Replacing the backbone with FaRL sharply reduces AUC, showing that the method depends on the specific coordinate system of the chosen model.
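The ablation contrasts behind these points can be made concrete as variants of one similarity computation on two region descriptors. This is an illustrative sketch, not the paper's implementation: each function below corresponds to one ablated design choice from the core claim.

```python
import numpy as np

def coactivation(a, b):
    """DCA-style score: per-dimension products, raw magnitudes kept."""
    return float(np.dot(a, b))

def centered(a, b):
    """Ablation: centering removes each vector's mean before comparing
    (drops CelebDF-v2 AUC to 0.459 in the reported results)."""
    return float(np.dot(a - a.mean(), b - b.mean()))

def l2_normalized(a, b):
    """Ablation: L2 normalization discards magnitude, leaving cosine
    similarity (0.862 reported)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gram_coupled(a, b):
    """Ablation: summing the full outer product mixes every dimension
    with every other (0.478 reported)."""
    return float(np.sum(np.outer(a, b)))
```

The contrast is visible even on toy vectors: only `coactivation` keeps both the magnitude and the per-dimension alignment of the raw features.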
Where Pith is reading between the lines
- DCA could be applied to anomaly detection in natural scenes to test whether intra-image coherence correlates with other visual tasks.
- Future model training that explicitly preserves raw magnitude information might increase the usefulness of such coherence probes.
- Testing DCA on temporal sequences or 3D data could check whether the same per-dimension consistency holds across time or depth.
- Pairing DCA with other internal probes might map what the fixed coordinate system encodes without additional supervision.
Load-bearing premise
The learned coordinate system stays fixed within any single input so that raw feature magnitudes directly signal coherence between regions.
What would settle it
A collection of deepfake faces engineered to preserve the exact per-dimension coactivation patterns of real faces across eyes, mouth, and nose, yet still scored as fake by DCA, would challenge whether the measure truly detects the claimed representational break.
Figures
Original abstract
Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dimensional Coactivation (DCA) as a per-dimension instrument to measure intra-sample representational consistency in frozen vision foundation models. It argues that the learned coordinate system remains fixed within a single input and that raw feature magnitudes carry signal, making centering, L2 normalization, and full Gram coupling mismatched; DCA is validated as a deepfake detector via an eyes-mouth-nose fingerprint on DINOv3 features, reporting AUCs of 0.9106 on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer, with ablations showing sharp drops when the design choices are altered.
Significance. If the results hold, DCA offers a lightweight, training-free probe for the internal coherence of representations in frozen foundation models, with immediate utility for detecting synthetic content that preserves local appearance but breaks cross-region consistency. The empirical ablations (centering, L2, cross-dimension coupling, and model swap) provide direct evidence that performance is tied to the proposed per-dimension, magnitude-preserving approach rather than region extraction alone.
major comments (1)
- Results section (AUC reporting): the headline AUC values (0.9106, 0.9289) and ablation deltas are presented without error bars, standard deviations, or statistical significance tests; this makes it difficult to determine whether the observed differences (e.g., 0.9106 vs. 0.459 for centering) exceed what would be expected from sampling variability alone.
minor comments (3)
- The exact procedure for defining and extracting the eyes-mouth-nose regions (and constructing the fingerprint) should be stated with pseudocode or a small diagram to ensure reproducibility.
- A brief comparison table contrasting DCA with classical measures (cosine, Pearson, etc.) on the same intra-sample task would help readers see the claimed mismatch more concretely.
- The manuscript should explicitly state whether region masks are obtained via ground-truth annotations, off-the-shelf detectors, or model attention; this choice affects the interpretation of the ablation results.
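The third point can be made concrete. Below is a minimal, hypothetical mapping from a facial-region bounding box (as any off-the-shelf landmark detector would produce; the paper does not specify its procedure) to a ViT patch mask; the `grid` and `patch` defaults assume a 224x224 input with 14-pixel patches.

```python
import numpy as np

def landmarks_to_patch_mask(box, grid=16, patch=14):
    """Map a facial-region bounding box (pixels) to a ViT patch mask.

    box: (x0, y0, x1, y1) in pixels for, e.g., the eyes region.
    grid: patches per side; patch: patch size in pixels.
    A patch is included when its center falls inside the box;
    this inclusion rule is one of several reasonable choices.
    """
    mask = np.zeros(grid * grid, dtype=bool)
    x0, y0, x1, y1 = box
    for r in range(grid):
        for c in range(grid):
            cx, cy = (c + 0.5) * patch, (r + 0.5) * patch
            if x0 <= cx < x1 and y0 <= cy < y1:
                mask[r * grid + c] = True
    return mask
```

Stating an explicit rule like this (or its ground-truth / attention-based alternative) is what the minor comment asks the manuscript to pin down.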
Simulated Author's Rebuttal
We thank the referee for their positive recommendation of minor revision and for the constructive comment on result reporting. We address the point below.
Point-by-point responses
Referee: Results section (AUC reporting): the headline AUC values (0.9106, 0.9289) and ablation deltas are presented without error bars, standard deviations, or statistical significance tests; this makes it difficult to determine whether the observed differences (e.g., 0.9106 vs. 0.459 for centering) exceed what would be expected from sampling variability alone.
Authors: We agree that measures of variability would improve the presentation. The reported differences are large in absolute terms (0.9106 vs. 0.459 for centering; 0.9106 vs. 0.862 for L2 normalization; 0.9106 vs. 0.478 for cross-dimension coupling), making it unlikely that they arise from sampling variability alone. Nevertheless, to address the concern directly, the revised manuscript will include standard deviations computed over five independent runs that vary the random seed for both feature extraction and the downstream classifier. We will also add bootstrap-derived 95% confidence intervals for the headline AUCs and note the statistical significance of the primary ablation contrasts.
Revision: yes
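The bootstrap confidence intervals the authors promise can be sketched as follows; the resampling scheme and the rank-based AUC here are standard choices, not details taken from the manuscript.

```python
import numpy as np

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap (1 - alpha) confidence interval for AUC (a sketch).

    Resamples (label, score) pairs with replacement and recomputes a
    rank-based AUC (normalized Mann-Whitney U) on each resample.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)

    def auc(y, s):
        pos, neg = s[y == 1], s[y == 0]
        return (pos[:, None] > neg[None, :]).mean()

    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        y = labels[idx]
        if y.min() == y.max():  # resample lacked one class; skip it
            continue
        aucs.append(auc(y, scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

With intervals like these attached to 0.9106 and 0.459, the ablation contrast the referee questions becomes directly checkable.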
Circularity Check
No significant circularity identified
full rationale
The paper defines DCA as a per-dimension coactivation measure on frozen features that deliberately omits centering, L2 normalization, and full Gram coupling, then validates the design choice through ablations on an external deepfake detection task (CelebDF-v2 AUC drops to 0.459 with centering, 0.862 with L2, 0.478 with cross-dimension coupling). These ablations and cross-dataset transfer results (0.9106 AUC on CelebDF-v2, 0.9289 on DFD) are independent measurements, not reductions of the method to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work appear load-bearing; the coordinate-system stability claim is directly falsified by the reported experiments rather than assumed.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Frozen vision foundation models organize images through a learned coordinate system that remains internally coherent within a single input.
invented entities (1)
- Dimensional Coactivation (DCA): no independent evidence