Pith · machine review for the scientific record

arxiv: 2605.08249 · v1 · submitted 2026-05-07 · 💻 cs.CV · eess.IV · eess.SP

Recognition: no theorem link

Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:46 UTC · model grok-4.3

classification 💻 cs.CV · eess.IV · eess.SP

keywords representational consistency · dimensional coactivation · frozen vision models · deepfake detection · DINOv3 · feature coactivation · intra-sample coherence

The pith

Dimensional Coactivation checks whether the same feature dimensions coactivate across semantic regions in a single frozen model input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Frozen vision foundation models organize images through a learned coordinate system, and the paper asks whether this system stays internally coherent within one sample across its parts. Representational Consistency is the property that one input is represented coherently across semantic subregions. Dimensional Coactivation measures this by testing whether identical dimensions activate together across regions like eyes, mouth, and nose, without centering, L2 normalization, or full Gram coupling. Deepfake detection serves as the validation task because synthetic faces can look locally realistic while breaking the links that hold in real faces. Experiments with DINOv3 features produce strong cross-dataset results, and ablations demonstrate that reintroducing those avoided operations collapses performance.

Core claim

Dimensional Coactivation (DCA) measures representational consistency by comparing whether the same feature dimensions coactivate across semantic subregions of one input, deliberately avoiding centering, L2 normalization, and full Gram coupling since the coordinate system is fixed and raw magnitudes carry signal. With frozen DINOv3 features an eyes-mouth-nose fingerprint reaches 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. Ablations confirm the design: centering drops CelebDF-v2 AUC to 0.459, L2 normalization to 0.862, cross-dimension coupling to 0.478, and replacement of DINOv3 by FaRL drops it to 0.582.

What carries the argument

Dimensional Coactivation (DCA): a per-dimension instrument that checks coactivation of the same feature dimensions across semantic subregions without centering or normalization.
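The abstract does not publish the exact computation, but the description constrains it: same-dimension coactivation across regions, with no centering, no L2 normalization, and no cross-dimension (full Gram) coupling. A minimal sketch under those constraints — the region-pooling step, the elementwise-product form, and the region names are assumptions, not the paper's stated implementation:

```python
import numpy as np

def dca_fingerprint(region_feats):
    """Per-dimension coactivation across semantic regions of one input.

    region_feats: dict mapping region name -> (D,) pooled feature vector
    from a frozen backbone (e.g. DINOv3 patch tokens mean-pooled per region).
    Returns one fingerprint vector built from per-dimension products for
    each region pair, deliberately skipping centering, L2 normalization,
    and cross-dimension coupling so raw magnitudes are preserved.
    """
    names = sorted(region_feats)
    pairs = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Elementwise product = diagonal of the outer product:
            # dimension d contributes only through its own activations,
            # never through cross-dimension (full Gram) terms.
            pairs[(a, b)] = region_feats[a] * region_feats[b]
    return np.concatenate([pairs[k] for k in sorted(pairs)])

# The ablations the paper reports would correspond to preprocessing the
# region vectors with operations like these before taking the product:
def centered(v):       # centering (collapses CelebDF-v2 AUC to 0.459)
    return v - v.mean()

def l2_normalized(v):  # L2 normalization (drops AUC to 0.862)
    return v / (np.linalg.norm(v) + 1e-8)
```

In this reading, the ablation contrast is mechanical: `centered` discards the mean magnitude per vector, `l2_normalized` discards overall scale, and replacing the elementwise product with a full outer product reintroduces the cross-dimension coupling the paper argues is mismatched to the intra-sample setting.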

If this is right

  • Deepfake detectors can extract eyes-mouth-nose fingerprints from frozen DINOv3 features for high cross-dataset AUC without retraining the backbone.
  • Standard centering and normalization operations erase the intra-sample signal that DCA is designed to capture.
  • DCA performance depends on backbones like DINOv3 that maintain a stable per-dimension coordinate system, not on region extraction alone.
  • Replacing the backbone with FaRL sharply reduces AUC, showing that the method depends on the specific coordinate system of the chosen model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • DCA could be applied to anomaly detection in natural scenes to test whether intra-image coherence correlates with other visual tasks.
  • Future model training that explicitly preserves raw magnitude information might increase the usefulness of such coherence probes.
  • Testing DCA on temporal sequences or 3D data could check whether the same per-dimension consistency holds across time or depth.
  • Pairing DCA with other internal probes might map what the fixed coordinate system encodes without additional supervision.

Load-bearing premise

The learned coordinate system stays fixed within any single input so that raw feature magnitudes directly signal coherence between regions.

What would settle it

A collection of deepfake faces engineered to preserve the exact per-dimension coactivation patterns of real faces across eyes, mouth, and nose yet still scored as fake by DCA would challenge whether the measure truly detects the claimed representational break.

Figures

Figures reproduced from arXiv: 2605.08249 by Izaldein Al-Zyoud and Abdulmotaleb El Saddik.

Figure 1. DCA pipeline. A face frame is converted into frozen DINOv3 patch tokens, assigned to … view at source ↗
Original abstract

Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces Dimensional Coactivation (DCA) as a per-dimension instrument to measure intra-sample representational consistency in frozen vision foundation models. It argues that the learned coordinate system remains fixed within a single input and that raw feature magnitudes carry signal, making centering, L2 normalization, and full Gram coupling mismatched; DCA is validated as a deepfake detector via an eyes-mouth-nose fingerprint on DINOv3 features, reporting AUCs of 0.9106 on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer, with ablations showing sharp drops when the design choices are altered.

Significance. If the results hold, DCA offers a lightweight, training-free probe for the internal coherence of representations in frozen foundation models, with immediate utility for detecting synthetic content that preserves local appearance but breaks cross-region consistency. The empirical ablations (centering, L2, cross-dimension coupling, and model swap) provide direct evidence that performance is tied to the proposed per-dimension, magnitude-preserving approach rather than region extraction alone.

major comments (1)
  1. Results section (AUC reporting): the headline AUC values (0.9106, 0.9289) and ablation deltas are presented without error bars, standard deviations, or statistical significance tests; this makes it difficult to determine whether the observed differences (e.g., 0.9106 vs. 0.459 for centering) exceed what would be expected from sampling variability alone.
minor comments (3)
  1. The exact procedure for defining and extracting the eyes-mouth-nose regions (and constructing the fingerprint) should be stated with pseudocode or a small diagram to ensure reproducibility.
  2. A brief comparison table contrasting DCA with classical measures (cosine, Pearson, etc.) on the same intra-sample task would help readers see the claimed mismatch more concretely.
  3. The manuscript should explicitly state whether region masks are obtained via ground-truth annotations, off-the-shelf detectors, or model attention; this choice affects the interpretation of the ablation results.
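The region-extraction procedure the first minor comment asks for fits in a few lines. A hypothetical sketch, assuming patch tokens on an (H, W) grid and boolean region masks projected from an off-the-shelf facial landmark detector — the mask source is exactly the unstated choice the third minor comment flags:

```python
import numpy as np

def region_pooled_features(patch_tokens, region_masks):
    """One plausible way to build the eyes-mouth-nose inputs to DCA.

    patch_tokens: (H, W, D) grid of frozen backbone patch features.
    region_masks: dict region name -> (H, W) boolean mask, e.g. derived
    by projecting facial landmarks onto the token grid (assumption; the
    paper does not specify whether masks come from annotations,
    detectors, or model attention).
    Returns dict region -> (D,) mean-pooled feature, magnitudes untouched.
    """
    out = {}
    for name, mask in region_masks.items():
        if not mask.any():
            raise ValueError(f"empty mask for region {name!r}")
        # Boolean indexing selects the (n, D) tokens inside the region;
        # a raw mean keeps per-dimension magnitudes, with no normalization.
        out[name] = patch_tokens[mask].mean(axis=0)
    return out
```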

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive report. We address the major comment on result reporting below.

Point-by-point responses
  1. Referee: Results section (AUC reporting): the headline AUC values (0.9106, 0.9289) and ablation deltas are presented without error bars, standard deviations, or statistical significance tests; this makes it difficult to determine whether the observed differences (e.g., 0.9106 vs. 0.459 for centering) exceed what would be expected from sampling variability alone.

    Authors: We agree that measures of variability would improve the presentation. The reported differences are large in absolute terms (0.9106 vs. 0.459 for centering; 0.9106 vs. 0.862 for L2 normalization; 0.9106 vs. 0.478 for cross-dimension coupling), making it unlikely that they arise from sampling variability alone. Nevertheless, to address the concern directly, the revised manuscript will include standard deviations computed over five independent runs that vary the random seed for both feature extraction and the downstream classifier. We will also add bootstrap-derived 95% confidence intervals for the headline AUCs and note the statistical significance of the primary ablation contrasts. revision: yes
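The variability reporting the rebuttal promises is simple to implement. A minimal sketch of a percentile-bootstrap 95% confidence interval for AUC, assuming per-sample labels and detector scores; the rank-based AUC below is the standard Mann-Whitney estimator, not code from the paper:

```python
import numpy as np

def auc_mann_whitney(labels, scores):
    """AUC via the Mann-Whitney U statistic (rank-based, ties averaged)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):          # average ranks over tied scores
        tie = scores == v
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling samples with replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].min() == labels[idx].max():
            continue                      # skip one-class resamples
        aucs.append(auc_mann_whitney(labels[idx], scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

With the large absolute gaps reported (0.9106 vs. 0.459), bootstrap intervals of this form would make the "exceeds sampling variability" argument explicit rather than asserted.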

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper defines DCA as a per-dimension coactivation measure on frozen features that deliberately omits centering, L2 normalization, and full Gram coupling, then validates the design choice through ablations on an external deepfake detection task (CelebDF-v2 AUC drops to 0.459 with centering, 0.862 with L2, 0.478 with cross-dimension coupling). These ablations and cross-dataset transfer results (0.9106 AUC on CelebDF-v2, 0.9289 on DFD) are independent measurements, not reductions of the method to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work appear load-bearing; the coordinate-system stability claim is directly tested by the reported experiments rather than assumed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption about fixed model coordinate systems and introduces DCA as a new measurement construct. No explicit free parameters are described in the abstract.

axioms (1)
  • domain assumption Frozen vision foundation models organize images through a learned coordinate system that remains internally coherent within a single input.
    This premise directly motivates the definition of Representational Consistency and the design of DCA.
invented entities (1)
  • Dimensional Coactivation (DCA) no independent evidence
    purpose: Per-dimension instrument that measures representational consistency by comparing coactivation of the same feature dimensions across semantic subregions without centering or normalization.
    Newly proposed in the paper; its utility is demonstrated via deepfake detection but lacks independent evidence outside this validation task.

pith-pipeline@v0.9.0 · 5624 in / 1486 out tokens · 109810 ms · 2026-05-12T02:46:36.429304+00:00 · methodology

