Beyond Neural Activity Prediction: Probing Latent Representations in Mouse V1 Digital Twins
Pith reviewed 2026-05-25 02:55 UTC · model grok-4.3
The pith
Digital twins of mouse V1 with comparable neural prediction accuracy still differ substantially in probed latent representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although representational properties such as probe accuracy and hidden-layer eigenspectra covary with neural-response prediction accuracy across architectures, digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning. The work therefore treats multi-level probing of frozen models as a necessary complement to standard prediction evaluation when the models serve as substrates for studying visual computations.
What carries the argument
Multi-level probing of frozen digital twin models: linear decodability from controlled visual probes, latent-unit tuning to canonical features, and hidden-layer population eigenspectra.
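The first probing level, linear decodability, can be illustrated with a minimal sketch. It assumes latent activations have already been extracted from a frozen model; here they are replaced by synthetic Gaussian clusters, and the names `fit_linear_probe` and `probe_accuracy` are hypothetical helpers, not the authors' code. The readout is a closed-form ridge regression onto one-hot labels, which is one common way to build a linear probe and not necessarily the one used in the paper.

```python
import numpy as np

def fit_linear_probe(features, labels, n_classes, ridge=1e-3):
    """Fit a ridge-regularized linear readout onto one-hot class targets."""
    # Append a constant column so the readout has a bias term
    # (the bias is regularized too, which is acceptable for a sketch).
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(n_classes)[labels]
    # Closed-form ridge solution: (X^T X + ridge * I)^-1 X^T Y
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, features, labels):
    """Fraction of samples whose largest readout matches the true label."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))

# Synthetic stand-in for frozen latent activations (assumption: 4 probe
# classes, e.g. 4 orientations, and 32 latent units).
rng = np.random.default_rng(0)
n_classes, n_units, n_per_class = 4, 32, 100
class_means = 2.0 * rng.normal(size=(n_classes, n_units))
labels = np.repeat(np.arange(n_classes), n_per_class)
features = class_means[labels] + rng.normal(size=(labels.size, n_units))

# Hold out a test split so accuracy reflects generalization, not fitting.
order = rng.permutation(labels.size)
train, test = order[:300], order[300:]
W = fit_linear_probe(features[train], labels[train], n_classes)
acc = probe_accuracy(W, features[test], labels[test])
print(f"held-out probe accuracy: {acc:.2f}")
```

Because the model weights are never updated, the held-out accuracy reflects only what is linearly accessible in the latent activations.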
If this is right
- Prediction accuracy correlates with stronger linear decodability of orientation, contrast, and motion across architectures.
- Highly predictive models show flatter hidden-population eigenspectra, aligning their geometry more closely with mouse V1 recordings.
- Architectural differences produce distinct latent representations even when training data and objective remain identical.
- Multi-level probing supplies a framework for evaluating digital twins beyond prediction accuracy alone.
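The notion of a "flatter" eigenspectrum in the list above can be made concrete with a small sketch. It assumes hidden-layer activity is available as a samples-by-units matrix and summarizes the spectrum by the exponent of a power-law fit in log-log space; a smaller exponent means a flatter spectrum. The function names, the fit range, and the synthetic data are illustrative assumptions, and the paper may quantify flatness differently.

```python
import numpy as np

def eigenspectrum(activity):
    """Covariance eigenvalues, sorted descending.

    activity: array of shape (n_samples, n_units).
    """
    centered = activity - activity.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activity.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return np.clip(eigvals, 0.0, None)       # guard against tiny negative values

def powerlaw_exponent(eigvals, first=1, last=None):
    """Exponent alpha of eigvals ~ rank^-alpha over ranks first..last."""
    ranks = np.arange(1, eigvals.size + 1)
    sel = slice(first - 1, last)
    slope, _ = np.polyfit(np.log(ranks[sel]), np.log(eigvals[sel]), 1)
    return -slope

# Synthetic activity with a known spectrum (assumption: independent
# Gaussian units whose variances decay as rank^-alpha).
rng = np.random.default_rng(0)
n_samples, n_units = 5000, 50
unit_ranks = np.arange(1, n_units + 1, dtype=float)

def synthetic_activity(alpha):
    return rng.normal(size=(n_samples, n_units)) * np.sqrt(unit_ranks ** -alpha)

alpha_flat = powerlaw_exponent(eigenspectrum(synthetic_activity(1.0)), last=30)
alpha_steep = powerlaw_exponent(eigenspectrum(synthetic_activity(2.0)), last=30)
print(f"flat: {alpha_flat:.2f}, steep: {alpha_steep:.2f}")
```

The fit is restricted to the leading ranks because the smallest sample eigenvalues are the most distorted by finite sampling.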
Where Pith is reading between the lines
- When choosing a digital twin for stimulus design or hypothesis generation, accuracy scores alone may be insufficient and probe results may need to be inspected separately.
- Divergent latent tuning could cause the same model pair to generate different predictions for stimuli outside the original training distribution.
- Extending the same three-level probe battery to digital twins of other cortical areas could test whether the observed accuracy-representation dissociation is general.
Load-bearing premise
The chosen visual probes for orientation, contrast, and motion together with the tuning and eigenspectrum measures are sufficient to expose differences in latent representations that matter for using the models as experimental systems.
What would settle it
Finding that models with matched prediction accuracy produce statistically identical results on the three probing levels across a new set of architectures or stimuli would falsify the claim that comparable accuracy still permits substantial representational differences.
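One way such a comparison could be set up is a permutation test across training seeds, sketched below. All numbers are made up for illustration and are not results from the paper; the function name `permutation_test` and the choice of a difference-in-means statistic are likewise assumptions.

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in means of a and b."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        # Small tolerance so float rounding does not break ties.
        count += diff >= observed - 1e-12
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Hypothetical per-seed scores for two architectures (NOT from the paper).
pred_a = [0.61, 0.63, 0.62, 0.60, 0.64]   # neural-prediction score, model A
pred_b = [0.62, 0.61, 0.63, 0.60, 0.65]   # neural-prediction score, model B
probe_a = [0.91, 0.93, 0.92, 0.90, 0.94]  # probe accuracy, model A
probe_b = [0.78, 0.80, 0.77, 0.81, 0.79]  # probe accuracy, model B

p_pred = permutation_test(pred_a, pred_b)
p_probe = permutation_test(probe_a, probe_b)
print(f"prediction p={p_pred:.3f}, probe p={p_probe:.3f}")
```

A large p-value on the prediction scores only fails to detect a difference; showing that two models are statistically equivalent would need a dedicated equivalence test with a pre-specified margin.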
Original abstract
Digital twins of sensory cortex serve as powerful response oracles. Although prediction accuracy is the central metric by which these models are evaluated, it provides limited insight into the latent representations that support those predictions. This becomes increasingly important as digital twins are used as in silico experimental systems for stimulus design and hypothesis generation: models with similar prediction accuracy may rely on different latent representations. We address this gap by systematically probing a family of digital twins of mouse V1 trained to predict neural activity from naturalistic videos recorded in freely moving mice. The models share the same training data and neural-prediction objective, but differ in visual-encoder architecture. For each frozen model, we characterize latent representations along three levels: (i) linear decodability from controlled visual probes of orientation, contrast, and motion; (ii) latent-unit tuning to canonical visual features including orientation selectivity, contrast response, and spatial-frequency tuning; and (iii) population geometry of hidden-layer activity. Across architectures, better neural-response prediction correlates with stronger probe accuracy. Additionally, highly predictive models exhibit flatter hidden-population eigenspectra, indicating higher-dimensional representations closer to population-geometry signatures reported in mouse V1. Although these representational properties covary with prediction accuracy across architectures, digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning. These results establish multi-level representational probing as a complement to standard neural-prediction evaluation, providing a framework for understanding digital twins not only as predictors, but also as substrates for studying visual computations.
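The second probing level, latent-unit tuning, can be illustrated for orientation selectivity. The sketch below uses the vector-sum orientation selectivity index, OSI = |Σ r(θ) e^{2iθ}| / Σ r(θ), which is a widely used definition but is an assumption here; the abstract does not say which index the authors compute. The tuning curves are synthetic.

```python
import numpy as np

def orientation_selectivity_index(responses, orientations_deg):
    """Vector-sum OSI in [0, 1] for non-negative responses.

    Orientation is periodic over 180 degrees, so angles are doubled
    before summing on the unit circle.
    """
    r = np.asarray(responses, float)
    theta = np.deg2rad(np.asarray(orientations_deg, float))
    total = r.sum()
    if total <= 0:
        return 0.0  # silent unit: treat as unselective
    return float(np.abs(np.sum(r * np.exp(2j * theta))) / total)

# Eight evenly spaced probe orientations covering 0..180 degrees.
orientations = np.arange(0, 180, 22.5)

# Synthetic tuning curves for three hypothetical latent units.
flat = np.ones_like(orientations)                          # untuned
tuned = np.exp(-0.5 * ((orientations - 90) / 15.0) ** 2)   # Gaussian bump at 90 deg
single = (orientations == 90).astype(float)                # responds to one orientation only

osi_flat = orientation_selectivity_index(flat, orientations)
osi_tuned = orientation_selectivity_index(tuned, orientations)
osi_single = orientation_selectivity_index(single, orientations)
print(f"flat={osi_flat:.2f}, tuned={osi_tuned:.2f}, single={osi_single:.2f}")
```

Applying the same index to every latent unit of two frozen models gives two OSI distributions that can be compared directly, independent of the models' prediction scores.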
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript trains a family of digital-twin models of mouse V1 that share the same naturalistic-video training set and neural-prediction objective but differ in visual-encoder architecture. After freezing the models, the authors characterize their latent representations at three levels: (i) linear decodability of orientation, contrast and motion from controlled probes, (ii) tuning properties of individual latent units to canonical visual features, and (iii) population geometry via hidden-layer eigenspectra. The central empirical claim is that prediction accuracy correlates with probe performance and with flatter eigenspectra (higher-dimensional representations closer to mouse V1), yet models with statistically comparable prediction scores can still differ substantially in probe accuracy and unit tuning; the authors therefore argue that multi-level representational probing is a necessary complement to prediction accuracy when digital twins are used for in-silico stimulus design and hypothesis generation.
Significance. If the reported differences survive rigorous statistical controls, the work supplies a concrete, multi-level protocol for evaluating whether two digital twins that achieve similar prediction accuracy are interchangeable as experimental substrates. The observation that higher-performing models exhibit eigenspectra closer to biological V1 is a useful positive control. The absence of any downstream in-silico task validation, however, leaves open whether the measured differences actually matter for the use-cases invoked in the abstract.
major comments (2)
- [Abstract] Abstract: the claim that 'digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning' is presented without any statistical details, sample sizes, number of architectures, or controls; this information is load-bearing for the central claim that prediction accuracy is insufficient.
- [Abstract] Abstract (and framing throughout): the manuscript invokes the three characterization levels as closing the 'insight gap' for using digital twins as experimental systems, yet provides no test that variation on the chosen probes or eigenspectra produces divergent outcomes on any concrete downstream in-silico task (stimulus optimization, hypothesis generation, etc.).
minor comments (1)
- The abstract would be clearer if it stated the exact number of models/architectures compared and the precise statistical tests underlying the reported correlations.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below. Where the comments identify missing details or limitations, we have revised the manuscript accordingly.
- Referee: [Abstract] Abstract: the claim that 'digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning' is presented without any statistical details, sample sizes, number of architectures, or controls; this information is load-bearing for the central claim that prediction accuracy is insufficient.
  Authors: We agree that the abstract should supply the requested statistical context. The revised abstract now states the number of architectures examined, the number of random seeds per architecture, the statistical tests applied to establish comparable prediction accuracy, and the controls for training data and objective function. These additions make the central claim (that models with statistically indistinguishable prediction scores can still differ in probe performance and unit tuning) self-contained within the abstract.
  Revision: yes
- Referee: [Abstract] Abstract (and framing throughout): the manuscript invokes the three characterization levels as closing the 'insight gap' for using digital twins as experimental systems, yet provides no test that variation on the chosen probes or eigenspectra produces divergent outcomes on any concrete downstream in-silico task (stimulus optimization, hypothesis generation, etc.).
  Authors: The referee correctly notes that the manuscript does not demonstrate that the observed differences in probes or eigenspectra lead to divergent outcomes on any specific downstream in-silico task. This is a genuine limitation of the present study; performing such validation would require new experiments that lie outside the scope of the current work. We have therefore revised the discussion to (i) explicitly acknowledge the absence of downstream-task validation and (ii) articulate how the reported representational differences could, in principle, affect stimulus optimization and hypothesis generation. We maintain that the multi-level probing already demonstrates that prediction accuracy alone is insufficient to guarantee representational equivalence, which is the paper’s primary claim.
  Revision: partial
Circularity Check
No circularity: empirical measurements on independently probed latent representations
The paper trains digital-twin models on a neural-activity prediction objective from naturalistic videos, then measures three separate characterization levels on frozen models: linear decodability of orientation/contrast/motion probes, latent-unit tuning curves, and hidden-layer eigenspectra. These quantities are computed from controlled stimuli or population statistics and are not algebraically or statistically forced by the training loss. No equations, fitted parameters, or self-citations are presented that would reduce any reported correlation or difference to a definitional identity. The observed covariation and residual differences are therefore genuine empirical findings rather than tautological restatements of the input data or training objective.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Models trained to predict neural activity from naturalistic videos capture relevant aspects of V1 computation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: We address this gap by systematically probing a family of convolutional-recurrent digital twins of mouse V1... linear decodability from controlled visual probes of orientation, contrast, and motion; (ii) latent-unit tuning... (iii) population geometry of hidden-layer activity.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: highly predictive models exhibit flatter hidden-population eigenspectra
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.