Beyond Neural Activity Prediction: Probing Latent Representations in Mouse V1 Digital Twins

Adriano Lima; Marius Schneider; Michael Beyeler; Yuchen Hou

arxiv: 2605.23122 · v2 · pith:DCTR2326new · submitted 2026-05-22 · 🧬 q-bio.NC

Pith reviewed 2026-05-25 02:55 UTC · model grok-4.3

classification 🧬 q-bio.NC

keywords: digital twins · mouse V1 · neural activity prediction · latent representations · visual probes · population eigenspectra · model architectures · orientation selectivity

The pith

Digital twins of mouse V1 with comparable neural prediction accuracy still differ substantially in probed latent representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains multiple digital twins of mouse V1 on the same naturalistic video data and neural prediction task but varies the visual encoder architecture. It then freezes each model and measures its latent representations through linear decodability of orientation, contrast, and motion probes, through tuning curves of individual hidden units, and through the eigenspectrum of population activity. Across architectures, higher prediction accuracy tracks with better probe decoding and flatter eigenspectra, yet models that reach similar accuracy levels still separate on probe performance and unit tuning. This matters because these models are used as in silico experimental systems, where the specific latent features determine what visual computations they can support or simulate.

Core claim

Although representational properties such as probe accuracy and hidden-layer eigenspectra covary with neural-response prediction accuracy across architectures, digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning. The work therefore treats multi-level probing of frozen models as a necessary complement to standard prediction evaluation when the models serve as substrates for studying visual computations.

What carries the argument

Multi-level probing of frozen digital twin models: linear decodability from controlled visual probes, latent-unit tuning to canonical features, and hidden-layer population eigenspectra.
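
The first probing level, linear decodability from a frozen model, can be illustrated with a minimal sketch. This is not the authors' code: the page does not specify the readout, so the ridge-regularized least-squares classifier, the function name `linear_probe_accuracy`, and the synthetic "latents" below are all assumptions made for illustration. The point is only that the readout is trained on top of fixed features, so its accuracy reflects what is linearly accessible in the representation.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, ridge=1e-3):
    """Fit a ridge-regularized linear readout on frozen features and
    return classification accuracy on held-out data.
    (Hypothetical helper; the paper's actual readout is not given here.)"""
    classes = np.unique(train_y)
    # One-hot targets, one column per class.
    targets = (train_y[:, None] == classes[None, :]).astype(float)
    # Standardize using training statistics only, then append a bias column.
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) + 1e-8
    a = np.hstack([(train_x - mean) / std, np.ones((len(train_x), 1))])
    b = np.hstack([(test_x - mean) / std, np.ones((len(test_x), 1))])
    # Closed-form ridge solution: W = (A^T A + ridge * I)^-1 A^T T.
    w = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ targets)
    pred = classes[np.argmax(b @ w, axis=1)]
    return float(np.mean(pred == test_y))

# Synthetic stand-in for frozen latent activations (assumption, not real data):
# a binary "orientation" label shifts the mean of one latent dimension.
rng = np.random.default_rng(0)
n, d = 400, 32
labels = rng.integers(0, 2, size=n)
latents = rng.normal(size=(n, d))
latents[:, 0] += 4.0 * labels

acc = linear_probe_accuracy(latents[:300], labels[:300],
                            latents[300:], labels[300:])
```

Because the encoder is never updated, two models can be compared by running the same probe stimuli through each and comparing held-out readout accuracy.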

If this is right

Prediction accuracy correlates with stronger linear decodability of orientation, contrast, and motion across architectures.
Highly predictive models show flatter hidden-population eigenspectra, aligning their geometry more closely with mouse V1 recordings.
Architectural differences produce distinct latent representations even when training data and objective remain identical.
Multi-level probing supplies a framework for evaluating digital twins beyond prediction accuracy alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

When choosing a digital twin for stimulus design or hypothesis generation, accuracy scores alone may be insufficient and probe results may need to be inspected separately.
Divergent latent tuning could cause two accuracy-matched models to generate different predictions for stimuli outside the original training distribution.
Extending the same three-level probe battery to digital twins of other cortical areas could test whether the observed accuracy-representation dissociation is general.

Load-bearing premise

The chosen visual probes for orientation, contrast, and motion together with the tuning and eigenspectrum measures are sufficient to expose differences in latent representations that matter for using the models as experimental systems.

What would settle it

Finding that models with matched prediction accuracy produce statistically identical results on the three probing levels across a new set of architectures or stimuli would falsify the claim that comparable accuracy still permits substantial representational differences.
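
One way such a check could be set up is a permutation test across random seeds: prediction accuracy should not separate the two models, while a probe metric might. The sketch below is an illustration under stated assumptions, not the paper's analysis. The function `permutation_test` is a hypothetical helper, and the per-seed scores are invented placeholder numbers, not results from the paper.

```python
import numpy as np

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in means of two samples.
    Returns a Monte Carlo p-value (with the usual +1 correction)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        # Small tolerance guards against floating-point ties.
        count += int(diff >= observed - 1e-12)
    return (count + 1) / (n_perm + 1)

# Placeholder per-seed scores for two hypothetical architectures (NOT real data).
pred_a = [0.31, 0.32, 0.30, 0.32, 0.33]   # neural prediction accuracy, model A
pred_b = [0.32, 0.31, 0.31, 0.30, 0.32]   # neural prediction accuracy, model B
probe_a = [0.90, 0.91, 0.89, 0.92, 0.90]  # probe accuracy, model A
probe_b = [0.70, 0.72, 0.69, 0.71, 0.73]  # probe accuracy, model B

p_pred = permutation_test(pred_a, pred_b)     # large p: accuracy not separable
p_probe = permutation_test(probe_a, probe_b)  # small p: probes separate
```

A non-significant difference is not proof of equivalence; a stricter version of this check would use an equivalence test with a pre-specified margin.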

Figures

Figures reproduced from arXiv: 2605.23122 by Adriano Lima, Marius Schneider, Michael Beyeler, Yuchen Hou.

**Figure 2.** Controlled visual probes reveal functional access in V1 digital twins. (A) Example stimuli for the three controlled visual probes: orientation discrimination, dynamic contrast detection, and RDK motion-direction discrimination. (B) Linear readout performance across visual-encoder architectures. Top: representative task-performance curves for each probe. Bottom: architecture-level relationships between prob…

**Figure 3.** Latent-unit tuning provides interpretable axes for comparing V1 digital twins. (A) We probed latent units with parametric visual stimuli varying canonical stimulus dimensions, including orientation, contrast, spatial frequency, and phase. Example latent units show structured responses to orientation, contrast, and spatial frequency, from which tuning metrics such as global orientation selectivity (gOSI), c…

**Figure 4.** Population geometry of hidden representations covaries with neural prediction performance. (A) Example eigenspectra of latent-layer activity for three visual-encoder architectures. For each model, hidden activations were collected across naturalistic video stimuli, decomposed with PCA, and summarized by the eigenspectrum of explained variance. Solid lines show power-law fits, $\lambda_k \propto k^{-\alpha}$, over the fitted ran…
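
The procedure described in the Figure 4 caption (collect hidden activations, run PCA, fit a power law to the explained-variance spectrum) can be sketched as follows. This is a minimal illustration, assuming a simple least-squares fit in log-log space over a rank range chosen by the caller; the caption is truncated here, so the authors' fitted range and fitting method are unknown. The function names and the synthetic activations are hypothetical.

```python
import numpy as np

def eigenspectrum(acts):
    """Normalized PCA eigenvalues (descending) of a (samples, units) matrix."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Squared singular values of centered data are proportional to the
    # eigenvalues of the covariance matrix.
    s = np.linalg.svd(centered, compute_uv=False)
    eig = s ** 2 / (acts.shape[0] - 1)
    return eig / eig.sum()

def powerlaw_alpha(eig, k_min=1, k_max=None):
    """Estimate alpha in lambda_k ∝ k^(-alpha) by linear regression of
    log(lambda_k) on log(k) over ranks k_min..k_max (1-indexed, inclusive)."""
    k_max = len(eig) if k_max is None else k_max
    k = np.arange(k_min, k_max + 1)
    slope, _ = np.polyfit(np.log(k), np.log(eig[k_min - 1:k_max]), 1)
    return -slope

# Synthetic activations with a known spectrum (assumption, not model data):
# unit k has variance k^(-alpha), so the fit should recover alpha.
rng = np.random.default_rng(0)
n_samples, n_units = 20000, 40
ranks = np.arange(1, n_units + 1)

def synthetic_acts(alpha):
    return rng.normal(size=(n_samples, n_units)) * ranks ** (-alpha / 2.0)

alpha_flat = powerlaw_alpha(eigenspectrum(synthetic_acts(1.0)))
alpha_steep = powerlaw_alpha(eigenspectrum(synthetic_acts(2.0)))
```

A smaller fitted exponent corresponds to a flatter spectrum, meaning variance is spread over more dimensions. The estimate is sensitive to the fitted rank range, so any comparison across models should hold that range fixed.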

Original abstract

Digital twins of sensory cortex serve as powerful response oracles. Although prediction accuracy is the central metric by which these models are evaluated, it provides limited insight into the latent representations that support those predictions. This becomes increasingly important as digital twins are used as in silico experimental systems for stimulus design and hypothesis generation: models with similar prediction accuracy may rely on different latent representations. We address this gap by systematically probing a family of digital twins of mouse V1 trained to predict neural activity from naturalistic videos recorded in freely moving mice. The models share the same training data and neural-prediction objective, but differ in visual-encoder architecture. For each frozen model, we characterize latent representations along three levels: (i) linear decodability from controlled visual probes of orientation, contrast, and motion; (ii) latent-unit tuning to canonical visual features including orientation selectivity, contrast response, spatial-frequency tuning; and (iii) population geometry of hidden-layer activity. Across architectures, better neural-response prediction correlates with stronger probe accuracy. Additionally, highly predictive models exhibit flatter hidden-population eigenspectra, indicating higher-dimensional representations closer to population-geometry signatures reported in mouse V1. Although these representational properties covary with prediction accuracy across architectures, digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning. These results establish multi-level representational probing as a complement to standard neural-prediction evaluation, providing a framework for understanding digital twins not only as predictors, but also as substrates for studying visual computations.
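
For the second level, latent-unit tuning, the Figure 3 caption names global orientation selectivity (gOSI) as one metric. A commonly used definition is the length of the response-weighted vector sum in doubled-angle space, normalized by the summed response, which equals one minus the circular variance. The sketch below assumes that definition; the paper's exact formula is not given on this page. Clipping negative activations to zero is a further assumption, since latent units, unlike firing rates, can take negative values.

```python
import numpy as np

def global_osi(responses, orientations_deg):
    """gOSI = |sum_k r_k * exp(2i * theta_k)| / sum_k r_k,
    for orientations theta_k in [0, 180) degrees.
    Negative responses are clipped to zero (assumption for latent units)."""
    r = np.clip(np.asarray(responses, dtype=float), 0.0, None)
    theta = np.deg2rad(np.asarray(orientations_deg, dtype=float))
    total = r.sum()
    if total == 0:
        return 0.0  # no response at any orientation: treat as untuned
    return float(np.abs(np.sum(r * np.exp(2j * theta))) / total)

# Eight evenly spaced orientations, as in a typical grating probe set.
oris = np.arange(0, 180, 22.5)

flat_unit = np.ones(8)                       # responds equally everywhere
sharp_unit = np.zeros(8); sharp_unit[2] = 1  # responds to one orientation only

gosi_flat = global_osi(flat_unit, oris)
gosi_sharp = global_osi(sharp_unit, oris)
```

The doubled angle makes orientations 180 degrees apart equivalent, which is what distinguishes orientation selectivity from direction selectivity. The index ranges from 0 for an untuned unit to 1 for a unit that responds to a single orientation.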

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

V1 digital twins with matched prediction accuracy can still differ on linear probes and tuning, but the work stops short of showing those differences matter for stimulus design.

The main thing to know is that this paper demonstrates V1 models trained on the same data can match on neural prediction yet diverge on how their latents handle orientation, contrast, and motion when probed linearly or via tuning curves. The population eigenspectra also vary, with better predictors showing flatter spectra closer to biology. That separation is the new result and it is cleanly shown by holding training data and objective fixed while swapping encoders.

The multi-level approach is a straightforward way to surface those differences without extra training. It does the field a service by treating accuracy as necessary but not sufficient for using these models as experimental stand-ins. The correlation between accuracy and probe strength is reported across architectures, and the within-accuracy variation is the part that matters for the claim. The population geometry measure is a good addition because it links back to existing V1 recordings.

The soft spot is the missing link to the stated use case. The abstract frames the probes as addressing the gap for stimulus design and hypothesis generation, yet there is no test showing that models differing on these measures actually produce different outcomes on any in silico experiment. Without that, it is unclear whether the observed variation is the kind that would change conclusions in practice. The abstract also gives no statistical details or sample sizes, so the size of the differences is hard to judge from what is here.

This is for computational neuroscientists who build or evaluate sensory digital twins and want more than accuracy numbers. A reader focused on model interpretability in systems neuroscience would get concrete value from the comparison setup. It is worth sending to peer review because the core observation is new, the methods are replicable, and the gap it targets is real; the authors can address the downstream validation and stats in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript trains a family of digital-twin models of mouse V1 that share the same naturalistic-video training set and neural-prediction objective but differ in visual-encoder architecture. After freezing the models, the authors characterize their latent representations at three levels: (i) linear decodability of orientation, contrast and motion from controlled probes, (ii) tuning properties of individual latent units to canonical visual features, and (iii) population geometry via hidden-layer eigenspectra. The central empirical claim is that prediction accuracy correlates with probe performance and with flatter eigenspectra (higher-dimensional representations closer to mouse V1), yet models with statistically comparable prediction scores can still differ substantially in probe accuracy and unit tuning; the authors therefore argue that multi-level representational probing is a necessary complement to prediction accuracy when digital twins are used for in-silico stimulus design and hypothesis generation.

Significance. If the reported differences survive rigorous statistical controls, the work supplies a concrete, multi-level protocol for evaluating whether two digital twins that achieve similar prediction accuracy are interchangeable as experimental substrates. The observation that higher-performing models exhibit eigenspectra closer to biological V1 is a useful positive control. The absence of any downstream in-silico task validation, however, leaves open whether the measured differences actually matter for the use-cases invoked in the abstract.

major comments (2)

[Abstract] The claim that 'digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning' is presented without any statistical details, sample sizes, number of architectures, or controls; this information is load-bearing for the central claim that prediction accuracy is insufficient.
[Abstract, and framing throughout] The manuscript invokes the three characterization levels as closing the 'insight gap' for using digital twins as experimental systems, yet provides no test that variation on the chosen probes or eigenspectra produces divergent outcomes on any concrete downstream in-silico task (stimulus optimization, hypothesis generation, etc.).

minor comments (1)

The abstract would be clearer if it stated the exact number of models/architectures compared and the precise statistical tests underlying the reported correlations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below. Where the comments identify missing details or limitations, we have revised the manuscript accordingly.

Referee: [Abstract] The claim that 'digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning' is presented without any statistical details, sample sizes, number of architectures, or controls; this information is load-bearing for the central claim that prediction accuracy is insufficient.

Authors: We agree that the abstract should supply the requested statistical context. The revised abstract now states the number of architectures examined, the number of random seeds per architecture, the statistical tests applied to establish comparable prediction accuracy, and the controls for training data and objective function. These additions make the central claim (that models with statistically indistinguishable prediction scores can still differ in probe performance and unit tuning) self-contained within the abstract.
Revision: yes

Referee: [Abstract, and framing throughout] The manuscript invokes the three characterization levels as closing the 'insight gap' for using digital twins as experimental systems, yet provides no test that variation on the chosen probes or eigenspectra produces divergent outcomes on any concrete downstream in-silico task (stimulus optimization, hypothesis generation, etc.).

Authors: The referee correctly notes that the manuscript does not demonstrate that the observed differences in probes or eigenspectra lead to divergent outcomes on any specific downstream in-silico task. This is a genuine limitation of the present study; performing such validation would require new experiments that lie outside the scope of the current work. We have therefore revised the discussion to (i) explicitly acknowledge the absence of downstream-task validation and (ii) articulate how the reported representational differences could, in principle, affect stimulus optimization and hypothesis generation. We maintain that the multi-level probing already demonstrates that prediction accuracy alone is insufficient to guarantee representational equivalence, which is the paper's primary claim.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements on independently probed latent representations

The paper trains digital-twin models on a neural-activity prediction objective from naturalistic videos, then measures three separate characterization levels on frozen models: linear decodability of orientation/contrast/motion probes, latent-unit tuning curves, and hidden-layer eigenspectra. These quantities are computed from controlled stimuli or population statistics and are not algebraically or statistically forced by the training loss. No equations, fitted parameters, or self-citations are presented that would reduce any reported correlation or difference to a definitional identity. The observed covariation and residual differences are therefore genuine empirical findings rather than tautological restatements of the input data or training objective.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters or invented entities. It relies on the standard domain assumption that models trained to predict neural activity capture relevant aspects of V1 computation.

axioms (1)

Domain assumption: Models trained to predict neural activity from naturalistic videos capture relevant aspects of V1 computation. This assumption underpins interpreting the probing results as informative about visual computations rather than artifacts of the training objective.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We address this gap by systematically probing a family of convolutional-recurrent digital twins of mouse V1... linear decodability from controlled visual probes of orientation, contrast, and motion; (ii) latent-unit tuning... (iii) population geometry of hidden-layer activity."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "highly predictive models exhibit flatter hidden-population eigenspectra"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.