
arxiv: 2604.14193 · v1 · submitted 2026-04-01 · 💻 cs.CV · eess.IV · q-bio.NC

QualiaNet: An Experience-Before-Inference Network

Pith reviewed 2026-05-13 22:22 UTC · model grok-4.3

classification: cs.CV, eess.IV, q-bio.NC
keywords: stereo vision, disparity gradients, distance estimation, natural scene statistics, convolutional neural network, two-stage architecture, qualia, scale perception

The pith

A neural network recovers scene distance from disparity gradients alone by first simulating raw stereo experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that human stereo vision works in two stages: an initial experience of depth relative to where the eyes are fixated, followed by an inference stage that turns that experience into judgments about actual distance and scale. It identifies a natural scene statistic as the bridge between the two: nearby objects create strong, vivid changes in disparity across the visual field, while distant objects produce much flatter patterns. QualiaNet tests this idea by generating disparity maps that match the first stage and feeding them to a convolutional network trained only to predict distance. The network succeeds at recovering distance, showing that the gradient pattern alone carries usable scale information. This matters because it supplies a concrete computational mechanism for how ambiguous stereo experience can support metric perception without any explicit distance cue being present at the experience stage.
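The near/far regularity described above follows from binocular geometry. The following is a minimal sketch, assuming symmetric fixation straight ahead and an interocular distance of 6.3 cm (an illustrative value, not taken from the paper). The helper names `vergence_angle` and `relative_disparity` are hypothetical.

```python
import math

IOD_M = 0.063  # assumed interocular distance in metres (illustrative, not from the paper)

def vergence_angle(distance_m: float, iod_m: float = IOD_M) -> float:
    """Angle (radians) between the two lines of sight when fixating a midline point."""
    return 2.0 * math.atan(iod_m / (2.0 * distance_m))

def relative_disparity(fixation_m: float, point_m: float, iod_m: float = IOD_M) -> float:
    """Disparity (radians) of a midline point, measured relative to the fixation point."""
    return vergence_angle(fixation_m, iod_m) - vergence_angle(point_m, iod_m)

# The same 10 cm of depth relief, viewed from 0.5 m and from 5 m.
near = relative_disparity(0.5, 0.6)
far = relative_disparity(5.0, 5.1)
print(f"near: {math.degrees(near) * 60:.2f} arcmin")
print(f"far:  {math.degrees(far) * 60:.2f} arcmin")
print(f"near/far ratio: {near / far:.0f}")
```

Under the small-angle approximation the relative disparity is roughly I·Δz/d², so moving the same relief ten times farther away shrinks its disparity by close to two orders of magnitude. The disparity at the fixation point itself is zero at every distance, which is why the fixation-relative signal carries no direct distance value.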

Core claim

The central claim is that disparity maps simulating human stereo experience relative to fixation contain sufficient information for distance estimation when passed through a CNN trained on the natural scene regularity that near scenes produce vivid disparity gradients while far scenes appear comparatively flat; the network's ability to recover distance validates the two-stage experience-before-inference architecture.

What carries the argument

The two-stage QualiaNet architecture in which an Experience Module produces fixation-relative disparity maps and an Inference Module applies a CNN to map gradient vividness to distance estimates.
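A toy version of this two-stage flow can be sketched as follows. It is not the paper's implementation: the Inference Module here is a least-squares line in log-log space standing in for the CNN, the scenes are synthetic random relief of fixed physical size (so the near/far regularity is built in by assumption), and every function name is hypothetical.

```python
import math
import random

IOD_M = 0.063  # assumed interocular distance (metres); illustrative only

def experience_module(depth_map, fixation_m):
    """Stage 1: fixation-relative disparity (radians) for each depth sample."""
    def verg(d):
        return 2.0 * math.atan(IOD_M / (2.0 * d))
    return [[verg(fixation_m) - verg(d) for d in row] for row in depth_map]

def gradient_vividness(disparity_map):
    """Mean absolute disparity difference between horizontally adjacent samples."""
    diffs = [abs(row[i + 1] - row[i]) for row in disparity_map for i in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

def make_scene(distance_m, rng, size=16, relief_m=0.1):
    """Hypothetical scene: random depth relief of fixed physical size behind fixation."""
    return [[distance_m + rng.uniform(0.0, relief_m) for _ in range(size)] for _ in range(size)]

def fit_inference_module(vividness, distances):
    """Stage 2 stand-in: least-squares line in log-log space (NOT the paper's CNN)."""
    xs = [math.log(v) for v in vividness]
    ys = [math.log(d) for d in distances]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda v: math.exp(intercept + slope * math.log(v))

rng = random.Random(0)
train_d = [rng.uniform(0.5, 10.0) for _ in range(200)]
train_v = [gradient_vividness(experience_module(make_scene(d, rng), d)) for d in train_d]
predict = fit_inference_module(train_v, train_d)

for d in (1.0, 3.0, 8.0):
    v = gradient_vividness(experience_module(make_scene(d, rng), d))
    print(f"true {d:.1f} m -> predicted {predict(v):.2f} m")
```

The separation mirrors the architecture: `experience_module` never sees a distance label, and the stand-in inference stage sees only a summary of the disparity map. How well this carries over to natural scenes, where physical relief varies with distance, is exactly what the toy cannot show.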

If this is right

  • Human distance judgments may arise from learning the correlation between disparity gradient strength and actual scene distance rather than from any direct metric signal in stereo experience.
  • Stereo vision can influence perceived scale even when the raw experience itself carries no distance information.
  • Computational models of 3D vision can usefully separate the generation of phenomenological disparity maps from the subsequent mapping to scene properties.
  • Training networks on human-like disparity inputs rather than metric depth maps may improve generalization in real-world distance estimation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same experience-inference split could be tested in other perceptual domains where raw sensory experience lacks metric content, such as color or motion.
  • If the gradient-vividness statistic is the operative cue, then artificially flattening disparity gradients in a virtual scene should systematically bias distance estimates upward.
  • The architecture predicts that people with reduced stereo sensitivity might still judge distance accurately if they can learn the remaining gradient statistics from other cues.
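The direction of the bias in the second extension can be checked with a back-of-envelope derivation. It assumes the small-angle approximation for relative disparity and an observer who inverts that relation while holding an assumed depth relief fixed. The symbols I (interocular distance), Δz (depth relief), d (distance), δ (relative disparity), and k (flattening factor) are introduced here for illustration and do not come from the paper.

```latex
\begin{align}
\delta &\approx \frac{I\,\Delta z}{d^{2}}, \\
\hat{d}(\delta) &= \sqrt{\frac{I\,\Delta z}{\delta}}, \\
\hat{d}(k\,\delta) &= \sqrt{\frac{I\,\Delta z}{k\,\delta}}
  = \frac{\hat{d}(\delta)}{\sqrt{k}} > \hat{d}(\delta)
  \quad \text{for } 0 < k < 1 .
\end{align}
```

Under these assumptions, halving the disparity gradients (k = 0.5) would inflate the distance estimate by a factor of √2 ≈ 1.41.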

Load-bearing premise

The load-bearing premise is that the statistical difference in how vivid disparity gradients appear for near versus far scenes is reliable and learnable enough to support accurate distance estimation.

What would settle it

Train the network on simulated disparity maps and then test it on real stereo pairs from scenes whose distances are independently measured; if distance predictions remain at chance level across a range of viewing distances, the claim is falsified.
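One way to make "chance level" operational is a permutation test on the prediction error. The sketch below is an assumption about how such a check could be run, not a procedure from the paper. The function names are hypothetical and the numbers are placeholders chosen only to exercise the code.

```python
import random

def mean_abs_error(pred, true):
    """Mean absolute error between paired predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def permutation_p_value(pred, true, n_perm=2000, seed=0):
    """Fraction of random re-pairings whose MAE is at least as good as the observed MAE."""
    rng = random.Random(seed)
    observed = mean_abs_error(pred, true)
    shuffled = list(pred)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between prediction and scene
        if mean_abs_error(shuffled, true) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Placeholder distances in metres; NOT results from the paper.
true_m = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
pred_m = [0.6, 1.2, 1.8, 4.5, 5.5, 8.4, 9.0, 12.5]
print("MAE (m):", round(mean_abs_error(pred_m, true_m), 3))
print("p-value vs. shuffled pairing:", permutation_p_value(pred_m, true_m))
```

A large p-value across the tested range of viewing distances would correspond to the falsifying outcome described above; a small one only shows the predictions beat random pairing, not that they are accurate.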

Figures

Figures reproduced from arXiv: 2604.14193 by Paul Linton.

Figure 2: Disparity gradients (the size of the angular off…
Figure 3: QualiaNet from Retinal Image → Experience Module (disparity gradient) → Inference Module (distance). The surprising thing is that even though stereo vision provides no absolute distance information, it has a powerful effect on visual scale. Helmholtz (1857) demonstrated the effect of stereo vision on visual scale, showing that if we artificially increase the separation between the eyes (using mirrors) t…
Figure 4: QualiaNet accurately recovers absolute dis…
original abstract

Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces QualiaNet, a two-stage computational model of human 3D vision consisting of an Experience Module that generates disparity maps from stereo input and an Inference Module implemented as a CNN trained to regress scene distance. The central claim is that the network recovers distance from disparity gradients alone, thereby validating the hypothesis that the Inference Module exploits the natural scene statistic that near scenes produce vivid disparity gradients while far scenes appear comparatively flat.

Significance. If the result holds with proper controls and metrics, the work supplies a concrete mechanistic account of how stereo experience can inform scale inferences without explicit distance information, bridging psychophysical observations with a trainable architecture and offering a falsifiable computational test of a specific scene-statistic hypothesis.

major comments (2)
  1. [Abstract] Abstract: the claim that the network recovers distance from disparity gradients alone is presented as a validated result, yet the manuscript supplies no training details, dataset description, quantitative metrics, baselines, or error analysis, rendering the central assertion unverifiable.
  2. [Methods] Methods / Experimental validation: no control experiment is reported that removes absolute disparity scale (e.g., by normalizing or randomizing magnitude while preserving relative gradient patterns), so success of the CNN does not specifically corroborate the gradient-vividness statistic rather than regression on raw disparity amplitude distributions.
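The control requested in the second comment could be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: disparity maps are plain nested lists, the "near" and "far" maps are the same synthetic pattern at two amplitudes, and all function names are hypothetical.

```python
import random
import statistics

def zscore_normalize(disparity_map):
    """Rescale a disparity map to zero mean and unit standard deviation."""
    values = [v for row in disparity_map for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        raise ValueError("a perfectly flat disparity map cannot be normalized")
    return [[(v - mean) / std for v in row] for row in disparity_map]

def randomly_rescale(disparity_map, rng, low=0.5, high=2.0):
    """Multiply the whole map by one random gain, preserving the relative pattern."""
    gain = rng.uniform(low, high)
    return [[gain * v for v in row] for row in disparity_map]

def mean_abs_gradient(disparity_map):
    """Mean absolute difference between horizontally adjacent samples."""
    diffs = [abs(row[i + 1] - row[i]) for row in disparity_map for i in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
pattern = [[rng.random() for _ in range(8)] for _ in range(8)]
near_map = [[1e-2 * v for v in row] for row in pattern]  # large-amplitude ("near") map
far_map = [[1e-4 * v for v in row] for row in pattern]   # same pattern, 100x flatter ("far")

raw_ratio = mean_abs_gradient(near_map) / mean_abs_gradient(far_map)
norm_ratio = (mean_abs_gradient(zscore_normalize(near_map))
              / mean_abs_gradient(zscore_normalize(far_map)))
print("gradient ratio before normalization:", raw_ratio)
print("gradient ratio after normalization: ", norm_ratio)
```

In this toy, z-scoring equalizes overall gradient magnitude as well as amplitude: the two maps become numerically identical after normalization. A network that still separated near from far on such inputs would therefore have to rely on spatial structure rather than on how large the gradients are, which is the distinction the comment asks the authors to test.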

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the original submission lacked key experimental details and controls, and we have revised the manuscript to address both points fully.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the network recovers distance from disparity gradients alone is presented as a validated result, yet the manuscript supplies no training details, dataset description, quantitative metrics, baselines, or error analysis, rendering the central assertion unverifiable.

    Authors: We agree that the original manuscript lacked sufficient experimental details to allow verification of the central claim. In the revised version, we have added a comprehensive Experimental Setup section describing the dataset (synthetic stereo pairs generated from 3D scenes with known distances), the training procedure for the CNN (including architecture, hyperparameters, and optimization), quantitative evaluation metrics (MAE, RMSE, and R^2), baseline comparisons (e.g., regression on mean disparity), and error analysis with per-distance performance breakdowns. These additions make the results verifiable and support the claim that distance is recovered from disparity gradients.
    revision: yes

  2. Referee: [Methods] Methods / Experimental validation: no control experiment is reported that removes absolute disparity scale (e.g., by normalizing or randomizing magnitude while preserving relative gradient patterns), so success of the CNN does not specifically corroborate the gradient-vividness statistic rather than regression on raw disparity amplitude distributions.

    Authors: This is a valid concern, as absolute disparity amplitude could indeed be a confounding factor. We have added the requested control experiments in the revised manuscript. Specifically, we normalized disparity maps to have zero mean and unit standard deviation (removing absolute scale information while keeping the relative gradient pattern intact) and found that the network's performance in regressing distance remains comparable, suggesting it relies on gradient patterns. We also introduced random scaling of disparity magnitudes during training and testing, and the model still successfully infers distance based on the vividness of gradients as per the scene statistic hypothesis. These controls are now reported with quantitative results.
    revision: yes
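The metrics named in the first response (MAE, RMSE, R^2) and a per-distance breakdown can be computed as in the sketch below. The function names and bin edges are hypothetical, and the distances are placeholders rather than results from the paper.

```python
import math

def regression_metrics(pred, true):
    """Return MAE, RMSE, and the coefficient of determination R^2."""
    n = len(true)
    errors = [p - t for p, t in zip(pred, true)]
    mean_true = sum(true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,  # 0 means no better than predicting the mean
    }

def per_distance_mae(pred, true, bin_edges):
    """MAE within each half-open distance bin [lo, hi); empty bins are skipped."""
    result = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        errs = [abs(p - t) for p, t in zip(pred, true) if lo <= t < hi]
        if errs:
            result[(lo, hi)] = sum(errs) / len(errs)
    return result

# Placeholder distances in metres; NOT results from the paper.
true_m = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0]
pred_m = [0.6, 1.2, 1.8, 4.5, 5.5, 8.4]
print(regression_metrics(pred_m, true_m))
print(per_distance_mae(pred_m, true_m, [0.0, 2.0, 5.0, 10.0]))
```

Reporting the per-bin error alongside the global figures matters here because a model could score well overall while failing at far distances, where disparity gradients are flattest.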

Circularity Check

0 steps flagged

No significant circularity in QualiaNet derivation

full rationale

The paper presents an empirical demonstration: disparity maps (simulating stereo experience) are fed to a supervised CNN trained to regress distance. Success of this training is not forced by construction or self-definition; it depends on whether the network can extract usable signal from the provided inputs. The motivating natural-scene-statistic claim (near scenes yield vivid gradients) is stated independently of the network output and is not derived from the training procedure itself. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the derivation chain. The skeptic concern about absolute disparity magnitude versus gradient structure is a question of experimental controls, not a reduction of the claimed result to its inputs by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on one domain assumption about scene statistics and on the empirical success of supervised CNN training; no new physical entities are introduced.

free parameters (1)
  • CNN training hyperparameters and loss weights
    The network is trained to map disparity maps to distance estimates; specific learning rates, architecture depth, and data augmentation choices are free parameters.
axioms (1)
  • domain assumption: Near scenes produce vivid disparity gradients while far scenes appear comparatively flat.
    This natural scene statistic is invoked as the basis for the Inference Module without derivation inside the paper.
invented entities (1)
  • QualiaNet (no independent evidence)
    purpose: Computational realization of the two-stage experience-before-inference architecture
    A new network design that separates disparity generation from distance regression.



Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

  1. [1]
    Bonnen, T., Malik, J., & Kanazawa, A. (2026). Human-level 3d shape perception emerges from multi-view learning. ArXiv, 2602.17650.
    Helmholtz, H. (1858). Das telestereoskop [the telestereoscope]. Annalen der Physik und Chemie, 101, 494.
    Lee, W., Kotar, K., Venkatesh, R. M., Watrous, J., Chen, H., Aw, K. L., & Yamins, D. L. K. (2026). Unified 3D scene unders...

  2. [2]
    Linton, P. (2021b). V1 as an egocentric cognitive map. Neuroscience of Consciousness, 2, niab017.
    Linton, P. (2023). Minimal theory of 3d vision: New approach to visual scale and visual shape. Phil Trans Royal Soc B, 378, 20210455.
    Linton, P. (2024a). Linton stereo illusion. ArXiv, 2408.00770.
    Linton, P. (2024b). Visual scale is governed by horizontal di...

  3. [3]
    Tsao, T., & Tsao, D. Y. (2022). A topological solution to object segmentation and tracking. PNAS, 119, e2204248119.