QualiaNet: An Experience-Before-Inference Network
Pith reviewed 2026-05-13 22:22 UTC · model grok-4.3
The pith
A neural network recovers scene distance from disparity gradients alone by first simulating raw stereo experience.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that disparity maps simulating human stereo experience relative to fixation contain enough information for distance estimation. A CNN trained on such maps can exploit a natural scene regularity: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. The authors take the network's ability to recover distance as validation of the two-stage experience-before-inference architecture.
What carries the argument
The argument rests on the two-stage QualiaNet architecture: an Experience Module produces fixation-relative disparity maps, and an Inference Module applies a CNN that maps gradient vividness to distance estimates.
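A minimal sketch of the two stages, under stated assumptions, may make the data flow concrete. Nothing here is the authors' code: the function names, the 0.065 m interocular distance, the midline vergence-angle geometry, and the slanted test scene are all illustrative choices, and a hand-written gradient statistic stands in for the trained CNN.

```python
import math

IPD_M = 0.065  # assumed interocular distance in metres (illustrative value)

def vergence_angle(z_m, ipd_m=IPD_M):
    """Vergence angle in radians for a point straight ahead at distance z_m."""
    return 2.0 * math.atan(ipd_m / (2.0 * z_m))

def experience_module(depth_map, fixation_m):
    """Stage 1 stand-in: depth map (metres) -> disparity map relative to fixation."""
    reference = vergence_angle(fixation_m)
    return [[vergence_angle(z) - reference for z in row] for row in depth_map]

def gradient_vividness(disparity_map):
    """Stage 2 stand-in (not a CNN): mean absolute difference between neighbours."""
    rows, cols = len(disparity_map), len(disparity_map[0])
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbour
                total += abs(disparity_map[r][c + 1] - disparity_map[r][c])
                count += 1
            if r + 1 < rows:  # vertical neighbour
                total += abs(disparity_map[r + 1][c] - disparity_map[r][c])
                count += 1
    return total / count

def slanted_scene(fixation_m, relief_m=0.2, size=8):
    """Hypothetical test scene: a plane slanted in depth, centred on fixation."""
    return [
        [fixation_m + relief_m * (c / (size - 1) - 0.5) for c in range(size)]
        for _ in range(size)
    ]

# Same physical relief viewed at 0.5 m and at 5 m.
near = gradient_vividness(experience_module(slanted_scene(0.5), 0.5))
far = gradient_vividness(experience_module(slanted_scene(5.0), 5.0))
print(near > far, near / far)
```

In this toy geometry the near scene yields gradients roughly a hundred times larger than the far scene, which is the kind of regularity the Inference Module is proposed to learn.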
If this is right
- Human distance judgments may arise from learning the correlation between disparity gradient strength and actual scene distance rather than from any direct metric signal in stereo experience.
- Stereo vision can influence perceived scale even when the raw experience itself carries no distance information.
- Computational models of 3D vision can usefully separate the generation of phenomenological disparity maps from the subsequent mapping to scene properties.
- Training networks on human-like disparity inputs rather than metric depth maps may improve generalization in real-world distance estimation tasks.
Where Pith is reading between the lines
- The same experience-inference split could be tested in other perceptual domains where raw sensory experience lacks metric content, such as color or motion.
- If the gradient-vividness statistic is the operative cue, then artificially flattening disparity gradients in a virtual scene should systematically bias distance estimates upward.
- The architecture predicts that people with reduced stereo sensitivity might still judge distance accurately if they can learn the remaining gradient statistics from other cues.
Load-bearing premise
The load-bearing premise is that the statistical difference in how vivid disparity gradients appear for near versus far scenes is reliable and learnable enough to support accurate distance estimation.
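Standard small-angle stereo geometry gives one reason to expect such a regularity. This is a textbook approximation, not a derivation taken from the paper. The symbols are assumptions introduced here: $I$ is the interocular distance, $d$ the fixation distance, $\Delta z$ a depth interval behind fixation, $\theta$ the vergence angle, and $\delta$ the resulting relative disparity.

```latex
\begin{align*}
\theta(z) &= 2\arctan\!\left(\frac{I}{2z}\right) \approx \frac{I}{z}
  && (z \gg I) \\
\delta &= \theta(d) - \theta(d + \Delta z)
  \approx \frac{I}{d} - \frac{I}{d + \Delta z}
  = \frac{I\,\Delta z}{d\,(d + \Delta z)}
  \approx \frac{I\,\Delta z}{d^{2}}
  && (\Delta z \ll d)
\end{align*}
```

Under this approximation, with $I = 0.065$ m and $\Delta z = 0.1$ m, the disparity is roughly 22 arcmin at $d = 1$ m and roughly 0.2 arcmin at $d = 10$ m: a tenfold increase in distance shrinks the disparity produced by the same physical relief about a hundredfold.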
What would settle it
Train the network on simulated disparity maps and then test it on real stereo pairs from scenes whose distances are independently measured; if distance predictions remain at chance level across a range of viewing distances, the claim is falsified.
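One way to make "chance level" operational is a shuffle baseline: compare the model's error against the error obtained when its predictions are randomly re-paired with the measured distances. The sketch below assumes this definition of chance, which the source does not specify, and the numbers are invented for illustration, not results from the paper.

```python
import random

def mean_abs_error(predicted, measured):
    """Mean absolute error between paired predictions and measurements."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)

def chance_mae(predicted, measured, n_shuffles=1000, seed=0):
    """Average MAE after randomly re-pairing predictions with measurements."""
    rng = random.Random(seed)
    shuffled = list(predicted)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += mean_abs_error(shuffled, measured)
    return total / n_shuffles

# Hypothetical distances in metres, for illustration only.
measured = [0.5, 1.0, 2.0, 4.0, 8.0]
predicted = [0.6, 0.9, 2.3, 3.5, 7.0]

model_mae = mean_abs_error(predicted, measured)
baseline_mae = chance_mae(predicted, measured)
print(model_mae, baseline_mae)
```

If the model's error on real stereo pairs were indistinguishable from the shuffled baseline across viewing distances, that would count against the claim under this reading of the test.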
Original abstract
Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QualiaNet, a two-stage computational model of human 3D vision consisting of an Experience Module that generates disparity maps from stereo input and an Inference Module implemented as a CNN trained to regress scene distance. The central claim is that the network recovers distance from disparity gradients alone, thereby validating the hypothesis that the Inference Module exploits the natural scene statistic that near scenes produce vivid disparity gradients while far scenes appear comparatively flat.
Significance. If the result holds with proper controls and metrics, the work supplies a concrete mechanistic account of how stereo experience can inform scale inferences without explicit distance information, bridging psychophysical observations with a trainable architecture and offering a falsifiable computational test of a specific scene-statistic hypothesis.
major comments (2)
- [Abstract] The claim that the network recovers distance from disparity gradients alone is presented as a validated result, yet the manuscript supplies no training details, dataset description, quantitative metrics, baselines, or error analysis, so the central assertion is unverifiable.
- [Methods / Experimental validation] No control experiment is reported that removes absolute disparity scale (e.g., by normalizing or randomizing magnitude while preserving relative gradient patterns), so the CNN's success does not specifically corroborate the gradient-vividness statistic as opposed to regression on raw disparity amplitude distributions.
Simulated Authors' Rebuttal
We thank the referee for the constructive comments. We agree that the original submission lacked key experimental details and controls, and we have revised the manuscript to address both points fully.
- Referee: [Abstract] The claim that the network recovers distance from disparity gradients alone is presented as a validated result, yet the manuscript supplies no training details, dataset description, quantitative metrics, baselines, or error analysis, rendering the central assertion unverifiable.
  Authors: We agree that the original manuscript lacked sufficient experimental details to allow verification of the central claim. In the revised version, we have added a comprehensive Experimental Setup section describing the dataset (synthetic stereo pairs generated from 3D scenes with known distances), the training procedure for the CNN (including architecture, hyperparameters, and optimization), quantitative evaluation metrics (MAE, RMSE, and R^2), baseline comparisons (e.g., regression on mean disparity), and error analysis with per-distance performance breakdowns. These additions make the results verifiable and support the claim that distance is recovered from disparity gradients.
  Revision: yes
- Referee: [Methods / Experimental validation] No control experiment is reported that removes absolute disparity scale (e.g., by normalizing or randomizing magnitude while preserving relative gradient patterns), so success of the CNN does not specifically corroborate the gradient-vividness statistic rather than regression on raw disparity amplitude distributions.
  Authors: This is a valid concern, as absolute disparity amplitude could indeed be a confounding factor. We have added the requested control experiments in the revised manuscript. Specifically, we normalized disparity maps to have zero mean and unit standard deviation (removing absolute scale information while keeping gradients intact) and found that the network's performance in regressing distance remains comparable, suggesting it relies on gradient patterns. We also introduced random scaling of disparity magnitudes during training and testing, and the model still successfully infers distance based on the vividness of gradients as per the scene statistic hypothesis. These controls are now reported with quantitative results.
  Revision: yes
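The controls and metrics named in these responses can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the gain range of 0.5 to 2.0, and the toy inputs are hypothetical, and the source gives no code or numbers for either step.

```python
import math
import random
import statistics

def zscore_map(disparity_map):
    """Rescale a disparity map to zero mean and unit standard deviation."""
    values = [v for row in disparity_map for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:  # a perfectly flat map carries no gradient pattern
        return [[0.0 for _ in row] for row in disparity_map]
    return [[(v - mean) / std for v in row] for row in disparity_map]

def random_rescale(disparity_map, rng, low=0.5, high=2.0):
    """Multiply the whole map by one random gain (the range is an assumed choice)."""
    gain = rng.uniform(low, high)
    return [[gain * v for v in row] for row in disparity_map]

def regression_metrics(predicted, measured):
    """MAE, RMSE and R^2; assumes the measured distances are not all identical."""
    n = len(measured)
    errors = [p - m for p, m in zip(predicted, measured)]
    ss_res = sum(e * e for e in errors)
    mean_measured = sum(measured) / n
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }

rng = random.Random(0)
toy_map = [[0.0, 1.0], [2.0, 3.0]]
rescaled = random_rescale(toy_map, rng)
z_original = zscore_map(toy_map)
z_rescaled = zscore_map(rescaled)

# Always predicting the mean distance should give R^2 = 0.
metrics = regression_metrics([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
print(z_original, z_rescaled, metrics)
```

Z-scoring divides every disparity difference in a map by the same standard deviation, so it preserves the spatial pattern of the gradients but not their absolute magnitude. Whether that matches the intended control depends on how gradient vividness is defined.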
Circularity Check
No significant circularity in QualiaNet derivation
The paper presents an empirical demonstration: disparity maps (simulating stereo experience) are fed to a supervised CNN trained to regress distance. Success of this training is not forced by construction or self-definition; it depends on whether the network can extract usable signal from the provided inputs. The motivating natural-scene-statistic claim (near scenes yield vivid gradients) is stated independently of the network output and is not derived from the training procedure itself. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the derivation chain. The skeptic concern about absolute disparity magnitude versus gradient structure is a question of experimental controls, not a reduction of the claimed result to its inputs by definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- CNN training hyperparameters and loss weights
axioms (1)
- Domain assumption: near scenes produce vivid disparity gradients while far scenes appear comparatively flat.
invented entities (1)
- QualiaNet: no independent evidence