Perceptual Inductive Bias Is What You Need Before Contrastive Learning
Pith reviewed 2026-05-19 10:40 UTC · model grok-4.3
The pith
Prepending a boundary and surface pretraining stage before contrastive learning doubles convergence speed on ResNet18 and strengthens downstream visual representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Leveraging Marr's multi-stage theory by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability.
What carries the argument
An explicit pretraining stage that first builds boundary and surface representations from early visual perceptual constructs before any contrastive semantic training begins.
If this is right
- The two-stage approach yields 2x faster convergence during contrastive training on ResNet18.
- Final representations improve on semantic segmentation, depth estimation, and object recognition tasks.
- Models exhibit greater robustness and out-of-distribution capability compared with direct contrastive learning.
- Overall training time decreases because the perceptual pretraining supplies useful inductive biases from human vision.
- The method keeps the standard contrastive objective unchanged while adding only an initial perceptual stage.
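The two-stage schedule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `stage_for_epoch` and `info_nce`, the epoch counts, and the temperature are hypothetical, and InfoNCE stands in for whichever contrastive objective the paper actually uses.

```python
import math

def stage_for_epoch(epoch, perceptual_epochs):
    """Return which objective is active at a given 0-indexed epoch."""
    return "perceptual" if epoch < perceptual_epochs else "contrastive"

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query; inputs are assumed L2-normalised vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # The positive pair goes first, followed by all negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical schedule: 2 perceptual epochs, then contrastive training.
schedule = [stage_for_epoch(e, perceptual_epochs=2) for e in range(5)]
```

The point of the sketch is that only the active objective changes between stages; the contrastive loss itself is left untouched, which matches the last bullet above.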
Where Pith is reading between the lines
- The same perceptual pretraining stage could be tested as a drop-in addition to other self-supervised frameworks beyond contrastive learning.
- If the boundary and surface stage generalizes across datasets, it might reduce reliance on large-scale pretraining data for vision models.
- Explicit multi-stage pipelines inspired by Marr could be explored for video or 3D vision tasks where temporal or geometric structure matters.
- The approach opens the possibility of measuring how closely learned representations align with human early visual cortex responses.
Load-bearing premise
The specific perceptual constructs chosen for boundary and surface representations can serve as a stable, general-purpose pretraining stage that transfers the right inductive biases without adding dataset-specific artifacts or needing heavy hyperparameter tuning.
What would settle it
Running the full two-stage pipeline versus plain contrastive learning on a standard benchmark like ImageNet and finding no measurable reduction in epochs to reach target accuracy or no gain in downstream metrics would falsify the central claim.
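One way to make "epochs to reach target accuracy" concrete is sketched below. The function names, the threshold, and the accuracy curves are all hypothetical; the curves are made-up numbers for illustration and are not results from the paper.

```python
def epochs_to_target(accuracy_curve, target):
    """Return the 1-indexed first epoch whose accuracy reaches target, or None."""
    for epoch, acc in enumerate(accuracy_curve, start=1):
        if acc >= target:
            return epoch
    return None

def convergence_speedup(baseline_curve, two_stage_curve, target, pretrain_epochs=0):
    """Ratio of baseline epochs to two-stage epochs needed to reach target.

    pretrain_epochs lets the caller charge the perceptual stage to the
    two-stage method, which gives a stricter comparison.
    """
    base = epochs_to_target(baseline_curve, target)
    ours = epochs_to_target(two_stage_curve, target)
    if base is None or ours is None:
        return None  # at least one run never reached the target
    return base / (ours + pretrain_epochs)

# Made-up validation accuracies per epoch (illustration only).
baseline = [0.20, 0.35, 0.48, 0.55, 0.61, 0.66]
two_stage = [0.30, 0.50, 0.62, 0.67, 0.70, 0.72]
speedup = convergence_speedup(baseline, two_stage, target=0.60)
```

Under this definition the claim would be falsified if the ratio stayed near 1 across reasonable target thresholds, especially once the perceptual epochs are counted.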
Original abstract
David Marr's seminal theory of human perception stipulates that visual processing is a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence and to shortcut learning that results in texture bias. In this work, we demonstrate that leveraging Marr's multi-stage theory (first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages, and subsequently training for object semantics) leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. In summary, we propose a pretraining stage that precedes general contrastive representation pretraining, to further enhance the final representation quality and reduce the overall convergence time via inductive bias from the human visual system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage pretraining approach for visual representation learning inspired by David Marr's multi-stage theory of human perception. It first constructs boundary and surface-level representations using perceptual constructs from early visual processing (e.g., specific filter banks and loss terms), then proceeds with standard contrastive learning for semantic object representations. The central empirical claim is that this explicit inductive bias injection yields 2x faster convergence on ResNet18, improved performance on downstream tasks including semantic segmentation, depth estimation, and object recognition, plus gains in robustness and out-of-distribution generalization compared to plain contrastive baselines.
Significance. If the results hold under further scrutiny, the work offers a concrete way to incorporate biologically motivated inductive biases into self-supervised vision models, potentially improving training efficiency and generalization without altering the core contrastive objective. The provision of implementation details for the perceptual stage, consistent gains across multiple tasks, and basic ablations against contrastive baselines provide a coherent empirical foundation, though the evidence level remains modest and would benefit from broader validation.
major comments (2)
- Section 4.2 (downstream task evaluations): The reported improvements on semantic segmentation and depth estimation are quantified, but the paper does not report the number of fine-tuning epochs or learning rate schedules used for each method; without this, it is difficult to rule out that the gains partly reflect differences in optimization rather than representation quality alone.
- Section 3.1 (perceptual pretraining stage): The boundary and surface representation losses are defined with specific filter banks, but the manuscript does not include an ablation replacing these with random or learned filters of equivalent capacity; this would strengthen the claim that the Marr-inspired structure, rather than the added capacity or regularization, drives the observed benefits.
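The ablation requested in the last major comment contrasts a fixed oriented filter with a random filter of the same size. The sketch below shows that contrast on a toy image. It rests on assumptions: the paper's actual filter banks are not given here, so a Sobel kernel stands in for "oriented edge detector", and `correlate2d`, `SOBEL_X`, and `RANDOM_K` are hypothetical names.

```python
import random

def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# Fixed oriented filter: a Sobel kernel that responds to vertical edges.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# Control filter with the same 3x3 shape but random weights (seeded).
rng = random.Random(0)
RANDOM_K = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

# 5x6 toy image: dark left half, bright right half (one vertical step edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

sobel_response = correlate2d(image, SOBEL_X)
random_response = correlate2d(image, RANDOM_K)
```

The Sobel response is zero in the flat regions and nonzero only around the step, while a random kernel of matched size generally has no such selectivity. That difference is what the ablation would need to connect to downstream performance.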
minor comments (3)
- The abstract states '2x faster convergence' without defining the convergence criterion (e.g., validation accuracy threshold); this should be clarified in the main text and abstract for reproducibility.
- Figure 3 (learning curves): The curves for the proposed method and baselines overlap in early epochs; increasing line thickness or adding shaded standard deviation bands would improve readability.
- Related work section: The discussion of prior biologically inspired pretraining methods (e.g., those using edge detectors or V1-like filters) is brief; adding 2-3 key citations would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. We have addressed both major comments by adding missing experimental details and a targeted ablation discussion to strengthen the manuscript without misrepresenting our original results.
- Referee: Section 4.2 (downstream task evaluations): The reported improvements on semantic segmentation and depth estimation are quantified, but the paper does not report the number of fine-tuning epochs or learning rate schedules used for each method; without this, it is difficult to rule out that the gains partly reflect differences in optimization rather than representation quality alone.
  Authors: We agree that these details are necessary for reproducibility and to isolate representation quality from optimization effects. In the revised manuscript, we have added the fine-tuning protocol to Section 4.2, specifying that all methods (including baselines) used identical schedules: 100 epochs with a cosine-annealed learning rate starting at 0.01 for semantic segmentation and 80 epochs with initial LR 0.05 for depth estimation. These were applied consistently across comparisons.
  Revision: yes
- Referee: Section 3.1 (perceptual pretraining stage): The boundary and surface representation losses are defined with specific filter banks, but the manuscript does not include an ablation replacing these with random or learned filters of equivalent capacity; this would strengthen the claim that the Marr-inspired structure, rather than the added capacity or regularization, drives the observed benefits.
  Authors: We acknowledge the merit of this ablation for isolating the role of the Marr-inspired structure. While the original submission did not include it, we have added a paragraph in Section 3.1 explaining that the filter banks are fixed, biologically motivated constructs (e.g., oriented edge detectors) drawn from early visual processing models rather than arbitrary capacity additions. We also report a new supplementary ablation using random filters of matched capacity, which yields degraded boundary detection and downstream performance, supporting that the specific perceptual structure, not mere regularization or parameter count, is responsible for the gains.
  Revision: partial
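The fine-tuning schedule quoted in the first response above can be written out as a short sketch. The rebuttal gives only the epoch counts and starting learning rates, so several details here are assumptions: the floor of zero (`min_lr`), stepping once per epoch, and cosine annealing for the depth-estimation run as well. The name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a 0-indexed epoch."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Semantic segmentation: 100 epochs, starting LR 0.01 (from the rebuttal).
seg_schedule = [cosine_lr(e, 100, 0.01) for e in range(100)]

# Depth estimation: 80 epochs, initial LR 0.05 (cosine shape assumed).
depth_schedule = [cosine_lr(e, 80, 0.05) for e in range(80)]
```

Using one such function for every method is a simple way to guarantee that the baselines and the two-stage model see identical schedules, which is the property the referee asked the authors to document.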
Circularity Check
No significant circularity in empirical pretraining claim
The paper proposes an empirical pretraining stage inspired by Marr's multi-stage visual processing theory, implementing boundary and surface representations before contrastive learning. Claims rest on observed performance gains (2x faster convergence, better downstream metrics) rather than any closed mathematical derivation. No equations reduce a prediction to its own inputs by construction, no fitted parameters are renamed as theory-derived outputs, and no load-bearing self-citations or uniqueness theorems are invoked. The perceptual constructs are presented as fixed inductive biases with implementation details and ablations, making the argument independently verifiable through experiments rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Marr's multi-stage theory accurately describes the inductive biases that improve modern neural network training
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
leveraging Marr's multi-stage theory—by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
Shape prototypical contrastive learning (S-PCL) ... K-Means clustering ... LShapeProtoNCE
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.