
arxiv: 2605.23819 · v1 · pith:CEUYNQXVnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

Pith reviewed 2026-05-25 04:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords human visual alignment · generative-discriminative continuum · joint energy-based models · perceptual benchmarks · visual representations · hybrid learning

The pith

Human visual alignment peaks at intermediate mixtures of generative and discriminative learning rather than at either extreme.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses Joint Energy-Based Models to vary a single mixing coefficient that shifts training continuously from fully discriminative to fully generative while holding architecture and data fixed. This setup isolates the learning objective and tests the resulting representations on six human-alignment benchmarks covering perceptual similarity, gloss perception, response uncertainty, robustness, shape-texture conflicts, and feature attribution. Alignment with human behavior is highest at intermediate coefficient values, where models gain both the categorical structure produced by discriminative training and the sensitivity to input statistics produced by generative training. Pure endpoints underperform the hybrids on the same tasks. The results indicate that the generative-discriminative choice is not the correct axis for explaining human-aligned vision.

Core claim

By varying the mixing coefficient in JEMs, the study shows that human alignment across the six benchmarks reaches its maximum at intermediate points on the generative-discriminative continuum. These hybrid models combine the categorical distinctions induced by discriminative learning with the structural sensitivity induced by generative learning, producing responses that better match human judgments at multiple levels of vision.

What carries the argument

Joint Energy-Based Models (JEMs) that use a single mixing coefficient to interpolate between discriminative and generative objectives inside one fixed architecture.
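A minimal sketch of how a single coefficient could interpolate the two objectives. It relies on the standard JEM reading of classifier logits f(x): p(y|x) is the softmax of the logits, and the energy of an input is E(x) = -logsumexp over the logits. The convex combination (1 - α)·discriminative + α·generative is an assumption made here for illustration; this page does not give the paper's exact formula. The function names, and the contrastive surrogate used for the generative term, are likewise hypothetical choices. In practice the sampled input would come from an MCMC sampler such as the SGLD procedure mentioned in the figure captions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cross_entropy(logits, label):
    """Discriminative term: -log p(y|x), with p(y|x) = softmax(logits)[label]."""
    return logsumexp(logits) - logits[label]


def energy(logits):
    """JEM-style energy of an input: E(x) = -logsumexp_y f(x)[y]."""
    return -logsumexp(logits)


def mixed_loss(logits_data, label, logits_sample, alpha):
    """Assumed interpolation, not the paper's stated formula.

    alpha = 0 keeps only the discriminative term,
    alpha = 1 keeps only the generative term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    disc = cross_entropy(logits_data, label)
    # Contrastive surrogate for -log p(x): lower the energy of real data,
    # raise the energy of a sample drawn from the model.
    gen = energy(logits_data) - energy(logits_sample)
    return (1.0 - alpha) * disc + alpha * gen
```

Under this assumed form, sweeping `alpha` from 0 to 1 changes only the weighting of the two terms, while the network that produces the logits stays the same. That is the property the argument depends on.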

If this is right

  • Intermediate hybrid models outperform both pure generative and pure discriminative models on the tested human-alignment metrics.
  • The categorical structure from discriminative training and the input sensitivity from generative training are both required for the observed gains.
  • The generative-discriminative dichotomy is not the right framing for achieving human-aligned visual representations.
  • Balancing the two objectives inside a single model yields more human-like behavior than selecting one objective alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures in other domains might also benefit from explicit interpolation toward intermediate regimes rather than endpoint selection.
  • The optimal mixing point could shift with changes in model scale or data distribution, offering a testable prediction for follow-up work.
  • New benchmarks that separately measure category structure and input sensitivity could help locate the balance point more precisely.

Load-bearing premise

That varying only the mixing coefficient fully isolates the learning objective from all other differences in capacity, optimization, or regularization that normally separate generative and discriminative regimes.

What would settle it

A follow-up experiment that adds new human-judgment tasks and finds that intermediate mixing coefficients no longer outperform the pure generative and pure discriminative endpoints after matching for model size and training compute.
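A rough sketch of how such a check could be scored for one benchmark. Given per-seed alignment scores at each mixing coefficient, it reports the mean and standard error per coefficient, and whether the best mean lies strictly inside the continuum. The helper names and the input layout (a dict from α to a list of per-seed scores) are assumptions for illustration, not the paper's evaluation code.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean across seeds (NaN SEM for one seed)."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return mean, sem


def interior_peak(scores_by_alpha):
    """Locate the best-aligned alpha and compare it against the endpoints.

    scores_by_alpha: dict mapping alpha -> list of per-seed alignment scores.
    Returns (best_alpha, is_interior, margin), where margin is the best
    interior mean minus the best endpoint mean.
    """
    alphas = sorted(scores_by_alpha)
    if len(alphas) < 3:
        raise ValueError("need both endpoints and at least one interior alpha")
    means = {a: mean_and_sem(scores_by_alpha[a])[0] for a in alphas}
    endpoints = (alphas[0], alphas[-1])
    best_interior = max(alphas[1:-1], key=means.get)
    best_endpoint = max(endpoints, key=means.get)
    best_alpha = max(alphas, key=means.get)
    margin = means[best_interior] - means[best_endpoint]
    return best_alpha, best_alpha not in endpoints, margin
```

A positive margin only says the hybrid mean is higher. It becomes convincing when the margin is large relative to the per-coefficient SEM, which is hard to establish with few seeds.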

Figures

Figures reproduced from arXiv: 2605.23819 by Bastien Le Lan, Jorge Chang Ortega, Thomas Serre, Victor Boutin.

Figure 1: Human alignment peaks in the hybrid regime. JEMs are trained across the generative (p(x))–discriminative (p(y|x)) continuum by varying α ∈ [0, 1] and evaluated on six human–machine comparison benchmarks. Arrows indicate the best-aligned α for each benchmark. Joint Energy-Based Models (JEMs) Grathwohl et al. [2020a] offer a principled way to resolve this debate. A JEM assigns an energy to each input–label…
Figure 2: Human alignment across the generative–discriminative continuum. JEMs are evaluated across α ∈ [0, 1], from purely discriminative (α = 0) to purely generative (α = 1). (a) Low-level perceptual similarity on BAPPS (JND mAP and 2AFC; human ceiling: 83% Zhang et al. [2018]). (b) Mid-level gloss perception (gloss accuracy and Pearson correlation with human judgments; theoretical upper bound: r = 1). (c) CIFAR-1…
Figure 3: Generative pressure reveals shape bias. a) Visualization of a cue-conflict image under the generative component of JEMs trained with different α values; increasing α shifts the visualization from texture-consistent toward shape-consistent. b) Shape bias across SGLD steps for each α. SGLD increases shape bias in hybrid/generative JEMs, indicating shape-favoring energy landscapes. The α = 1 endpoint is omitt…
Figure 4: Hybrid JEMs align with human saliency. Original images are shown on the left, followed by human ClickMe maps and model attribution maps for JEMs trained across the generative–discriminative continuum. As the generative contribution increases up to intermediate values, attribution maps become more concentrated on object-relevant regions and better resemble human diagnostic regions. Beyond this hybrid regime…
Figure 5: Latent-space sampling trajectories for an ImageNet JEM with…
Figure 6: Qualitative generations obtained after 50 SGLD steps from the same shared latent initialization across…
Figure 7: Example of a reference image and its distorted patches.
Figure 8: 2AFC accuracy and JND mAP on the BAPPS perceptual similarity benchmark, across JEM…
Figure 9: Examples of images labeled as Low or High gloss.
Figure 10: Gloss accuracy vs. human correlation in experiments with different latent sizes.
Figure 11: Surface relief (R²) and light field accuracy across JEM α values for different latent dimensionalities. Shaded regions indicate the standard error of the mean (SEM) across two seeds. The disconnected point on the right of each panel shows the corresponding PixelVAE baseline. a) Human correlation b) Gloss accuracy
Figure 12: Gloss-human correlation and gloss accuracy across JEM…
Figure 13: Qualitative gloss generations as a function of…
Figure 14: CIFAR-10 and CIFAR-10H evaluation across…
Figure 16: Parametric transformations used in the Model-vs-Human benchmark: colour, contrast, frequency…
Figure 17: Nonparametric transformations used in the Model-vs-Human benchmark: sketch, stylized images,…
Figure 18: OOD accuracy and error consistency metrics, across JEM…
Figure 20: Percentage of shape or texture choice made per category for each model.
Figure 21: Evolution of an image dependent on the alpha and the number of MCMC steps.
Figure 22: Shape bias across JEM α values. Shaded regions indicate the standard error of the mean (SEM) across two seeds. H The Click-Me benchmark Modern CNNs achieve high performance on object-recognition benchmarks, but they are also known to rely on shortcut cues that can diverge from the diagnostic features used by human observers. To assess this aspect of alignment, we use the ClickMe dataset introduced by Lins…
Figure 23: Examples of human feature importance maps.
Figure 24: Visual strategy of object recognition. Evaluation metrics. To compare models with humans, we follow the evaluation protocol of Fel et al. [2022b]. For each model, saliency maps are computed on the ClickMe images and compared with the corresponding human feature-importance maps, yielding a quantitative measure of feature alignment (…
Figure 25: ClickMe alignment score across JEM α values. Shaded regions indicate the standard error of the mean (SEM) across two seeds.
Original abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the alignment of visual representations with human behavior is maximized at intermediate points along the generative-discriminative continuum rather than at either extreme. It uses Joint Energy-Based Models (JEMs) with a single mixing coefficient α to interpolate between objectives while holding architecture fixed, then evaluates the resulting models on six human-alignment benchmarks (perceptual similarity, gloss perception, response uncertainty, robustness, shape-texture conflict, and feature attribution). The central result is that hybrid JEMs outperform pure generative or discriminative endpoints across this suite.

Significance. If the isolation of the objective holds, the result would be significant for computational vision: it supplies evidence that human-like behavior emerges from balancing rather than choosing between the two objectives, and it supplies a concrete method (fixed-architecture interpolation) for testing such claims. The use of a continuous mixing parameter within one model family is a methodological strength that directly targets the usual confounds of architecture and data.

major comments (2)
  1. [§3] §3 (JEM training and mixing coefficient): the central claim requires that alignment differences arise solely from the generative-discriminative balance. No loss-curve statistics, gradient-norm diagnostics, or effective-capacity measures are reported across α values, leaving open the possibility that changes in optimization dynamics or implicit regularization (rather than the intended objective shift) produce the observed intermediate peak. This is load-bearing for the causal interpretation.
  2. [§4] §4 (benchmark results): the paper reports consistent maximization at intermediate α but does not provide per-benchmark statistical tests, error bars, or controls for multiple comparisons that would establish the peak is reliably above the endpoints rather than within noise. Without these, the cross-benchmark claim rests on visual inspection alone.
minor comments (2)
  1. [§3] Notation for the mixing coefficient α is introduced without an explicit equation relating it to the joint loss; adding the precise interpolation formula would improve reproducibility.
  2. [§4] Figure captions for the alignment plots do not state the number of random seeds or the exact human-subject sample sizes underlying each benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger evidence that alignment differences stem from the objective balance rather than optimization artifacts, and for emphasizing the importance of statistical rigor. We agree these points are central to the causal interpretation and will revise the manuscript to address both concerns directly.

Point-by-point responses
  1. Referee: [§3] §3 (JEM training and mixing coefficient): the central claim requires that alignment differences arise solely from the generative-discriminative balance. No loss-curve statistics, gradient-norm diagnostics, or effective-capacity measures are reported across α values, leaving open the possibility that changes in optimization dynamics or implicit regularization (rather than the intended objective shift) produce the observed intermediate peak. This is load-bearing for the causal interpretation.

    Authors: We agree that additional diagnostics are required to support the claim that differences arise from the objective rather than training dynamics. In the revised manuscript we will add loss curves, gradient-norm statistics, and effective-capacity measures across α values. These will demonstrate that optimization behavior remains comparable and that the intermediate alignment peak is not explained by differences in convergence, stability, or implicit regularization. revision: yes

  2. Referee: [§4] §4 (benchmark results): the paper reports consistent maximization at intermediate α but does not provide per-benchmark statistical tests, error bars, or controls for multiple comparisons that would establish the peak is reliably above the endpoints rather than within noise. Without these, the cross-benchmark claim rests on visual inspection alone.

    Authors: We accept that formal statistical support is necessary. The revision will include error bars on all figures, per-benchmark statistical tests comparing intermediate α values to the endpoints (with appropriate post-hoc corrections), and family-wise error control across the six benchmarks. These additions will replace reliance on visual inspection with quantitative evidence that the peaks are reliable. revision: yes
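Family-wise error control across several benchmarks can be done in more than one way; the Holm-Bonferroni step-down procedure is one common choice and is sketched here as an assumption, since the rebuttal does not name a method. The function name and the `level` parameter (the family-wise significance level, named so as not to clash with the mixing coefficient) are illustrative. The p-values themselves would come from whatever per-benchmark test compares an intermediate coefficient with the endpoints.

```python
def holm_bonferroni(p_values, level=0.05):
    """Holm step-down procedure.

    Returns a list of booleans, in the input order, that marks which
    hypotheses are rejected while controlling the family-wise error
    rate at `level`.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as hypotheses are rejected: level / (m - rank).
        if p_values[idx] <= level / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

With six benchmarks, the smallest p-value is compared against level/6, the next against level/5, and so on, which is less conservative than applying level/6 to every benchmark.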

Circularity Check

0 steps flagged

No significant circularity; claim rests on external human benchmarks

full rationale

The paper trains JEMs at different values of the mixing coefficient α and measures alignment on six independent human psychophysical benchmarks (perceptual similarity, gloss, uncertainty, robustness, cue conflict, feature attribution). No step defines the alignment metric from the model parameters or loss; the metrics are external. No self-citation is used to justify a uniqueness result or to smuggle an ansatz. No fitted parameter is relabeled as a prediction. The central result is therefore not equivalent to its inputs by construction and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Ledger constructed from abstract only. The central method assumes JEMs cleanly separate objective from architecture; no free parameters are fitted to the human data in the described procedure, and no new entities are introduced.

axioms (1)
  • domain assumption: JEMs allow continuous interpolation between discriminative and generative training objectives inside a fixed architecture by varying a single mixing coefficient.
    This assumption is required for the claim that the experiment isolates the effect of the learning objective.



Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 7 internal anchors

  1. [1]

    Rajesh PN Rao and Dana H Ballard

    doi: 10.1016/j.tics.2007.06.010. Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87,

  2. [2]

    Low-pass filtering improves behavioral alignment of vision models

    Max Wolff, Thomas Klein, Evgenia Rusak, Felix Wichmann, and Wieland Brendel. Low-pass filtering improves behavioral alignment of vision models. arXiv preprint arXiv:2602.13859,

  3. [3]

    doi: 10.48550/arXiv.2602.13859. URL https://arxiv.org/abs/2602.13859. Lukas Muttenthaler, Jonas Dippel, Lorenz Linhardt, Robert A. Vandermeulen, and Simon Kornblith. Human alignment of neural network representations. In International Conference on Learning Representations,

  4. [4]

    Lorenz Linhardt, Marco Morik, Sidney Bender, and Naima Elosegui Borras

    URL https://openreview.net/forum?id=ReDQ1OUQR0X. Lorenz Linhardt, Marco Morik, Sidney Bender, and Naima Elosegui Borras. An analysis of human alignment of latent diffusion models. In ICLR 2024 Workshop on Representational Alignment,

  5. [6]

    Daniel L

    doi: 10.1073/pnas.1403112111. Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365,

  6. [7]

    Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J

    doi: 10.1038/nn.4244. Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J. Majaj, Rishi Rajalingham, Elias B. Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, Daniel L. K. Yamins, and James J. DiCarlo. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv,

  7. [8]

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus

    doi: 10.1101/407007. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations,

  8. [9]

    Intriguing properties of neural networks

    URL https://arxiv.org/abs/1312.6199. Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436,

  9. [10]

    Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J

    doi: 10.1109/CVPR.2015.7298640. Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J. Kellman. Deep convolutional networks do not classify based on global object shape. PLOS Computational Biology, 14(12):e1006613,

  10. [11]

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel

    doi: 10.1371/journal.pcbi.1006613. Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International conference on learning representations, 2018a. Robert Geirhos, Jörn-Henrik Jacobsen, Claud...

  11. [12]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

    doi: 10.1017/S0140525X22002813. Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595,

  12. [13]

    Katherine M

    doi: 10.1038/s41467-020-18946-z. Katherine M. Collins, Umang Bhatt, and Adrian Weller. Eliciting and learning with soft labels from every annotator. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 10, pages 40–52,

  13. [14]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    URL https://arxiv.org/abs/1903.12261. Katherine R Storrs, Barton L Anderson, and Roland W Fleming. Unsupervised learning predicts human perception and misperception of gloss. Nature human behaviour, 5(10):1402–1417,

  14. [15]

    Lasserre, Christopher M

    Julia A. Lasserre, Christopher M. Bishop, and Tom P. Minka. Principled hybrids of generative and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 87–94,

  15. [16]

    Diederik P

    doi: 10.1109/CVPR.2006.227. Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, volume 27,

  16. [17]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146,

  17. [18]

    Learning what and where to attend

    Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018b. Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. Learning what and where to attend. arXiv preprint arXiv:1805.08819,

  18. [19]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Thomas Fel, Ivan F Rodriguez Rodriguez, Drew Linsley, and Thomas Serre. Harmonizing the object recognition strategies of deep neural networks with humans. Advances in neural information processing systems, 35: 9432–9446, 2022a. Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level ...

  19. [20]

    Towards Deeper Understanding of Variational Autoencoding Models

    Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658,

  20. [21]

    Dickerson

    URL https://openreview.net/forum?id=BJgLg3R9KQ. Thomas Fel, Ivan Felipe, Drew Linsley, and Thomas Serre. Harmonizing the object recognition strategies of deep neural networks with humans. International Conference on Learning Representations (ICLR), 2022b. doi: 10.48550/ARXIV.2211.04533. Nikolaus Kriegeskorte and Pamela K. Douglas. Cognitive computational...

  21. [22]

    Thomas Serre, Aude Oliva, and Tomaso Poggio

    doi: 10.1038/s41593-018-0210-5. Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424–6429,

  22. [23]

    Rufin VanRullen and Simon J

    doi: 10.1073/pnas.0700622104. Rufin VanRullen and Simon J. Thorpe. The time course of visual processing: From early perception to decision- making. Journal of Cognitive Neuroscience, 13(4):454–461,

  23. [24]

    Karl Friston

    doi: 10.1162/08989290152001880. Karl Friston. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836,

  24. [25]

    A theory of cortical responses

    doi: 10.1098/rstb.2005.1622. Victor Boutin, Angelo Franciosini, Frédéric Chavane, and Laurent U Perrinet. Pooling strategies in v1 can account for the functional and structural diversity across species. PLOS Computational Biology, 18(7):e1010270, 2022a. Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as bayesian inference. Annual...

  25. [26]

    Gabriel Kreiman and Thomas Serre

    doi: 10.1146/annurev.psych.55.090902.142005. Gabriel Kreiman and Thomas Serre. Beyond the feedforward sweep: Feedback computations in the visual cortex. Annals of the New York Academy of Sciences, 1464(1):222–241,

  26. [27]

    Kohitij Kar and James J

    doi: 10.1111/nyas.14320. Kohitij Kar and James J. DiCarlo. Fast recurrent processing via ventrolateral prefrontal cortex is needed by the primate ventral stream for robust core visual object recognition. Neuron, 109(1):164–176.e5,

  27. [28]

    Victor Boutin, Lakshya Singhal, Xavier Thomas, and Thomas Serre

    doi: 10.1016/j.neuron.2020.09.035. Victor Boutin, Lakshya Singhal, Xavier Thomas, and Thomas Serre. Diversity vs. recognizability: Human-like generalization in one-shot generative models. Advances in Neural Information Processing Systems, 35:20933–20946, 2022b. Victor Boutin, Thomas Fel, Lakshya Singhal, Rishav Mukherji, Akash Nagaraj, Julien Colin, and...

  28. [29]

    Rajat Raina, Andrew Y

    doi: 10.1073/pnas.1912334117. Rajat Raina, Andrew Y. Ng, and Christopher D. Manning. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems 16,

  29. [30]

    The tradeoff between generative and discriminative classifiers

    Guillaume Bouchard and Bill Triggs. The tradeoff between generative and discriminative classifiers. In COMPSTAT 2004, pages 721–728,

  30. [31]

    Iterative vae as a predictive brain model for out-of-distribution generalization

    Victor Boutin, Aimen Zerroug, Minju Jung, and Thomas Serre. Iterative vae as a predictive brain model for out-of-distribution generalization. arXiv preprint arXiv:2012.00557,

  31. [32]

    doi: 10.1016/j.patcog.2019.107156. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning, pages 536–543,

  32. [33]

    Xiulong Yang and Shihao Ji

    doi: 10.1145/1390156.1390224. Xiulong Yang and Shihao Ji. JEM++: Improved techniques for training JEM. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6494–6503,

  33. [34]

    Towards bridging the performance gaps of joint energy-based models

    Xiulong Yang, Qing Su, and Shihao Ji. Towards bridging the performance gaps of joint energy-based models. arXiv preprint arXiv:2209.07959,

  34. [35]

    Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One

    Will Grathwohl, Kuan-Chieh Wang, and Jorn-Henrik Jacobsen. Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One. ICLR, 2020b. URL https://arxiv.org/abs/1912.03263. Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR,

  35. [36]

    Alex Krizhevsky and Geoffrey Hinton

    doi: 10.48550/arXiv.2505.18230. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto,

  36. [37]

    Supplementary Material A Extended Related Work Generative and discriminative theories of vision

    URL https://arxiv.org/abs/1905.13549. Supplementary Material A Extended Related Work Generative and discriminative theories of vision. A longstanding question in vision science is whether human-like visual representations are better explained by discriminative or generative learning principles. Recent work frames this debate as a contrast between two i...

  37. [38]

    and hybrid energy-based classifiers Larochelle and Bengio [2008], Grathwohl et al. [2020a]. These approaches are motivated by the complementary strengths of generative and discriminative objectives, but many introduce additional latent variables, separate modules, or partially distinct parameterizations Kuleshov and Ermon [2017], Gordon and Hernández-L...

  38. [39]

    • All JEMs were trained using mixed precision (via PyTorch AMP) and torch.compile to improve training efficiency

    • We do not use batch normalization in the energy model, as in our experiments it tended to destabilize generative training and often prevented convergence. • All JEMs were trained using mixed precision (via PyTorch AMP) and torch.compile to improve training efficiency. • For Gloss and CIFAR-10H, the generative model (α = 1.0) was selected at the best-FI...

  39. [40]

    BAPPS also includes a just noticeable difference (JND) task to measure sensitivity to small perceptual changes

    Example of a reference image and its distorted patches. BAPPS also includes a just noticeable difference (JND) task to measure sensitivity to small perceptual changes. In the JND task, observers are required to judge whether a distorted patch and the reference image appear perceptually the same or different. Models are then evaluated through similarity, by...

  40. [41]

    Shaded regions indicate the standard error of the mean (SEM) across two seeds

    2AFC accuracy and JND mAP on the BAPPS perceptual similarity benchmark, across JEM α values. Shaded regions indicate the standard error of the mean (SEM) across two seeds. D Gloss and depth perceptual benchmark The gloss perception dataset probes mid-level material perception. This is a challenging task because it requires distinguishing surface refle...

  41. [42]

    with a supervised ResNet-18 model He et al. [2016]. For direct comparison, we trained the same two model classes using the public implementation provided in the Storrs et al. Storrs et al

  42. [43]

    In contrast, the ResNet18 baselines used one seed, and each of the eleven JEM variants was trained with two seeds per condition

    PixelVAE baselines were trained with 10 random seeds. In contrast, the ResNet18 baselines used one seed, and each of the eleven JEM variants was trained with two seeds per condition. Additionally, we used mild label smoothing of 0.05 for the JEMs, which we found helpful for stabilizing training in the binary classification setting. The next step was to ...

  43. [44]

    together with qualitative generations across α (Fig. 13). While increasing α generally improves the visual plausibility of the generated surfaces, the best alignment with human gloss judgments is achieved in the hybrid regime rather than at the purely generative endpoint. a) 10 dimensions b) 100 dimensions c) 500 dimensions d) 2000 dimensions Figure

  44. [45]

    human correlation in experiments with different latent sizes

    Gloss accuracy vs. human correlation in experiments with different latent sizes. a) 10 dimensions b) 100 dimensions c) 500 dimensions d) 2000 dimensions Figure

  45. [46]

    We trained three JEM instances with different seeds for each value of α, using the same general procedure described in Appendix B.2

    and evaluated on CIFAR-10H only at test time. We trained three JEM instances with different seeds for each value of α, using the same general procedure described in Appendix B.2. For the discriminative baselines, we trained VGG, ResNet, and ResNeXt models using the pytorch_image_classification codebase, matching the repository used by Peterson et al. Pete...

  46. [47]

    Nonparametric datasets from Geirhos et al

    Texture–shape benchmarks. Nonparametric datasets from Geirhos et al. [2018a] and Wang et al. [2019]. Benchmark Levels / description Original Clean reference photographs Greyscale Desaturated originals Edge Canny-edge line drawings Silhouette Black-on-white object silhouettes Texture Texture-only patches Cue conflict Stylized images with conflicting shape ...

  47. [48]

    Nonparametric transformations used in the Model-vs-Human benchmark: sketch, stylized images, edge maps, silhouettes, and cue-conflict images. Evaluation metrics We evaluated 11 ImageNet-trained JEMs, corresponding to values of α ranging from 0 to 1 in increments of 0.1, i.e., from purely discriminative to purely generative training. Following Geirhos et a...

  48. [49]

    Shaded regions indicate the standard error of the mean (SEM) across two seeds

    Shape bias across JEM α values. Shaded regions indicate the standard error of the mean (SEM) across two seeds. H The Click-Me benchmark Modern CNNs achieve high performance on object-recognition benchmarks, but they are also known to rely on shortcut cues that can diverge from the diagnostic features used by human observers. To assess this aspect of align...

  49. [50]

    local feature maps

    Visual strategy of object recognition. Evaluation metrics. To compare models with humans, we follow the evaluation protocol of Fel et al. [2022b]. For each model, saliency maps are computed on the ClickMe images and compared with the corresponding human feature-importance maps, yielding a quantitative measure of feature alignment (Fig. 24). In our case, ...