Predicting Visual Memory Schemas with Variational Autoencoders

Adrian Bors; Cameron Kyle-Davidson; Karla Evans

arxiv: 1907.08514 · v1 · pith:JEHMIB6Vnew · submitted 2019-07-19 · 💻 cs.CV

Pith reviewed 2026-05-24 19:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual memory schemas · variational autoencoders · image-to-image translation · memorability prediction · false memorability · dual-channel maps · convolutional neural networks

The pith

A variational autoencoder generates higher-resolution dual-channel maps of image regions that drive true or false visual memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the prediction of visual memory schema maps as an image-to-image translation task. Earlier convolutional networks produced only low-resolution outputs that combined true and false memorability signals. The variational autoencoder instead produces dual-channel images at higher resolution, so that predicted true memorability and predicted false memorability can be scored separately. The authors also measure how these maps relate to ground-truth memorability scores and to scores predicted by other models.

Core claim

Treating visual memory schema prediction as an image-to-image translation task, solved with a variational autoencoder, yields higher-resolution dual-channel images that represent visual memory schemas, so that predicted true memorability and predicted false memorability can be evaluated separately.

What carries the argument

Variational autoencoder trained to translate input images into dual-channel visual memory schema maps.
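
A minimal sketch of the kind of objective such a model is typically trained with, assuming a diagonal-Gaussian latent, a squared-error reconstruction term over both output channels, and a weight `beta` on the KL term. The abstract does not state the authors' actual loss, so the function names, the channel order (0 = true memorability, 1 = false memorability), and the toy values are illustrative assumptions only.

```python
import math

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def reconstruction_mse(pred, target):
    """Mean squared error over every pixel of a map stored as [channel][row][col]."""
    total, count = 0.0, 0
    for p_ch, t_ch in zip(pred, target):
        for p_row, t_row in zip(p_ch, t_ch):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return total / count

def vae_loss(pred, target, mu, log_var, beta=1.0):
    """Reconstruction error plus beta * KL; a negative ELBO up to scale and constants."""
    return reconstruction_mse(pred, target) + beta * kl_diag_gaussian(mu, log_var)

# Toy 2x2 maps (made-up values). Assumed channel order: 0 = true, 1 = false memorability.
target = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
pred = [[[0.5, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.5]]]
loss = vae_loss(pred, target, mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(loss)  # 0.0625
```

Here the posterior equals the prior, so the KL term vanishes and the toy loss is just the mean squared error over the eight output pixels.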

If this is right

Predicted true memorability and false memorability can be evaluated as independent channels rather than a single combined map.
The generated maps reach higher spatial resolution than maps produced by prior convolutional networks.
Relationships can be measured among ground-truth VMS maps, predicted VMS maps, ground-truth memorability scores, and predicted memorability scores.
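
One possible way to measure such a relationship, sketched in plain Python: reduce each map channel to its mean intensity and rank-correlate that summary with per-image memorability scores. The abstract does not name the statistic the authors use, so the choice of Spearman correlation, the `channel_mean` summary, and the toy numbers below are assumptions for illustration.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def channel_mean(channel):
    """Mean intensity of one map channel given as rows of floats."""
    pixels = [p for row in channel for p in row]
    return sum(pixels) / len(pixels)

# Made-up true-memorability channels for three images and made-up scores.
true_channels = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.9, 0.9], [0.9, 0.9]],
]
scores = [0.2, 0.6, 0.7]
rho = spearman([channel_mean(c) for c in true_channels], scores)
```

The same function can be applied to the false-memorability channel, or to predicted versus ground-truth maps, without changing the code.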

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-channel output format may let memory researchers test whether true and false memorability arise from spatially distinct image features.
If the resolution advantage holds on new image sets, the method could be inserted into pipelines that rank or edit images for memorability.
The same translation framing might be applied to other perceptual schema tasks that currently rely on low-resolution regression outputs.

Load-bearing premise

A variational autoencoder trained only on image data can produce accurate visual memory schema maps without task-specific architectural changes or post-processing steps that would remove the claimed resolution gain.

What would settle it

A side-by-side resolution or accuracy comparison in which the variational autoencoder outputs are no higher in resolution or no more accurate than the earlier convolutional-neural-network maps would falsify the central advantage.

Figures

Figures reproduced from arXiv: 1907.08514 by Adrian Bors, Cameron Kyle-Davidson, Karla Evans.

**Figure 2.** Predicting VAEs in images using an autoencoder. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Structure of the Decoder [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Reconstruction accuracy for various image categories. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** Comparison of the memorability results for a set of image categories between the … [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** VISCHEMA2 Latent Space Embedding. Green represents memorability and red … [PITH_FULL_IMAGE:figures/full_fig_p009_6.png]

**Figure 7.** Set of three images from VISCHEMA2 dataset and their predicted true VMS and … [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

Original abstract

Visual memory schema (VMS) maps show which regions of an image cause that image to be remembered or falsely remembered. Previous work has succeeded in generating low resolution VMS maps using convolutional neural networks. We instead approach this problem as an image-to-image translation task making use of a variational autoencoder. This approach allows us to generate higher resolution dual channel images that represent visual memory schemas, allowing us to evaluate predicted true memorability and false memorability separately. We also evaluate the relationship between VMS maps, predicted VMS maps, ground truth memorability scores, and predicted memorability scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VAE for dual-channel high-res VMS maps is a direct application that could help with separate true/false evaluation, but standard VAE smoothing is a real risk to the resolution claim.

The main takeaway is that the authors reframe visual memory schema prediction as an image-to-image translation problem and use a variational autoencoder to output higher-resolution dual-channel maps, one for true memorability and one for false memorability. This lets them evaluate the two channels separately and check relations to ground-truth and predicted memorability scores, which moves past the low-resolution CNN baselines cited in prior work. That dual-channel setup and the explicit link back to memorability scores are the practical steps forward here.

The paper does a clean job of identifying the resolution bottleneck in earlier CNN results and picking a generative model that can in principle scale to finer outputs. If the implementation delivers usable maps, it gives the visual memory community a tool that supports more granular analysis than before. The approach stays within established VAE techniques for image tasks, so the novelty is in the application rather than new machinery.

The soft spot is the lack of any mention of how they handle the well-known smoothing effect in VAE reconstructions. The ELBO objective tends to average details, which can blur the localized regions that matter for VMS maps. The abstract gives no equations, loss modifications, perceptual terms, or adversarial components that would counteract this, so it is not obvious whether the higher-resolution outputs actually preserve sharp distinctions or simply look better at a distance. Without quantitative resolution metrics or side-by-side comparisons in the provided text, the central advantage remains unverified. The evaluation of map-to-score relationships is a sensible addition, but again the abstract supplies no details on the metrics used.

This is a niche paper aimed at researchers who already work on visual memory modeling or who apply image translation models to perceptual psychology tasks. A reader in that intersection would get value from seeing the dual-channel framing and the attempt to raise resolution. The work shows clear thinking about the task constraints and honest engagement with the prior CNN limitation, so it is coherent on its own terms. It deserves a serious referee to check the actual outputs, training procedure, and whether any fixes were needed for sharpness. I would send it to peer review rather than desk reject.
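
The averaging concern can be illustrated with a toy case that is unrelated to the paper's data. Assuming a squared-error reconstruction term and two equally plausible sharp targets for the same input, the pixelwise mean of the two scores a lower expected error than either sharp candidate, which is the mechanism behind blurry decodings. The 1-D "maps" and all names below are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length 1-D maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Two equally likely sharp targets for the same input (made-up values).
sharp_a = [0.0, 1.0, 0.0, 0.0, 0.0]
sharp_b = [0.0, 0.0, 0.0, 1.0, 0.0]

def expected_mse(pred):
    """Expected squared error when each sharp target occurs with probability 0.5."""
    return 0.5 * mse(pred, sharp_a) + 0.5 * mse(pred, sharp_b)

# The pixelwise mean spreads the peak over both locations at half height.
blurry = [(a + b) / 2 for a, b in zip(sharp_a, sharp_b)]

loss_blurry = expected_mse(blurry)
loss_sharp = expected_mse(sharp_a)
print(loss_blurry < loss_sharp)  # True
```

Whether the paper's model suffers from this in practice is exactly what a referee would need the full outputs to judge; the sketch only shows why the question is worth asking.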

Referee Report

2 major / 1 minor

Summary. The manuscript claims that treating visual memory schema (VMS) prediction as an image-to-image translation task with a variational autoencoder enables generation of higher-resolution dual-channel outputs representing true and false memorability (improving on prior low-resolution CNN results), while also evaluating relationships among VMS maps, predicted VMS maps, ground-truth memorability scores, and predicted scores.

Significance. If the central claim holds, the work would supply a direct VAE-based route to separable, higher-resolution VMS channels without task-specific post-processing, potentially improving localization of memorability cues and enabling finer-grained analysis of true versus false memory effects.

major comments (2)

[Abstract] Abstract and method description: the claim that a standard VAE yields usable higher-resolution dual-channel VMS maps rests on the assumption that the ELBO objective (reconstruction + KL) will preserve sharp localization; however, VAEs are known to produce averaged, blurry decodings, which would directly undermine the resolution advantage for region-specific true/false memorability channels unless mitigated by unmentioned losses or architectural changes.
[Abstract] Abstract: no training procedure, loss formulation, network architecture details, or quantitative metrics (e.g., resolution achieved, PSNR/SSIM on dual channels, or comparison to prior CNN baselines) are supplied, so the data cannot be checked against the stated benefit of higher-resolution separable outputs.
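
For concreteness, a per-channel PSNR of the kind requested above could be computed as in the sketch below. It assumes maps are stored as nested lists of floats in [0, 1], with channel 0 for true and channel 1 for false memorability; those conventions and the toy values are not taken from the manuscript.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for one channel given as rows of floats."""
    sq_err = [(p - t) ** 2 for p_row, t_row in zip(pred, target)
              for p, t in zip(p_row, t_row)]
    mse = sum(sq_err) / len(sq_err)
    if mse == 0.0:
        return math.inf  # identical channels
    return 10.0 * math.log10(max_value ** 2 / mse)

def per_channel_psnr(pred_map, target_map):
    """PSNR computed separately for each channel of a [channel][row][col] map."""
    return [psnr(p, t) for p, t in zip(pred_map, target_map)]

# Made-up 2x2 maps: channel 0 is slightly off, channel 1 matches exactly.
target_map = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
pred_map = [[[0.9, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
channel_scores = per_channel_psnr(pred_map, target_map)
```

Reporting the two channels separately, rather than one pooled figure, would match the paper's stated goal of evaluating true and false memorability independently.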

minor comments (1)

[Abstract] The abstract states that relationships among VMS maps and memorability scores are evaluated but does not name the correlation or regression methods used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below.

Referee: [Abstract] Abstract and method description: the claim that a standard VAE yields usable higher-resolution dual-channel VMS maps rests on the assumption that the ELBO objective (reconstruction + KL) will preserve sharp localization; however, VAEs are known to produce averaged, blurry decodings, which would directly undermine the resolution advantage for region-specific true/false memorability channels unless mitigated by unmentioned losses or architectural changes.

Authors: We agree that standard VAEs can yield blurry outputs in general. Our manuscript presents the VAE as an image-to-image translation model that produces higher-resolution dual-channel outputs than prior CNN work; the full paper includes experimental results supporting this. We will revise the manuscript to clarify the specific architecture and any modifications employed to support localization in the generated maps.

revision: partial

Referee: [Abstract] Abstract: no training procedure, loss formulation, network architecture details, or quantitative metrics (e.g., resolution achieved, PSNR/SSIM on dual channels, or comparison to prior CNN baselines) are supplied, so the data cannot be checked against the stated benefit of higher-resolution separable outputs.

Authors: The abstract is concise by design. The full manuscript details the training procedure, ELBO loss, network architecture, achieved resolution, and quantitative comparisons to prior CNN baselines. We will revise the abstract to reference these elements or point to the methods and results sections.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard VAE training

The paper frames VMS prediction as an image-to-image translation task solved via a variational autoencoder, with the abstract and available text describing generation of dual-channel outputs for separate true/false memorability evaluation. No equations, fitted parameters, or self-citations are quoted that reduce any claimed prediction to its inputs by construction. The method invokes standard VAE properties (ELBO training) without redefining quantities in terms of the target outputs or importing uniqueness results from the authors' prior work. The central claim remains an empirical assertion about resolution and separate channel evaluation, not a self-referential renaming or fit. This is the expected non-finding for a paper whose derivation chain is externally grounded in established VAE mechanics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that standard VAE training will produce the claimed resolution and separation benefits.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 6 internal anchors

[1] E. Akagunduz, A. G. Bors, and K. K. Evans. Defining Image Memorability using the Visual Memory Schema. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2019. URL http://arxiv.org/abs/1903.02056

[2] Y. Baveye, R. Cohendet, M. Perreira Da Silva, and P. Le Callet. Deep learning for image memorability prediction: The emotional bias. In Proc. of the 24th ACM Int. Conf. on Multimedia, pages 491–495, 2016.

[3] W. Brendel and M. Bethge. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. In Proc. Int. Conf. on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1904.00760

[4] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. CoRR, abs/1205.2653, 2012. URL http://arxiv.org/abs/1205.2653

[5] R. Dubey, J. Peterson, A. Khosla, M. Yang, and B. Ghanem. What makes an object memorable? In Proc. IEEE Int. Conf. on Computer Vision, pages 1089–1097, 2015.

[6] J. Fajtl, V. Argyriou, D. Monekosso, and P. Remagnino. AMNet: Memorability Estimation with Attention. In Proc. IEEE Computer Vision and Pattern Recognition (CVPR), pages 6363–6372, 2018.

[7] D. Garcia-Gasulla, F. Parés, A. Vilalta, J. Moreno, E. Ayguadé, J. Labarta, U. Cortés, and T. Suzumura. On the Behavior of Convolutional Nets for Feature Extraction. Jour. of Artificial Intelligence Research, 61:563–592, 2018.

[8] A. Gonzalez-Garcia, D. Modolo, and V. Ferrari. Do Semantic Parts Emerge in Convolutional Neural Networks? Int. Journal of Computer Vision, 126(5):476–494, 2018.

[9] J. Harel, C. Koch, and P. Perona. Graph-Based Visual Saliency. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 545–552, 2006.

[10] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In Proc. IEEE Conf. on Comp. Vision and Pattern Recognition, pages 145–152, 2011.

[11] P. Jing, Y. Su, L. Nie, and H. Gu. Predicting Image Memorability Through Adaptive Transfer Learning From External Sources. IEEE Trans. on Multimedia, 19(5):1050–1062, 2017.

[12] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. Understanding and Predicting Image Memorability at a Large Scale. In IEEE Int. Conf. on Comp. Vision, pages 2390–2398.

[13] A. Khosla, J. Xiao, A. Torralba, and A. Oliva. Memorability of image regions. In Advances in Neural Information Processing Systems (NIPS), pages 296–304, 2012.

[14] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proc. Int. Conf. on Learning Repres. (ICLR), 2014. URL http://arxiv.org/abs/1312.6114

[15] J. Lukavský and F. Děchtěrenko. Visual properties and memorising scenes: Effects of image-space sparseness and uniformity. Attention, Perception, & Psychophysics, 79(7):2044–2054, October 2017.

[16] H. Peng, K. Li, B. Li, H. Ling, W. Xiong, and W. Hu. In Proc. of the 23rd ACM Int. Conf. on Multimedia, pages 1147–1150, 2015.

[17] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014. URL http://arxiv.org/abs/1409.1556

[18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object Detectors Emerge in Deep Scene CNNs. In Proc. Int. Conf. on Learning Representations (ICLR), 2015. URL http://arxiv.org/abs/1412.6856
