Structure over Pixels: Learning Variable-Length Visual Programs

Kacper Dobek; Krzysztof Krawiec; Piotr Wyrwi\'nski

arxiv: 2605.27696 · v2 · pith:KXAIZOKQnew · submitted 2026-05-26 · 💻 cs.CV · cs.LG

Structure over Pixels: Learning Variable-Length Visual Programs

Piotr Wyrwi\'nski , Kacper Dobek , Krzysztof Krawiec This is my paper

Pith reviewed 2026-06-29 18:03 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords visual tokenizationvariable-length programsstructural representationsrate-distortion supervisionscene complexitycompositional codesdiscrete visual codeslength prediction

0 comments

The pith

STROP learns both a structural codebook and the right number of tokens per image in one forward pass by supervising on rate-distortion quality of DINO features instead of pixels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STROP as a discrete visual tokenizer that produces ordered sequences of codes to describe scene structure while also learning how many codes each image needs. Training proceeds through a four-phase curriculum that uses local rate-distortion measurements on frozen higher-level features to shape the codebook and train a length head. Because no pixel-reconstruction loss is used, the codes are driven only by how well they support accurate latent representations. If the approach holds, program length will scale with scene complexity and the resulting codes will exhibit compositional properties visible both in direct inspection and in transfer to dense prediction tasks.

Core claim

STROP forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

What carries the argument

STROP length head, a module that estimates the active prefix length of the visual program from rate-distortion quality measured on frozen DINOv3 features.

If this is right

Program length increases with scene complexity.
Compositional structure becomes visible upon inspection of the learned code vocabulary.
Downstream dense-prediction tasks benefit from the learned structural codes.
A single forward pass suffices to determine the appropriate program length for any image.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same length-adaptation mechanism could be tested on video sequences to see whether it selects longer programs only for frames with changing structure.
Because length is predicted from structural quality alone, the codes might serve as a more compact input for relational reasoning models than fixed-length tokenizers.
Removing the curriculum phases one at a time would reveal which stage most strongly enforces the observed growth of program length with complexity.

Load-bearing premise

That local rate-distortion probes against frozen DINOv3 features provide a sufficient and unbiased supervision signal for shaping the codebook toward structure without any pixel-level reconstruction term.

What would settle it

Measuring whether estimated program lengths fail to increase when independent measures of scene complexity are increased, or whether downstream dense-prediction performance remains unchanged when the length head is removed.

Figures

Figures reproduced from arXiv: 2605.27696 by Kacper Dobek, Krzysztof Krawiec, Piotr Wyrwi\'nski.

**Figure 1.** Figure 1: Images as discrete adaptive programs. Each image is encoded as a variable-length sequence of discrete code IDs. For each highlighted code position, we intervene by erasing that token from the program and reinterpreting the remaining sequence. The attribution panels show the resulting change in the DINO-aligned patch field: red denotes higher token influence and blue denotes lower influence within that pane… view at source ↗

**Figure 2.** Figure 2: STROP architecture overview. 2 Proposed approach 2.1 Architecture Given an input image x ∈ R 3×H×H, we seek a compact discrete program c = (c1, . . . , cL) with ck ∈ {1, . . . , N} from which a DINO-aligned patch field can be recovered, where L varies per image and |CB| is the codebook size. The model ( [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Training curriculum for autonomous truncation. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Unsupervised readouts on COCO-Stuff-27. DINO, Latent, and Token show k-means [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Supervised downstream segmentation probes on COCO-Stuff-27. DINO and Latent show [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Pairwise program-token synergies. We erase pairs of discrete program tokens and reinterpret the remaining sequence, then visualize the resulting change in the DINO-aligned patch field. The selected COCO-Stuff-27 examples show that high-synergy token pairs often produce localized responses aligned with semantic mask regions, suggesting that program tokens can act as compositional semantic handles. Heatmaps … view at source ↗

**Figure 7.** Figure 7: Program length and scene complexity analysis on CLEVR and COCO. STROP generally [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Program length and scene complexity analysis on COCO using variance of DINO embed [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Contingency heatmap for CLEVR attributes and token ids; codebook size 32. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

read the original abstract

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate--distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STROP jointly trains a length head and codebook via DINOv3 rate-distortion curriculum to get variable-length programs without pixel loss, but the single-supervisor setup leaves open whether lengths track real structure or DINO biases.

read the letter

The main point is that STROP learns both the discrete codes and a continuous per-image length in one model by running a four-phase curriculum on local rate-distortion against frozen DINOv3 features. This is distinct from prior adaptive tokenizers that either search lengths after training or pick from a fixed menu of rates.

It does a straightforward job of removing the pixel reconstruction term so the codebook is shaped only by higher-level latent quality. That choice aligns with the goal of favoring structure over texture, and the length head is trained to output the active prefix directly.

The soft spot is the exclusive reliance on DINOv3 for the supervision signal. The stress-test note flags that any mismatch between DINOv3 geometry and actual scene structure would flow straight into both the learned lengths and the downstream results. The abstract gives no sign of ablations that swap the frozen extractor or add a reconstruction term to test stability, so it is not yet clear whether the length growth with scene complexity is robust or DINO-specific. Without those checks, the claim that compositional structure appears in code inspection and transfer tasks rests on unverified assumptions.

This paper is for groups working on visual tokenizers for structured scene tasks or variable-complexity generation. A reader already thinking about rate-distortion objectives or curriculum-based length control would get the most from it.

It deserves peer review because the mechanism is internally consistent and differs from existing options, even though the current write-up leaves the supervision robustness untested.

Referee Report

2 major / 2 minor

Summary. The paper proposes STROP, a discrete visual tokenizer that learns structural scene representations from images while simultaneously estimating a variable per-image program length via a dedicated length head. It uses a four-phase curriculum driven solely by local rate-distortion probes against frozen DINOv3 features, with no pixel-level reconstruction term, claiming that resulting program lengths grow with scene complexity and that compositional structure appears in downstream dense-prediction transfer and code-vocabulary inspection.

Significance. If the central claims hold after addressing robustness concerns, the work would offer a technically interesting route to adaptive-length structural tokenization that decouples length prediction from pixel texture. The single-forward-pass length head and curriculum design are concrete strengths that could be reusable; however, the absence of any reported invariance checks or alternative-extractor ablations limits the strength of the structural-compositionality claim.

major comments (2)

[Abstract] Abstract: the claim that bypassing pixel reconstruction allows the codebook to be 'shaped entirely by the quality of higher-level latent representations' is load-bearing for the variable-length result, yet the manuscript provides no derivation or experiment showing that the local rate-distortion objective remains stable when the frozen extractor is replaced by a different model family.
[Curriculum description (four-phase procedure)] Curriculum description (four-phase procedure): because both the rate-distortion supervision and the downstream evaluation probes are defined with respect to the same DINOv3 feature space, any DINO-specific bias in latent geometry would directly affect both the learned lengths and the reported transfer gains; a cross-extractor stability test is required to support the claim that lengths reflect scene complexity rather than extractor artifacts.

minor comments (2)

The abstract states that 'signs of compositional structure emerge in direct inspection of the learned code vocabulary' without providing quantitative metrics, example programs, or inter-rater agreement on the inspected codes.
Notation for the length head and the active-prefix selection mechanism is introduced without an accompanying equation or pseudocode block, making it difficult to verify that length estimation occurs in a single forward pass as claimed.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and for recognizing the strengths of the single-forward-pass length head and curriculum design. We address the two major comments below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that bypassing pixel reconstruction allows the codebook to be 'shaped entirely by the quality of higher-level latent representations' is load-bearing for the variable-length result, yet the manuscript provides no derivation or experiment showing that the local rate-distortion objective remains stable when the frozen extractor is replaced by a different model family.

Authors: The rate-distortion objective is formulated generally with respect to any provided frozen feature extractor and contains no pixel reconstruction term, so the codebook is optimized solely against the quality of the supplied latents. The four-phase curriculum applies the same local probes without reference to DINOv3-specific geometry beyond the features themselves. While we did not run replacement experiments with other families, the variable-length head is trained to predict active prefix length from the same rate signal, and the design is extractor-agnostic in principle. We will revise the abstract to state that the codebook is shaped by the chosen higher-level representations rather than claiming universality. revision: partial
Referee: [Curriculum description (four-phase procedure)] Curriculum description (four-phase procedure): because both the rate-distortion supervision and the downstream evaluation probes are defined with respect to the same DINOv3 feature space, any DINO-specific bias in latent geometry would directly affect both the learned lengths and the reported transfer gains; a cross-extractor stability test is required to support the claim that lengths reflect scene complexity rather than extractor artifacts.

Authors: We agree that shared use of DINOv3 for both supervision and evaluation creates a risk that reported lengths and transfer gains partly reflect DINOv3 biases. The downstream tasks themselves are standard benchmarks whose labels are independent of DINOv3, and length growth is observed qualitatively across diverse scenes. However, a full cross-extractor ablation would require repeating the entire four-phase curriculum with alternative frozen models, which was outside the scope of the submitted work. We will add an explicit limitations paragraph discussing extractor dependence and the scope of the current claims. revision: partial

standing simulated objections not resolved

Empirical demonstration that the local rate-distortion objective and resulting program lengths remain stable when the frozen DINOv3 extractor is replaced by a different model family, as this requires new training runs not performed in the manuscript.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation relies on an external frozen DINOv3 model to supply the rate-distortion supervision signal for the four-phase curriculum and length head; this external benchmark is independent of the STROP parameters being optimized. No self-citations, self-definitional equations, fitted inputs renamed as predictions, or ansatzes smuggled via prior author work appear in the abstract or described mechanism. The variable-length estimation and codebook shaping are therefore driven by an outside latent geometry rather than reducing to the model's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits enumeration; the approach rests on the domain assumption that DINOv3 features encode structure more usefully than pixels and that a dedicated length head can be optimized without introducing new fitted constants beyond standard training.

axioms (1)

domain assumption DINOv3 features provide a suitable structural supervision signal for rate-distortion probes
The entire curriculum and codebook shaping depend on this external model being a good proxy for structure.

pith-pipeline@v0.9.1-grok · 5700 in / 1200 out tokens · 40135 ms · 2026-06-29T18:03:03.165521+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 33 canonical work pages · 8 internal anchors

[1]

Object-Centric Learning with Slot Attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-Centric Learning with Slot Attention. In: Advances in Neural Information Processing Systems (NeurIPS). 2020

2020
[2]

Illiterate DALL-E Learns to Compose

Gautam Singh, Fei Deng, and Sungjin Ahn. Illiterate DALL-E Learns to Compose. In:International Conference on Learning Representations (ICLR). 2022. arXiv:2110.11405

work page arXiv 2022
[3]

Bridging the Gap to Real-World Object-Centric Learning

Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, and Francesco Locatello. Bridging the Gap to Real-World Object-Centric Learning. In:International Conference on Learning Representations (ICLR). 2023. arXiv:2209.14860

work page arXiv 2023
[4]

Neural Programmer-Interpreters

Scott Reed and Nando de Freitas. Neural Programmer-Interpreters. In:International Conference on Learning Representations (ICLR). 2016. arXiv:1511.06279

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, and Kun Gai.From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval. 2025. arXiv:2502.12448 [cs.IR]. 10

work page arXiv 2025
[6]

Neural Discrete Representation Learning

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In:Advances in Neural Information Processing Systems (NeurIPS). 2017

2017
[7]

Taming Transformers for High-Resolution Image Synthesis

Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021

2021
[8]

An Image is Worth 32 Tokens for Reconstruction and Generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An Image is Worth 32 Tokens for Reconstruction and Generation. In:Advances in Neural Information Processing Systems (NeurIPS). 2024. arXiv:2406.07550

work page arXiv 2024
[9]

FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O ˘guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, and Afshin Dehghan. FlexTok: Resampling Images into 1D Token Sequences of Flexible Length. In:International Conference on Machine Learning (ICML). 2025. arXiv: 2502.13967

work page arXiv 2025
[10]

One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression

Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression. In:arXiv preprint arXiv:2501.10064(2025)

work page arXiv 2025
[11]

ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan, Matei Zaharia, V olodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. ElasticTok: Adaptive Tokenization for Image and Video. In:International Conference on Learning Representations (ICLR). 2025. arXiv:2410.08368

work page arXiv 2025
[12]

Shivam Duggal, Phillip Isola, Antonio Torralba, and William T. Freeman. Adaptive Length Image Tokenization via Recurrent Allocation. In:International Conference on Learning Representations (ICLR)
[13]

CAT: Content-Adaptive Image Tokenization

Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, and Chunting Zhou. CAT: Content-Adaptive Image Tokenization. In:arXiv preprint arXiv:2501.03120 (2025)

work page arXiv 2025
[14]

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, and Ming-Yu Liu. InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression. In:arXiv preprint arXiv:2512.16975(2025)

work page arXiv 2025
[15]

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. In:arXiv preprint arXiv:2208.06366(2022)

work page arXiv 2022
[16]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. In:International Conference on Learning Representations (ICLR). 2025. arXiv:2410.06940

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked Autoencoders Are Effective Tokenizers for Diffusion Models. In: International Conference on Machine Learning (ICML). 2025. arXiv:2502.03444

work page arXiv 2025
[18]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Generating Diverse High-Fidelity Images with VQ-V AE-2

Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-V AE-2. In:Advances in Neural Information Processing Systems (NeurIPS). 2019

2019
[20]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked Generative Image Transformer. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022

2022
[21]

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. In:Advances in Neural Information Processing Systems (NeurIPS). 2024. arXiv:2404.02905

work page arXiv 2024
[22]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-V AE Made Simple. In:International Conference on Learning Representations (ICLR). 2024. arXiv: 2309.15505

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Autoregressive Image Generation Using Residual Quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation Using Residual Quantization. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022

2022
[24]

Junkins, Dennis Duan, Aniketh Iger, Jerry W

Christopher Fifty, Ronald G. Junkins, Dennis Duan, Aniketh Iger, Jerry W. Liu, Ehsan Amid, Sebastian Thrun, and Christopher Ré. Restructuring Vector Quantization with the Rotation Trick. In:International Conference on Learning Representations (ICLR). 2025. arXiv:2410.06424. 11

work page arXiv 2025
[25]

Language Model Beats Diffusion - Tokenizer is key to visual generation

Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A Ross, and Lu Jiang. Language Model Beats Diffusion - Tokenizer is key to visual generation. In:The Twelfth International Conference on Learning Representat...

2024
[26]

Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, and Philipp Krähenbühl.Spherical Leech Quantization for Visual Tokenization and Generation. 2025. arXiv:2512.14697 [cs.CV]

work page arXiv 2025
[27]

SoftVQ-V AE: Efficient 1-Dimensional Continuous Tokenizer

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. SoftVQ-V AE: Efficient 1-Dimensional Continuous Tokenizer. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025. arXiv: 2412.10958

work page arXiv 2025
[28]

ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. ImageFolder: Autoregressive Image Generation with Folded Tokens. In: 2024. arXiv:2410.01756

work page arXiv 2024
[29]

Adaptive Computation Time for Recurrent Neural Networks

Alex Graves. Adaptive Computation Time for Recurrent Neural Networks. In:arXiv preprint arXiv:1603.08983(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[30]

Pondernet: Learning to ponder.arXiv preprint arXiv:2107.05407,

Andrea Banino, Jan Balaguer, and Charles Blundell. PonderNet: Learning to Ponder. In:arXiv preprint arXiv:2107.05407(2021)

work page arXiv 2021
[31]

Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524, 2025

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation. In:Advances in Neural Information Processing Systems (NeurIPS). 2025. arXiv:2507.10524

work page arXiv 2025
[32]

Return of Unconditional Generation: A Self-Supervised Representation Generation Method

Tianhong Li, Dina Katabi, and Kaiming He. Return of Unconditional Generation: A Self-Supervised Representation Generation Method. In:Advances in Neural Information Processing Systems (NeurIPS)
[33]

arXiv preprint arXiv:2504.10483 (2025)

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking V AE for End-to-End Tuning with Latent Diffusion Transformers. In:IEEE/CVF International Conference on Computer Vision (ICCV). 2025. arXiv:2504.10483

work page arXiv 2025
[34]

Reconstruction vs

Jingfeng Yao and Xinggang Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[35]

Latent Denoising Makes Good Visual Tokenizers

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent Denoising Makes Good Visual Tokenizers. In:arXiv preprint arXiv:2507.15856(2025)

work page arXiv 2025
[36]

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, and Animesh Garg. SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models. In:Advances in Neural Information Processing Systems (NeurIPS). 2023. arXiv:2305.11281

work page arXiv 2023
[37]

Principal Components

Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, and Xiaojuan Qi. “Principal Components” Enable a New Language of Images. In:IEEE/CVF International Conference on Computer Vision (ICCV). 2025. arXiv:2503.08685

work page arXiv 2025
[38]

How Diffusion Models Learn to Factorize and Compose

Qiyao Liang, Ziming Liu, Mitchell Ostrow, and Ila Fiete. How Diffusion Models Learn to Factorize and Compose. In:Advances in Neural Information Processing Systems (NeurIPS). 2024. arXiv: 2408.13256

work page arXiv 2024
[39]

Dick, and Hidenori Tanaka

Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, and Hidenori Tanaka. Compositional Abili- ties Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task. In:arXiv preprint arXiv:2310.09336(2023)

work page arXiv 2023
[40]

Disentangled Latent Representations of Images with Atomic Autoencoders

Alasdair Newson and Yann Traonmilin. Disentangled Latent Representations of Images with Atomic Autoencoders. In:Sampling Theory and Applications Conference (SampTA). 2023

2023
[41]

End-to-End Object Detection with Transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In:European Conference on Computer Vision (ECCV). 2020

2020
[42]

Perceiver: General Perception with Iterative Attention

Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver: General Perception with Iterative Attention. In:International Conference on Machine Learning (ICML). 2021

2021
[43]

Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A General Architecture for Structured Inputs & Outputs. In:International Conference on Learn...

2022
[44]

Everingham, S

M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes Challenge: A Retrospective. In:International Journal of Computer Vision111.1 (Jan. 2015), pp. 98–136. 12

2015
[45]

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari.COCO-Stuff: Thing and Stuff Classes in Context
[46]

arXiv:1612.03716 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Semantic understanding of scenes through the ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. In:International Journal of Computer Vision127.3 (2019), pp. 302–321

2019
[48]

The Cityscapes Dataset for Semantic Urban Scene Understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benen- son, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In:Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

2016
[49]

Indoor Segmentation and Support Inference from RGBD Images

Pushmeet Kohli Nathan Silberman Derek Hoiem and Rob Fergus. Indoor Segmentation and Support Inference from RGBD Images. In:ECCV. 2012

2012
[50]

ImageNet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierar- chical image database. In:2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.DOI:10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009
[51]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In:CoRRabs/1405.0312 (2014). arXiv:1405.0312

work page internal anchor Pith review Pith/arXiv arXiv 2014
[52]

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In:CoRRabs/1612.06890 (2016). arXiv:1612.06890. 13 A Adaptive program-length curriculum details This appendix gives the full parameterization of the adapti...

work page internal anchor Pith review Pith/arXiv arXiv 2016

[1] [1]

Object-Centric Learning with Slot Attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-Centric Learning with Slot Attention. In: Advances in Neural Information Processing Systems (NeurIPS). 2020

2020

[2] [2]

Illiterate DALL-E Learns to Compose

Gautam Singh, Fei Deng, and Sungjin Ahn. Illiterate DALL-E Learns to Compose. In:International Conference on Learning Representations (ICLR). 2022. arXiv:2110.11405

work page arXiv 2022

[3] [3]

Bridging the Gap to Real-World Object-Centric Learning

Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, and Francesco Locatello. Bridging the Gap to Real-World Object-Centric Learning. In:International Conference on Learning Representations (ICLR). 2023. arXiv:2209.14860

work page arXiv 2023

[4] [4]

Neural Programmer-Interpreters

Scott Reed and Nando de Freitas. Neural Programmer-Interpreters. In:International Conference on Learning Representations (ICLR). 2016. arXiv:1511.06279

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, and Kun Gai.From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval. 2025. arXiv:2502.12448 [cs.IR]. 10

work page arXiv 2025

[6] [6]

Neural Discrete Representation Learning

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In:Advances in Neural Information Processing Systems (NeurIPS). 2017

2017

[7] [7]

Taming Transformers for High-Resolution Image Synthesis

Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021

2021

[8] [8]

An Image is Worth 32 Tokens for Reconstruction and Generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An Image is Worth 32 Tokens for Reconstruction and Generation. In:Advances in Neural Information Processing Systems (NeurIPS). 2024. arXiv:2406.07550

work page arXiv 2024

[9] [9]

FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O ˘guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, and Afshin Dehghan. FlexTok: Resampling Images into 1D Token Sequences of Flexible Length. In:International Conference on Machine Learning (ICML). 2025. arXiv: 2502.13967

work page arXiv 2025

[10] [10]

One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression

Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression. In:arXiv preprint arXiv:2501.10064(2025)

work page arXiv 2025

[11] [11]

ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan, Matei Zaharia, V olodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. ElasticTok: Adaptive Tokenization for Image and Video. In:International Conference on Learning Representations (ICLR). 2025. arXiv:2410.08368

work page arXiv 2025

[12] [12]

Shivam Duggal, Phillip Isola, Antonio Torralba, and William T. Freeman. Adaptive Length Image Tokenization via Recurrent Allocation. In:International Conference on Learning Representations (ICLR)

[13] [13]

CAT: Content-Adaptive Image Tokenization

Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, and Chunting Zhou. CAT: Content-Adaptive Image Tokenization. In:arXiv preprint arXiv:2501.03120 (2025)

work page arXiv 2025

[14] [14]

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, and Ming-Yu Liu. InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression. In:arXiv preprint arXiv:2512.16975(2025)

work page arXiv 2025

[15] [15]

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. In:arXiv preprint arXiv:2208.06366(2022)

work page arXiv 2022

[16] [16]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. In:International Conference on Learning Representations (ICLR). 2025. arXiv:2410.06940

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked Autoencoders Are Effective Tokenizers for Diffusion Models. In: International Conference on Machine Learning (ICML). 2025. arXiv:2502.03444

work page arXiv 2025

[18] [18]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Generating Diverse High-Fidelity Images with VQ-V AE-2

Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-V AE-2. In:Advances in Neural Information Processing Systems (NeurIPS). 2019

2019

[20] [20]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked Generative Image Transformer. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022

2022

[21] [21]

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. In:Advances in Neural Information Processing Systems (NeurIPS). 2024. arXiv:2404.02905

work page arXiv 2024

[22] [22]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-V AE Made Simple. In:International Conference on Learning Representations (ICLR). 2024. arXiv: 2309.15505

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Autoregressive Image Generation Using Residual Quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation Using Residual Quantization. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022

2022

[24] [24]

Junkins, Dennis Duan, Aniketh Iger, Jerry W

Christopher Fifty, Ronald G. Junkins, Dennis Duan, Aniketh Iger, Jerry W. Liu, Ehsan Amid, Sebastian Thrun, and Christopher Ré. Restructuring Vector Quantization with the Rotation Trick. In:International Conference on Learning Representations (ICLR). 2025. arXiv:2410.06424. 11

work page arXiv 2025

[25] [25]

Language Model Beats Diffusion - Tokenizer is key to visual generation

Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A Ross, and Lu Jiang. Language Model Beats Diffusion - Tokenizer is key to visual generation. In:The Twelfth International Conference on Learning Representat...

2024

[26] [26]

Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, and Philipp Krähenbühl.Spherical Leech Quantization for Visual Tokenization and Generation. 2025. arXiv:2512.14697 [cs.CV]

work page arXiv 2025

[27] [27]

SoftVQ-V AE: Efficient 1-Dimensional Continuous Tokenizer

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. SoftVQ-V AE: Efficient 1-Dimensional Continuous Tokenizer. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025. arXiv: 2412.10958

work page arXiv 2025

[28] [28]

ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. ImageFolder: Autoregressive Image Generation with Folded Tokens. In: 2024. arXiv:2410.01756

work page arXiv 2024

[29] [29]

Adaptive Computation Time for Recurrent Neural Networks

Alex Graves. Adaptive Computation Time for Recurrent Neural Networks. In:arXiv preprint arXiv:1603.08983(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[30] [30]

Pondernet: Learning to ponder.arXiv preprint arXiv:2107.05407,

Andrea Banino, Jan Balaguer, and Charles Blundell. PonderNet: Learning to Ponder. In:arXiv preprint arXiv:2107.05407(2021)

work page arXiv 2021

[31] [31]

Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524, 2025

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation. In:Advances in Neural Information Processing Systems (NeurIPS). 2025. arXiv:2507.10524

work page arXiv 2025

[32] [32]

Return of Unconditional Generation: A Self-Supervised Representation Generation Method

Tianhong Li, Dina Katabi, and Kaiming He. Return of Unconditional Generation: A Self-Supervised Representation Generation Method. In:Advances in Neural Information Processing Systems (NeurIPS)

[33] [33]

arXiv preprint arXiv:2504.10483 (2025)

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking V AE for End-to-End Tuning with Latent Diffusion Transformers. In:IEEE/CVF International Conference on Computer Vision (ICCV). 2025. arXiv:2504.10483

work page arXiv 2025

[34] [34]

Reconstruction vs

Jingfeng Yao and Xinggang Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. In:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[35] [35]

Latent Denoising Makes Good Visual Tokenizers

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent Denoising Makes Good Visual Tokenizers. In:arXiv preprint arXiv:2507.15856(2025)

work page arXiv 2025

[36] [36]

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, and Animesh Garg. SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models. In:Advances in Neural Information Processing Systems (NeurIPS). 2023. arXiv:2305.11281

work page arXiv 2023

[37] [37]

Principal Components

Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, and Xiaojuan Qi. “Principal Components” Enable a New Language of Images. In:IEEE/CVF International Conference on Computer Vision (ICCV). 2025. arXiv:2503.08685

work page arXiv 2025

[38] [38]

How Diffusion Models Learn to Factorize and Compose

Qiyao Liang, Ziming Liu, Mitchell Ostrow, and Ila Fiete. How Diffusion Models Learn to Factorize and Compose. In:Advances in Neural Information Processing Systems (NeurIPS). 2024. arXiv: 2408.13256

work page arXiv 2024

[39] [39]

Dick, and Hidenori Tanaka

Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, and Hidenori Tanaka. Compositional Abili- ties Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task. In:arXiv preprint arXiv:2310.09336(2023)

work page arXiv 2023

[40] [40]

Disentangled Latent Representations of Images with Atomic Autoencoders

Alasdair Newson and Yann Traonmilin. Disentangled Latent Representations of Images with Atomic Autoencoders. In:Sampling Theory and Applications Conference (SampTA). 2023

2023

[41] [41]

End-to-End Object Detection with Transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In:European Conference on Computer Vision (ECCV). 2020

2020

[42] [42]

Perceiver: General Perception with Iterative Attention

Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver: General Perception with Iterative Attention. In:International Conference on Machine Learning (ICML). 2021

2021

[43] [43]

Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A General Architecture for Structured Inputs & Outputs. In:International Conference on Learn...

2022

[44] [44]

Everingham, S

M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes Challenge: A Retrospective. In:International Journal of Computer Vision111.1 (Jan. 2015), pp. 98–136. 12

2015

[45] [45]

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari.COCO-Stuff: Thing and Stuff Classes in Context

[46] [46]

arXiv:1612.03716 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Semantic understanding of scenes through the ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. In:International Journal of Computer Vision127.3 (2019), pp. 302–321

2019

[48] [48]

The Cityscapes Dataset for Semantic Urban Scene Understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benen- son, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In:Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

2016

[49] [49]

Indoor Segmentation and Support Inference from RGBD Images

Pushmeet Kohli Nathan Silberman Derek Hoiem and Rob Fergus. Indoor Segmentation and Support Inference from RGBD Images. In:ECCV. 2012

2012

[50] [50]

ImageNet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierar- chical image database. In:2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.DOI:10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009

[51] [51]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In:CoRRabs/1405.0312 (2014). arXiv:1405.0312

work page internal anchor Pith review Pith/arXiv arXiv 2014

[52] [52]

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In:CoRRabs/1612.06890 (2016). arXiv:1612.06890. 13 A Adaptive program-length curriculum details This appendix gives the full parameterization of the adapti...

work page internal anchor Pith review Pith/arXiv arXiv 2016