Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Jianpeng Cheng, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra

arxiv: 2603.02667 · v2 · pith:RFI5KKZZnew · submitted 2026-03-03 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-21 12:11 UTC · model grok-4.3

keywords: unified contrastive generative learning · masking warmup · text-to-image generation · visual representation learning · semantically aligned decoding · joint optimization

The pith

A single shared encoder can be trained for both contrastive alignment and masked generation if the center of the masking distribution is shifted gradually during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that text-image contrastive learning and text-to-image generation can be trained together in one end-to-end model despite their opposing needs for visible image tokens. A schedule called Masking Warmup moves the typical masking ratio from low to high across training steps so that both light and heavy corruption are always present. This shared exposure produces one encoder that supports strong visual understanding and also enables the text encoder to score and steer partially generated images at inference. The resulting model improves on separate contrastive and generative baselines across recognition, segmentation, depth, and generation metrics.

Core claim

The authors show that contrastive and generative objectives become synergistic rather than competing when a single visual encoder is trained under Masking Warmup, a schedule that shifts the center of the masking distribution so low and high masking ratios coexist at every step. The jointly trained encoder then supports Semantically Aligned Decoding, allowing the text encoder to select the best generation trajectory after decoding as little as 12.5 percent of the image.

What carries the argument

Masking Warmup, a training schedule that gradually shifts the center of the masking distribution so that low and high masking ratios coexist at every step and stabilize joint optimization of a shared encoder.
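
As a rough illustration of the schedule, the following Python sketch samples per-step masking ratios. The linear rise of the distribution mean from 0 to 1.0 over the first 36 epochs follows the paper passage quoted on this page, and 0.55 is one of the standard deviations listed in the appendix excerpt. The truncated-Gaussian family, the rejection sampler, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import random

WARMUP_EPOCHS = 36  # from the quoted paper passage
FINAL_MEAN = 1.0    # from the quoted paper passage
SIGMA = 0.55        # one of the standard deviations in the appendix excerpt


def masking_mean(epoch: float) -> float:
    """Center of the masking distribution: linear 0 -> FINAL_MEAN, then constant."""
    progress = min(max(epoch / WARMUP_EPOCHS, 0.0), 1.0)
    return FINAL_MEAN * progress


def sample_mask_ratio(epoch: float, rng: random.Random, sigma: float = SIGMA) -> float:
    """Draw one masking ratio from a Gaussian truncated to [0, 1].

    ASSUMPTION: the distribution family is not given on this page; a truncated
    Gaussian is used here only as a plausible stand-in.
    """
    mean = masking_mean(epoch)
    while True:  # rejection sampling: redraw until the value lands in [0, 1]
        ratio = rng.gauss(mean, sigma)
        if 0.0 <= ratio <= 1.0:
            return ratio


def sample_mask(num_tokens: int, epoch: float, rng: random.Random) -> list[bool]:
    """Return a boolean mask over image tokens (True means the token is masked)."""
    ratio = sample_mask_ratio(epoch, rng)
    num_masked = round(ratio * num_tokens)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    for epoch in (0, 18, 36, 60):
        ratios = [sample_mask_ratio(epoch, rng) for _ in range(10_000)]
        low = sum(r < 0.25 for r in ratios) / len(ratios)
        high = sum(r > 0.75 for r in ratios) / len(ratios)
        print(f"epoch {epoch:>2}: center={masking_mean(epoch):.2f} "
              f"P(ratio<0.25)={low:.2f} P(ratio>0.75)={high:.2f}")
```

Under these assumptions, roughly one draw in ten still exceeds a 0.75 masking ratio at epoch 0, and roughly one in ten stays below 0.25 after warmup ends. That residual spread is the coexistence of light and heavy masking that the argument depends on.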

If this is right

The unified encoder improves ImageNet linear probing by 1.1 percent and 5-shot transfer by 4.1 percent over a pure contrastive baseline.
It raises segmentation performance on ADE20K by 1.9 percent and depth estimation on NYU by 6.25 percent over the same baseline.
On text-to-image generation it reduces FID on CC12M by 6.2 percent relative to a pure generative baseline while preserving CLIP score.
Semantically Aligned Decoding improves both FID and throughput by choosing among candidate trajectories after as little as 12.5 percent of the image has been decoded.
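
The selection step can be sketched as follows. The page states only that the text encoder scores partially generated images and picks the best trajectory once as little as 12.5 percent of the image is decoded. The use of cosine similarity as the score, the precomputed embeddings, the token-count cost model, and every function name here are assumptions for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def select_trajectory(text_embedding: Vector,
                      partial_image_embeddings: Sequence[Vector]) -> int:
    """Index of the partially decoded candidate best aligned with the prompt.

    ASSUMPTION: each candidate has already been embedded by the shared visual
    encoder after partial decoding; the encoders themselves are not modeled.
    """
    scores = [cosine_similarity(text_embedding, e) for e in partial_image_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def tokens_before_selection(num_tokens: int, fraction: float = 0.125) -> int:
    """Number of tokens decoded per candidate before scoring (12.5% by default)."""
    return math.ceil(num_tokens * fraction)


def decoded_token_budget(num_tokens: int, num_candidates: int,
                         fraction: float = 0.125) -> int:
    """Total tokens decoded: every candidate partially, then only the winner fully.

    ASSUMPTION: decoded-token count is used as a crude proxy for compute.
    """
    early = tokens_before_selection(num_tokens, fraction)
    return num_candidates * early + (num_tokens - early)


if __name__ == "__main__":
    prompt = [1.0, 0.0]
    candidates = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print("selected candidate:", select_trajectory(prompt, candidates))
    print("tokens decoded:", decoded_token_budget(256, 8), "vs", 8 * 256)
```

With 256 image tokens and 8 candidates, this toy cost model decodes 8 × 32 + 224 = 480 tokens instead of the 2,048 needed to finish all 8 candidates. That is one way to read the claimed throughput gain; the paper's actual accounting may differ.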

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same gradual-masking idea could be tested on other paired data such as video or audio where alignment and generation objectives pull in opposite directions on visibility.
If the warmup schedule generalizes, it offers a practical way to add multiple objectives to large vision-language models without separate training runs.
Early selection of generation trajectories may cut compute in interactive settings where only a few candidate images need to be completed.

Load-bearing premise

The assumption that gradually shifting the center of the masking distribution over training allows low and high masking ratios to coexist at every step without destabilizing the joint optimization of the shared encoder.

What would settle it

Remove the gradual shift in masking center and train the model with a fixed masking distribution; the joint training should then fail to improve both contrastive and generative performance simultaneously or become unstable.

Original abstract

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper unifies contrastive and generative training via a shifting masking schedule and shows modest gains, but without ablations on that schedule it's unclear whether the unification itself drives the synergy.

The main thing here is that DREAM trains one encoder on both text-image contrastive alignment and masked generative modeling by using Masking Warmup to shift the masking distribution center over time. This keeps low-mask and high-mask examples in the mix at every step instead of forcing a single fixed ratio. They also use the resulting text encoder at inference to score and pick among partially decoded images, which they call Semantically Aligned Decoding and claim speeds things up with only 12.5 percent of the image generated.

Referee Report

3 major / 2 minor

Summary. The paper introduces DREAM, a unified end-to-end model for text-image contrastive learning and text-to-image generation. It addresses the conflicting masking requirements of the two objectives by proposing Masking Warmup, a training schedule that shifts the center of the masking distribution so that low- and high-masking regimes coexist at every step. This enables stable joint optimization of a shared encoder. At inference, the framework uses Semantically Aligned Decoding, in which the text encoder scores partially generated images to select better trajectories. Empirical results claim improvements over CLIP on ImageNet linear probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth (+6.25%), and over FLUID on CC12M FID (+6.2%) while preserving CLIP Score.

Significance. If the central claims are substantiated, the work would demonstrate that contrastive and generative objectives can be synergistic rather than competing when unified through a carefully designed masking schedule. The Masking Warmup mechanism and Semantically Aligned Decoding procedure would constitute concrete, reusable contributions to multimodal representation learning and efficient generation. The reported gains across both understanding and generation benchmarks suggest practical value for reducing the need for separate specialist models.

major comments (3)

[Abstract and Experiments] The central claim that unification produces synergy (rather than the schedule being the dominant factor) is load-bearing yet unsupported by ablation. No comparison is described to a fixed-center masking distribution, a non-warmup joint baseline, or separately trained models; the reported gains (+1.1% linear probing, +6.2% FID) could therefore be attributable to Masking Warmup alone rather than to the unified objective.
[Experiments] The experimental description supplies no details on how the masking-distribution-center schedule parameters were selected, no error bars, and no ablation studies. This absence prevents verification of whether the joint optimization remains stable without the warmup and whether the claimed improvements are statistically reliable.
[Inference / Semantically Aligned Decoding] The description of Semantically Aligned Decoding states that the text encoder can score images decoded to as little as 12.5 % and select the best trajectory, but no quantitative breakdown is given of how much of the FID improvement is due to this inference procedure versus the training unification itself.

minor comments (2)

[Abstract / Method] The abstract and method sections introduce several new terms (Masking Warmup, Semantically Aligned Decoding) without a concise summary table or diagram that would help readers track the relationship between schedule, encoder, and decoding procedure.
[Method] Notation for the masking distribution and its center schedule is introduced informally; an explicit equation or pseudocode would improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical support for our claims without altering the core contributions.

Referee: [Abstract and Experiments] The central claim that unification produces synergy (rather than the schedule being the dominant factor) is load-bearing yet unsupported by ablation. No comparison is described to a fixed-center masking distribution, a non-warmup joint baseline, or separately trained models; the reported gains (+1.1% linear probing, +6.2% FID) could therefore be attributable to Masking Warmup alone rather than to the unified objective.

Authors: We agree that isolating the contribution of unification versus the masking schedule alone would strengthen the paper. Our current results compare the full DREAM model to separately trained single-objective baselines (CLIP and FLUID), which already demonstrate gains from joint training. However, to directly address the concern, we will add ablations in the revision: (1) a joint-training baseline with fixed-center masking distribution, and (2) a non-warmup joint optimization run. These will clarify whether the observed improvements stem from the unified objective enabled by the schedule.

Revision: yes

Referee: [Experiments] The experimental description supplies no details on how the masking-distribution-center schedule parameters were selected, no error bars, and no ablation studies. This absence prevents verification of whether the joint optimization remains stable without the warmup and whether the claimed improvements are statistically reliable.

Authors: We acknowledge the lack of these details in the current manuscript. In the revised version, we will specify the exact schedule parameters (e.g., the linear or exponential shift of the masking center over training epochs), report standard deviations from multiple random seeds as error bars on all metrics, and include an ablation on joint optimization stability when using a fixed masking distribution without warmup. This will allow readers to assess reliability and stability.

Revision: yes

Referee: [Inference / Semantically Aligned Decoding] The description of Semantically Aligned Decoding states that the text encoder can score images decoded to as little as 12.5% and select the best trajectory, but no quantitative breakdown is given of how much of the FID improvement is due to this inference procedure versus the training unification itself.

Authors: We agree that a quantitative decomposition would be valuable. We will add experiments in the revision that apply Semantically Aligned Decoding to the same trained model with and without the inference-time scoring step, reporting the isolated FID delta. This will separate the contribution of the decoding procedure from the gains due to unified training.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical unification with independent baselines

The paper introduces Masking Warmup as a training schedule to reconcile opposing masking needs for contrastive alignment and generative reconstruction in a shared encoder. Reported gains (+1.1% linear probing, +6.2% FID) are framed as direct comparisons against external single-objective baselines (CLIP, FLUID) on standard benchmarks. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the provided text that would make the synergy claim tautological. The method and results remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

Review performed on abstract only; full details on parameters and assumptions unavailable. The central claim rests on the stated domain assumptions about masking requirements and the unverified claim that the warmup schedule produces stable joint training.

free parameters (1)

masking distribution center schedule
The shift of the masking-ratio center is introduced to resolve the conflict, but the rate and shape of that shift are not specified in the abstract.

axioms (2)

domain assumption: Contrastive alignment needs near-complete visible tokens.
Explicitly stated as the reason opposing masking regimes are required.
domain assumption: Masked generative modeling needs heavy corruption.
Explicitly stated as the reason opposing masking regimes are required.

invented entities (2)

Masking Warmup schedule (no independent evidence)
purpose: Gradually shifts the masking distribution center so low and high ratios coexist.
New training procedure introduced to resolve the stated conflict.
Semantically Aligned Decoding (no independent evidence)
purpose: Uses the text encoder to score and select the best partial generation trajectory.
New inference procedure enabled by the joint training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Masking Warmup, a progressive masking schedule, begins with minimal masking ... mean of the distribution increases linearly from 0 to 1.0 over the first 36 epochs. After that point, the mean is fixed at 1.0"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 12 internal anchors

[1] BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.

[2] Emerging properties in self-supervised vision transformers

https://papers.neurips.cc/paper_files/paper/2020/file/70feb62b69f16e0238f741fab228fec2-Paper.pdf. Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660.

[3] Muse: Text-to-image generation via masked generative transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.

[4] A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of PMLR, pages 1597–1607, 2020. https://proceedings.mlr.press/v119/chen20j/chen20j.pdf. Ekin D Cubuk, Barret Zoph, Jonathon Shlens,...

[5] ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009a. doi: 10.1109/CVPR.2009.5206848. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical imag...

[6] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[7] Fluid: Scaling autoregressive text-to-image generative models with continuous tokens

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. arXiv preprint arXiv:2410.13863.

[8] Hidden in plain sight: VLMs overlook their visual representations

Stephanie Fu, Tyler Bonnen, Devin Guillory, and Trevor Darrell. Hidden in plain sight: VLMs overlook their visual representations, 2025. https://arxiv.org/abs/2506.08008. Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, and Pieter Abbeel. Multimodal masked autoencoders learn transferable representations. arXiv preprint arXiv:2205.14204.

[9] Denoising Diffusion Probabilistic Models

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing S...

[10] On improved conditioning mechanisms and pre-training strategies for diffusion models

Tariq Berrada Ifriqi, Pietro Astolfi, Melissa Hall, Reyhane Askari-Hemmat, Yohann Benchetrit, Marton Havasi, Matthew Muckley, Karteek Alahari, Adriana Romero-Soriano, Jakob Verbeek, et al. On improved conditioning mechanisms and pre-training strategies for diffusion models. arXiv preprint arXiv:2411.03177.

[11] Efficiency without compromise: CLIP-aided text-to-image GANs with increased diversity

Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, and Yuki Mitsufuji. Efficiency without compromise: CLIP-aided text-to-image GANs with increased diversity. arXiv preprint arXiv:2506.01493.

[12] SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

[13] SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

[14] Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[15] DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193, 2023. https://arxiv.org/abs/2304.07193. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual mode...

[16] Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3.

[17] DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104.

[18] Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019a. Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens Van Der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv ...

[19] Large Batch Training of Convolutional Networks

Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.

[20] mixup: Beyond Empirical Risk Minimization

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

[21] Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690.

[22] Appendix excerpt: A.2.1 Stability of Masking Warm-up

A.2 Quantitative Results. A.2.1 Stability of Masking Warm-up. In Fig. 6, we plot Linear Probing accuracy and FID (on CC12M) across training for three masking standard deviations (σ ∈ {0.35, 0.45, 0.55}). For σ ∈ {0.45, 0.55}, both metrics improve monotonically throughout training. In contrast, with σ = 0.35, Linear Probing begins to degrade once masking warm-up ends. ...

[23] Appendix excerpt: C Implementation Details

These gains indicate that Semantically Aligned Decoding effectively exploits the encoder's semantically rich visual representations, accessed via the text encoder, highlighting the synergy between representation learning and generation in DREAM. C Implementation Details. In the following section, we provide detailed descriptions of our training setup and include pseudocode for DREAM, REPA, and Semantically Aligned Decoding ...

[24] Appendix excerpt: buffer tokens and encoder architecture

During training, tokens are masked and dropped before being fed into the encoder. We prepend 64 buffer tokens to the unmasked sequence to ensure stability. More specifically, we use standard ViT architecture Dosovitskiy (2020), which consists of a stack of Transformer blocks Dosovitskiy (2020), where each block consists of a multi-head self-attention block and an MLP blo...
