arxiv: 2604.20329 · v2 · submitted 2026-04-22 · 💻 cs.CV · cs.AI

Image Generators are Generalist Vision Learners

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang

Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut

Pith reviewed 2026-05-15 07:37 UTC · model grok-4.3

keywords: image generation, generalist vision models, instruction tuning, segmentation, depth estimation, vision pretraining, unified output interface

The pith

Image generation pretraining builds general visual representations that reach SOTA on perception tasks when outputs are cast as RGB images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that training an image generator develops broad visual understanding, in the same way text generation pretrains language models. It takes a base generator, mixes its original training data with limited vision-task examples, and instruction-tunes the model so that every task—segmentation, depth estimation, and others—becomes the generation of an RGB image. The resulting generalist model matches or exceeds specialist systems on both 2D and 3D benchmarks while preserving its original image-generation quality. This outcome indicates that the generative pretraining itself supplies the core visual knowledge rather than task-specific engineering.
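The data mixture described above can be sketched as a simple sampler. This is a minimal illustration and not the authors' pipeline: the function name `mixture_stream`, the `task_fraction` parameter, and the 10% value in the usage lines are hypothetical, since the text here does not state the actual ratio.

```python
import random
from itertools import islice


def mixture_stream(generative_examples, task_examples, task_fraction, seed=0):
    """Yield training examples forever, drawing from two pools.

    With probability `task_fraction` an example comes from the vision-task
    pool; otherwise it comes from the original generative pool.
    All names and the sampling scheme are assumptions for illustration.
    """
    rng = random.Random(seed)
    while True:
        pool = task_examples if rng.random() < task_fraction else generative_examples
        yield rng.choice(pool)


# Usage: draw 1000 examples with a hypothetical 10% task share.
generative_pool = ["text-to-image pair"]
task_pool = ["segmentation pair", "depth pair"]
batch = list(islice(mixture_stream(generative_pool, task_pool, 0.1), 1000))
task_share = sum(example in task_pool for example in batch) / len(batch)
print(f"observed task share: {task_share:.3f}")
```

Keeping the original generative data in the stream is what the summary credits for preserving generation quality; the sampler only makes that mixing step concrete.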

Core claim

Image generation training serves a role similar to LLM pretraining by letting models acquire powerful and general visual representations. When an image generator is lightly instruction-tuned on a mixture of its original data and small amounts of vision-task data, and all outputs are parameterized as RGB images, the resulting model achieves state-of-the-art results on a range of 2D and 3D understanding tasks, rivaling or surpassing zero-shot specialists such as SAM3 on segmentation and the Depth Anything series on metric depth. These gains occur without loss of the base model's generation capability.

What carries the argument

Reframing every vision task as RGB image generation inside a single instruction-tuned generator, so that perception and creation share the same output space and training objective.
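To make the idea of "perception as RGB generation" concrete, here is one possible way a metric depth value could be written into a pixel and read back. The text here does not describe the paper's actual parameterization, so everything below is an assumption: the depth range `D_MIN`/`D_MAX`, the log scaling, and the black → red → yellow → white color ramp are hypothetical choices made only because they are smooth and invertible.

```python
import math

# Hypothetical metric depth range in meters (not taken from the paper).
D_MIN, D_MAX = 0.1, 100.0
LEVELS = 3 * 255  # one 8-bit ramp per channel, filled in R, G, B order


def depth_to_rgb(depth_m):
    """Encode a depth in meters as an 8-bit RGB triple."""
    d = min(max(depth_m, D_MIN), D_MAX)                 # clamp to the range
    t = math.log(d / D_MIN) / math.log(D_MAX / D_MIN)   # log depth in [0, 1]
    level = round(t * LEVELS)
    r = min(level, 255)
    g = min(max(level - 255, 0), 255)
    b = min(max(level - 510, 0), 255)
    return (r, g, b)


def rgb_to_depth(rgb):
    """Invert depth_to_rgb: the channel sum recovers the quantized level."""
    t = sum(rgb) / LEVELS
    return D_MIN * (D_MAX / D_MIN) ** t


# Usage: round-trip a few depths and report the relative error.
for depth in (0.5, 2.0, 10.0, 50.0):
    recovered = rgb_to_depth(depth_to_rgb(depth))
    print(depth, depth_to_rgb(depth), abs(recovered / depth - 1.0))
```

A ramp of this kind varies smoothly with depth, so small errors in generated pixel values translate into small depth errors; whether the paper uses anything similar is not stated here.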

If this is right

A single model can perform both high-quality image generation and multiple perception tasks without separate specialist networks.
Vision tasks become interchangeable under one image-generation interface, analogous to how text generation unifies language tasks.
Lightweight tuning on top of generative pretraining is sufficient to reach competitive or superior results on standard benchmarks.
Generalist vision models built this way can maintain generation ability while acquiring new understanding skills.
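The "one image-generation interface" point can be illustrated for segmentation with a color palette: each class is painted as a fixed color, and the generated image is decoded by snapping every pixel to the nearest palette entry. This is a hedged sketch under assumed details; the `PALETTE` classes and colors and the nearest-color decoding rule are hypothetical and are not described in the text above.

```python
# Hypothetical class-to-color palette (not from the paper).
PALETTE = {
    "background": (0, 0, 0),
    "person": (255, 0, 0),
    "car": (0, 255, 0),
    "sky": (0, 0, 255),
}


def encode_mask(label_grid):
    """Turn a 2D grid of class names into a 2D grid of RGB triples."""
    return [[PALETTE[label] for label in row] for row in label_grid]


def nearest_label(rgb):
    """Return the class whose palette color is closest in squared RGB distance."""
    return min(
        PALETTE,
        key=lambda name: sum((c - p) ** 2 for c, p in zip(rgb, PALETTE[name])),
    )


def decode_mask(rgb_grid):
    """Turn a (possibly noisy) RGB grid back into class names."""
    return [[nearest_label(pixel) for pixel in row] for row in rgb_grid]


# Usage: a clean round trip, then a slightly off-color generated pixel.
labels = [["sky", "sky"], ["person", "car"]]
print(decode_mask(encode_mask(labels)) == labels)
print(nearest_label((240, 20, 10)))
```

Nearest-color decoding tolerates the small color drift a generator may introduce, which is one reason a palette is a natural fit for this interface.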

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If generative pretraining is the dominant source of capability, then larger-scale image-generation datasets should produce even stronger understanding without additional task data.
The same unified interface could extend naturally to video or multi-view generation, enabling consistent 3D scene reasoning in one model.
Models trained this way might reduce the need for task-specific architectures and evaluation protocols across computer vision.

Load-bearing premise

The reported performance gains on vision tasks arise mainly from the generative pretraining rather than from the particular data mixture or the details of the instruction-tuning stage.

What would settle it

Train an identical architecture from scratch on only the vision-task data, without any generative pretraining, and measure whether it reaches the same accuracy on the same benchmarks.

Original abstract

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Generative pretraining plus RGB task framing gets SOTA on segmentation and depth, but lacks controls to isolate the pretraining effect.

The main point is that taking a strong image generator, mixing in a bit of vision task data, and reframing outputs as RGB images produces competitive or better numbers on segmentation and depth while keeping generation quality. They build Vision Banana from Nano Banana Pro this way and report it beating or matching SAM3 and the Depth Anything models on those tasks with only light tuning. That unified interface is the practical part worth noting. The approach shows you can add perception capabilities without a full retrain or separate heads. The results are presented as evidence that generative pretraining builds general visual understanding similar to how LLMs work.

The soft spot is exactly the missing isolation. No control runs the same tuning and RGB output format on a non-generative backbone or with the original generative data stripped out, so it's unclear how much the pretraining itself drives the gains versus the data mix or the output trick. The abstract also gives no error bars, ablation tables, or data selection details, which leaves the robustness open.

This is for people building multi-task or unified vision systems who want to see one way to collapse generation and understanding into the same model. A reader already working on generative backbones would pick up the concrete setup and numbers. It deserves peer review so the experiments can be checked for the controls and variance that are needed to support the stronger claims.

Referee Report

2 major / 2 minor

Summary. The paper claims that image generation training develops general visual representations analogous to LLM pretraining. It introduces Vision Banana by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original generative data plus limited vision-task data, reframing all outputs as RGB images. This yields SOTA or near-SOTA results on segmentation, depth estimation, and other 2D/3D tasks while preserving generation ability, suggesting generative pretraining as a unified interface for vision understanding.

Significance. If the attribution to generative pretraining holds after proper controls, the work would support a paradigm in which generative vision pretraining becomes foundational for generalist models, unifying generation and perception in a manner parallel to text-based LLMs. This could influence training strategies for foundational vision models.

major comments (2)

[Abstract] The abstract asserts SOTA results on segmentation and depth tasks but supplies no baselines, error bars, ablation details, or data-selection protocol, preventing assessment of whether gains are robust or post-hoc.
[Experimental setup] (Inferred from abstract and methods description.) The central claim attributes performance to generative pretraining on NBP, yet the design applies identical instruction-tuning and RGB-output parameterization to the same backbone; no control uses a non-generatively-pretrained model or removes generative data, leaving the contribution of pretraining unisolated from tuning mixture and output interface effects.

minor comments (2)

[Methods] Clarify the exact proportion of vision-task data versus original generative data and the precise RGB parameterization scheme for each task (e.g., how depth or segmentation masks are encoded as images).
[Results] Add error bars, multiple random seeds, and statistical comparisons against the cited specialists (SAM3, Depth Anything) in all result tables.
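The kind of reporting requested in the [Results] comment can be sketched as follows: a mean and sample standard deviation across seeds, plus a paired bootstrap confidence interval for the per-image score difference between two models. This is an illustrative sketch only. The function names are hypothetical, and every number in the usage lines is a made-up placeholder, not a result from the paper or from any specialist model.

```python
import random
import statistics


def mean_std(scores):
    """Mean and sample standard deviation, e.g. over random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    `scores_a[i]` and `scores_b[i]` must be the two models' scores on the
    same test item; items are resampled with replacement.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Usage with placeholder numbers (NOT results from the paper).
seed_scores = [0.71, 0.73, 0.72]            # one aggregate score per seed
print(mean_std(seed_scores))
model_a = [0.80, 0.82, 0.79, 0.85, 0.81]    # per-image scores, model A
model_b = [0.70, 0.75, 0.72, 0.78, 0.74]    # per-image scores, model B
print(paired_bootstrap_ci(model_a, model_b))
```

If the interval for the paired difference excludes zero, the gap is unlikely to be resampling noise on that test set; it says nothing about run-to-run variance, which is why the seed-level standard deviation is reported separately.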

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the revisions that will be made to improve the manuscript.

Referee: [Abstract] The abstract asserts SOTA results on segmentation and depth tasks but supplies no baselines, error bars, ablation details, or data-selection protocol, preventing assessment of whether gains are robust or post-hoc.

Authors: We agree that the abstract should be more self-contained. In the revised version we will expand it to name the primary baselines (SAM 3 for segmentation, Depth Anything series for depth), note that reported numbers include standard deviations across runs, and briefly state the data protocol (original generative data mixed with a small fraction of task-specific examples). Full tables, error bars, and ablation studies already appear in Sections 4 and 5 and will be cross-referenced in the abstract.
revision: yes

Referee: [Experimental setup] (Inferred from abstract and methods description.) The central claim attributes performance to generative pretraining on NBP, yet the design applies identical instruction-tuning and RGB-output parameterization to the same backbone; no control uses a non-generatively-pretrained model or removes generative data, leaving the contribution of pretraining unisolated from tuning mixture and output interface effects.

Authors: We acknowledge that a direct control with a non-generatively-pretrained backbone would more cleanly isolate the pretraining source. Our current evidence rests on the fact that NBP, after generative pretraining, reaches or exceeds specialist SOTA with only lightweight instruction tuning while retaining generation quality; this is contrasted against published zero-shot specialists that lack the unified RGB interface. In the revision we will add an explicit limitations paragraph discussing the missing control and, if compute permits, include a small-scale comparison against a classification-pretrained ViT under the identical tuning recipe. We believe the existing results still support the central hypothesis that generative pretraining supplies the necessary visual representations for the unified interface to succeed.
revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or self-referential reductions

The paper reports empirical training of Vision Banana via instruction-tuning on mixed generative and vision-task data, followed by benchmark results on segmentation, depth, etc. No equations, parameter fits presented as predictions, or derivation chains appear in the provided text. Claims rest on observed SOTA performance rather than any self-definitional mapping, fitted-input renaming, or load-bearing self-citation of a uniqueness theorem. The absence of a mathematical derivation chain means there is nothing to reduce to its own inputs by construction. Self-citations (if any) are not invoked to force a central result. This is the standard case of an empirical paper whose validity hinges on experimental controls and data, not on circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that generative pretraining encodes transferable visual understanding and that RGB-output parameterization preserves this capability after light tuning. No free parameters or invented physical entities are introduced; model names are engineering artifacts.

axioms (1)

domain assumption: Image generation pretraining builds general visual representations usable for perception tasks
Invoked throughout abstract as the reason SOTA results appear after instruction tuning.

invented entities (1)

Vision Banana (no independent evidence)
purpose: Name for the instruction-tuned generalist model
Engineering label for the resulting system; no independent physical or theoretical existence claimed.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
cs.RO · 2026-05 · unverdicted · novelty 6.0
PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.

Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
cs.CV · 2026-05 · conditional · novelty 6.0
Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.

Open-Source Image Editing Models Are Zero-Shot Vision Learners
cs.CV · 2026-05 · unverdicted · novelty 6.0
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.

Diffusion Model as a Generalist Segmentation Learner
cs.CV · 2026-04 · unverdicted · novelty 6.0
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
cs.CV · 2026-05 · unverdicted · novelty 5.0
Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
