Image Generators are Generalist Vision Learners
Pith reviewed 2026-05-15 07:37 UTC · model grok-4.3
The pith
Image generation pretraining builds general visual representations that reach SOTA on perception tasks when outputs are cast as RGB images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Image generation training serves a role similar to LLM pretraining by letting models acquire powerful and general visual representations. When an image generator is lightly instruction-tuned on a mixture of its original data and small amounts of vision-task data, and all outputs are parameterized as RGB images, the resulting model achieves state-of-the-art results on a range of 2D and 3D understanding tasks, rivaling or surpassing zero-shot specialists such as SAM3 on segmentation and the Depth Anything series on metric depth. These gains occur without loss of the base model's generation capability.
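The tuning mixture described above can be pictured with a minimal sketch. It assumes a per-example coin flip between the two data pools. The function name `mixed_batches`, the `task_fraction` parameter, and the example identifiers are all hypothetical, and the actual mixing ratio is not stated in this summary.

```python
import random


def mixed_batches(generative, task, task_fraction, batch_size, num_batches, seed=0):
    """Yield batches drawn from two pools (hypothetical sampler).

    Each slot takes a vision-task example with probability `task_fraction`
    and an original generative example otherwise. The fraction is an
    assumed knob, not a value reported by the paper.
    """
    rng = random.Random(seed)  # local RNG keeps the sampling reproducible
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            pool = task if rng.random() < task_fraction else generative
            batch.append(rng.choice(pool))
        yield batch


if __name__ == "__main__":
    gen_pool = ["gen_0", "gen_1", "gen_2"]   # placeholder example IDs
    task_pool = ["depth_0", "seg_0"]         # placeholder example IDs
    for batch in mixed_batches(gen_pool, task_pool, task_fraction=0.1,
                               batch_size=4, num_batches=2):
        print(batch)
```

Keeping the fraction as an explicit parameter would make it straightforward to report, and to ablate, how much task data the tuning stage actually sees.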
What carries the argument
Reframing every vision task as RGB image generation inside a single instruction-tuned generator, so that perception and creation share the same output space and training objective.
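To make the idea of a shared RGB output space concrete, here is a minimal sketch of how a segmentation mask and a metric depth value could be written as pixels and read back. The palette, the 80 m clipping range, and the 24-bit packing are illustrative assumptions only. The paper's actual parameterization is not described in this summary, and a real scheme would probably favour a smoother colour mapping, since generated pixels are noisy.

```python
# Hypothetical encodings for illustration; not the paper's scheme.
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}  # assumed class colours
MAX_DEPTH_M = 80.0       # assumed clipping range in metres
LEVELS = 2 ** 24 - 1     # 24-bit code spread over three 8-bit channels


def mask_to_rgb(mask):
    """Turn a 2D grid of class IDs into a 2D grid of RGB triples."""
    return [[PALETTE[c] for c in row] for row in mask]


def rgb_to_mask(image):
    """Decode each pixel to the class with the nearest palette colour."""
    def nearest(px):
        return min(PALETTE, key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(px, PALETTE[c])))
    return [[nearest(px) for px in row] for row in image]


def depth_to_rgb(depth_m):
    """Clip a metric depth and pack its quantized code into one RGB pixel."""
    clipped = min(max(depth_m, 0.0), MAX_DEPTH_M)
    code = round(clipped / MAX_DEPTH_M * LEVELS)
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def rgb_to_depth(rgb):
    """Invert depth_to_rgb up to quantization error."""
    r, g, b = rgb
    return ((r << 16) | (g << 8) | b) / LEVELS * MAX_DEPTH_M


if __name__ == "__main__":
    mask = [[0, 1], [2, 1]]
    print(rgb_to_mask(mask_to_rgb(mask)) == mask)
    print(rgb_to_depth(depth_to_rgb(12.5)))
```

Nearest-colour decoding tolerates small pixel errors in the mask case. The bit-packed depth code does not, which is one reason the exact encoding matters when comparing against specialists.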
If this is right
- A single model can perform both high-quality image generation and multiple perception tasks without separate specialist networks.
- Vision tasks become interchangeable under one image-generation interface, analogous to how text generation unifies language tasks.
- Lightweight tuning on top of generative pretraining is sufficient to reach competitive or superior results on standard benchmarks.
- Generalist vision models built this way can maintain generation ability while acquiring new understanding skills.
Where Pith is reading between the lines
- If generative pretraining is the dominant source of capability, then larger-scale image-generation datasets should produce even stronger understanding without additional task data.
- The same unified interface could extend naturally to video or multi-view generation, enabling consistent 3D scene reasoning in one model.
- Models trained this way might reduce the need for task-specific architectures and evaluation protocols across computer vision.
Load-bearing premise
The reported performance gains on vision tasks arise mainly from the generative pretraining rather than from the particular data mixture or the details of the instruction-tuning stage.
What would settle it
Train an identical architecture from scratch on only the vision-task data, without any generative pretraining, and measure whether it reaches the same accuracy on the same benchmarks.
Original abstract
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that image generation training develops general visual representations analogous to LLM pretraining. It introduces Vision Banana by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original generative data plus limited vision-task data, reframing all outputs as RGB images. This yields SOTA or near-SOTA results on segmentation, depth estimation, and other 2D/3D tasks while preserving generation ability, suggesting generative pretraining as a unified interface for vision understanding.
Significance. If the attribution to generative pretraining holds after proper controls, the work would support a paradigm in which generative vision pretraining becomes foundational for generalist models, unifying generation and perception in a manner parallel to text-based LLMs. This could influence training strategies for foundational vision models.
major comments (2)
- [Abstract] Abstract: The abstract asserts SOTA results on segmentation and depth tasks but supplies no baselines, error bars, ablation details, or data-selection protocol, preventing assessment of whether gains are robust or post-hoc.
- [Experimental setup] Experimental setup (inferred from abstract and methods description): The central claim attributes performance to generative pretraining on NBP, yet the design applies identical instruction-tuning and RGB-output parameterization to the same backbone; no control uses a non-generatively-pretrained model or removes generative data, leaving the contribution of pretraining unisolated from tuning mixture and output interface effects.
minor comments (2)
- [Methods] Clarify the exact proportion of vision-task data versus original generative data and the precise RGB parameterization scheme for each task (e.g., how depth or segmentation masks are encoded as images).
- [Results] Add error bars, multiple random seeds, and statistical comparisons against the cited specialists (SAM3, Depth Anything) in all result tables.
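The reporting requested in the [Results] comment could look like the following minimal sketch: a mean and sample standard deviation over seeds, plus a paired per-seed comparison against a baseline. The function names are hypothetical and the numbers in the usage section are placeholders, not results from the paper or from any cited model.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def paired_mean_difference(ours, baseline):
    """Return (mean difference, standard error) for seed-matched runs.

    Assumes both lists have the same length (at least two runs) and that
    index i refers to the same seed or evaluation split in both.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    mean_diff = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean_diff, sem


if __name__ == "__main__":
    # Placeholder numbers purely for illustration.
    ours = [0.71, 0.73, 0.72]
    baseline = [0.70, 0.70, 0.71]
    print("ours: mean=%.3f std=%.3f" % summarize(ours))
    print("diff: mean=%.3f sem=%.3f" % paired_mean_difference(ours, baseline))
```

A difference that is small relative to its standard error would suggest the claimed margin over a specialist is not yet established.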
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and describe the revisions that will be made to improve the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: The abstract asserts SOTA results on segmentation and depth tasks but supplies no baselines, error bars, ablation details, or data-selection protocol, preventing assessment of whether gains are robust or post-hoc.
  Authors: We agree that the abstract should be more self-contained. In the revised version we will expand it to name the primary baselines (SAM 3 for segmentation, Depth Anything series for depth), note that reported numbers include standard deviations across runs, and briefly state the data protocol (original generative data mixed with a small fraction of task-specific examples). Full tables, error bars, and ablation studies already appear in Sections 4 and 5 and will be cross-referenced in the abstract.
  Revision: yes
- Referee: [Experimental setup] Experimental setup (inferred from abstract and methods description): The central claim attributes performance to generative pretraining on NBP, yet the design applies identical instruction-tuning and RGB-output parameterization to the same backbone; no control uses a non-generatively-pretrained model or removes generative data, leaving the contribution of pretraining unisolated from tuning mixture and output interface effects.
  Authors: We acknowledge that a direct control with a non-generatively-pretrained backbone would more cleanly isolate the pretraining source. Our current evidence rests on the fact that NBP, after generative pretraining, reaches or exceeds specialist SOTA with only lightweight instruction tuning while retaining generation quality; this is contrasted against published zero-shot specialists that lack the unified RGB interface. In the revision we will add an explicit limitations paragraph discussing the missing control and, if compute permits, include a small-scale comparison against a classification-pretrained ViT under the identical tuning recipe. We believe the existing results still support the central hypothesis that generative pretraining supplies the necessary visual representations for the unified interface to succeed.
  Revision: partial
Circularity Check
No circularity: purely empirical claims with no derivations or self-referential reductions
full rationale
The paper reports empirical training of Vision Banana via instruction-tuning on mixed generative and vision-task data, followed by benchmark results on segmentation, depth, etc. No equations, parameter fits presented as predictions, or derivation chains appear in the provided text. Claims rest on observed SOTA performance rather than any self-definitional mapping, fitted-input renaming, or load-bearing self-citation of a uniqueness theorem. The absence of a mathematical derivation chain means there is nothing to reduce to its own inputs by construction. Self-citations (if any) are not invoked to force a central result. This is the standard case of an empirical paper whose validity hinges on experimental controls and data, not on circular logic.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Image generation pretraining builds general visual representations usable for perception tasks.
invented entities (1)
- Vision Banana: no independent evidence
Forward citations
Cited by 5 Pith papers
- PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
  PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
- Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
  Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
- Open-Source Image Editing Models Are Zero-Shot Vision Learners
  Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
- Diffusion Model as a Generalist Segmentation Learner
  DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
- Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
  Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.