GPIC: A Giant Permissive Image Corpus for Visual Generation
Pith reviewed 2026-06-29 07:34 UTC · model grok-4.3
The pith
GPIC introduces a 28-trillion-pixel image corpus with permissive licenses for visual generative modeling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPIC is a Giant Permissive Image Corpus of approximately 28 trillion pixels comprising diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples, with all images permissively licensed for both research and commercial use, safety-filtered, deduplicated, and centrally hosted.
What carries the argument
The Giant Permissive Image Corpus (GPIC), a large-scale dataset of captioned images with permissive licensing that carries the argument for accessible training data.
If this is right
- Enables training of visual generative models without licensing barriers for both research and commercial applications.
- Supplies a standardized benchmarking protocol to compare generative modeling approaches on this corpus.
- Includes a reference baseline using pixel-space flow matching for direct performance comparisons.
- Provides a centrally hosted, deduplicated, and safety-filtered resource to reduce setup costs for large-scale experiments.
Where Pith is reading between the lines
- Access to such a permissively licensed corpus could accelerate experimentation by removing data acquisition hurdles common in visual generation research.
- The scale of 28 trillion pixels may support training regimes that reveal scaling behaviors not visible in smaller datasets.
- Central hosting on a public platform could encourage community contributions of improved models or evaluations on the same data.
Load-bearing premise
Captions produced by the state-of-the-art vision-language model are accurate and detailed enough to support effective training of generative models.
What would settle it
Training multiple generative models on GPIC and measuring whether their output quality and diversity fall substantially below equivalent models trained on human-captioned datasets of similar scale.
Figures
read the original abstract
Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GPIC, a Giant Permissive Image Corpus comprising ~100M training images (plus 200K validation and 1M test examples) sourced from the internet and captioned by a state-of-the-art vision-language model, totaling approximately 28 trillion pixels. All images are asserted to be permissively licensed for research and commercial use; the corpus is safety-filtered, deduplicated, and centrally hosted on Hugging Face. A benchmarking protocol for generative modeling is provided along with a reference baseline using pixel-space flow matching. The dataset, benchmark, and models are released publicly.
Significance. If the licensing, filtering, and caption-quality claims are substantiated, GPIC would offer a valuable large-scale, openly accessible resource for scalable visual generative modeling research, addressing limitations of existing restricted datasets. The explicit public release of the full dataset, evaluation toolkit, and code on Hugging Face and the project site is a clear strength that supports reproducibility and community use.
major comments (2)
- [Abstract] Abstract: The central utility claim—that GPIC enables effective training of generative models—rests on the unverified assumption that captions from the state-of-the-art VLM are sufficiently accurate and detailed; no human evaluation, image-caption fidelity metrics (e.g., CLIPScore or human preference studies), or downstream ablation comparing VLM captions to human captions is supplied.
- [Abstract] Abstract and dataset description: No quantitative details or verification steps are given for licensing checks across the full 100M images, the effectiveness of the safety filter, or the deduplication procedure; these omissions are load-bearing for the permissiveness and safety assertions that distinguish GPIC.
minor comments (1)
- [Abstract] The abstract states 'approximately 28 trillion pixels' without providing a per-split breakdown or total pixel count verification method.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central utility claim—that GPIC enables effective training of generative models—rests on the unverified assumption that captions from the state-of-the-art VLM are sufficiently accurate and detailed; no human evaluation, image-caption fidelity metrics (e.g., CLIPScore or human preference studies), or downstream ablation comparing VLM captions to human captions is supplied.
Authors: We agree that the manuscript would be strengthened by additional evidence on caption quality. The current version relies on a state-of-the-art VLM without including CLIPScore, human studies, or ablations against human captions. In revision we will add CLIPScore computed on a representative subset of images, qualitative caption examples, and an explicit discussion of this limitation. A full-scale human preference study or exhaustive ablation is not feasible at this corpus size, but the provided baseline training results demonstrate practical utility of the captions as-is. revision: partial
-
Referee: [Abstract] Abstract and dataset description: No quantitative details or verification steps are given for licensing checks across the full 100M images, the effectiveness of the safety filter, or the deduplication procedure; these omissions are load-bearing for the permissiveness and safety assertions that distinguish GPIC.
Authors: We acknowledge that the abstract and high-level dataset description omit quantitative verification statistics. The full manuscript describes the overall pipeline, but we will expand the relevant section with concrete numbers: the fraction of images removed by the safety filter, the deduplication rate and method (e.g., perceptual hash or embedding similarity threshold), and the licensing verification approach (source-level permissive license filtering with automated metadata checks). These additions will directly support the permissiveness and safety claims. revision: yes
Circularity Check
No circularity: direct dataset release with no derivations or predictions
full rationale
The paper is a dataset introduction paper that releases GPIC (100M captioned images, splits, safety filtering, and a hosted baseline). It contains no equations, no claimed predictions, no fitted parameters, and no derivation chain that could reduce to self-referential inputs. The central claim is the existence and permissiveness of the corpus itself, which is externally verifiable by download and inspection rather than by any internal reduction. Self-citations, if present, are not load-bearing for any result. This is the standard non-circular outcome for a data release.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
High- resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
2022
-
[3]
Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022. 1
2022
-
[4]
Qwen-Image-VAE-2.0 Technical Report
Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jia- hao Li, Jie Zhang, et al. Qwen-image-vae-2.0 technical report.arXiv preprint arXiv:2605.13565,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators. 1
2024
-
[6]
Nano banana 2: Google’s latest ai image generation model
Google. Nano banana 2: Google’s latest ai image generation model. https://blog.google/ innovation-and-ai/technology/ai/nano-banana-2/ , February 2026. Accessed: 2026- 05-24. 1
2026
-
[7]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis.CoRR, abs/1809.11096, 2018. URL http://arxiv.org/abs/1809. 11096. 1
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[8]
Neural Discrete Representation Learning
Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning.CoRR, abs/1711.00937, 2017. URLhttp://arxiv.org/abs/1711.00937. 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[9]
Taming transformers for high-resolution image synthesis, 2021
Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2021. URLhttps://arxiv.org/abs/2012.09841. 1
-
[10]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Diffusion transformers with representation autoencoders, 2025
Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders, 2025. 1
2025
-
[12]
Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion, 2025
Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion, 2025. URLhttps://arxiv.org/abs/2410.19324. 8
-
[13]
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimiza- tion dilemma in latent diffusion models, 2025. URLhttps://arxiv.org/abs/2501.01423
-
[14]
Visual autoregressive modeling: Scalable image generation via next-scale prediction, 2024
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction, 2024. URL https://arxiv. org/abs/2404.02905. 1
-
[15]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...
-
[16]
Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li
Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: the new data in multimedia research.Communications of the ACM, 59(2):64–73, January 2016. ISSN 1557-7317. doi: 10.1145/2812802. URL http://dx.doi.org/10.1145/2812802. 3, 10
-
[17]
Laion-5b: an open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: an open large-scale dataset for training next generation image-text models....
2022
-
[18]
img2dataset: Easily turn large sets of image urls to an image dataset
Romain Beaumont. img2dataset: Easily turn large sets of image urls to an image dataset. https://github.com/rom1504/img2dataset, 2021. 3
2021
-
[19]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631 2025
-
[20]
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. InAdvances in Neural Information Processing Systems, volume 36, 2023. 3, 7, 16, 17
2023
-
[21]
Imagenet: A large- scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009. 3
2009
-
[22]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.International Journal of Computer Vision, 128...
-
[23]
A self- supervised descriptor for image copy detection
Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self- supervised descriptor for image copy detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022. 4
2022
-
[24]
Efficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 6
2023
-
[25]
Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37: 62557–62583, 2024
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37: 62557–62583, 2024. 6
2024
-
[26]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018. URL https://arxiv.org/abs/1706.08500. 7 12
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[27]
Going Deeper with Convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. URLhttps://arxiv.org/abs/1409.4842. 7
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[28]
Improved precision and recall metric for assessing generative models
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. InNeural Information Processing Systems, 2019. URLhttps://api.semanticscholar.org/CorpusID:118648975. 7
2019
-
[29]
Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lu ˇci´c, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall.arXiv, abs/1806.00035, 2018. URL https://api.semanticscholar.org/CorpusID:44104089
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[30]
Reli- able fidelity and diversity metrics for generative models
Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reli- able fidelity and diversity metrics for generative models. InInternational Conference on Machine Learning, 2020. URLhttps://api.semanticscholar.org/CorpusID:211259260. 7
2020
-
[31]
Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URLhttps://arxiv.org/abs/2303.15343. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Back to Basics: Let Denoising Generative Models Denoise
Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025. 9
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
PixelGen: Improving Pixel Diffusion with Perceptual Supervision
Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss.arXiv preprint arXiv:2602.02493, 2026. 9
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[35]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https: //arxiv.org/abs/2505.09388. 9
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Pillow (pil fork) documentation, 2015
Alex Clark. Pillow (pil fork) documentation, 2015. URL https://buildmedia. readthedocs.org/media/pdf/pillow/latest/pillow.pdf. 16
2015
-
[37]
KEEP BACK
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. InProceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. 16 13 Appendix Contents A Additional Image-Text examples from GPIC 14 B Evaluation 16 B.1 Construction of Imagenet-256 and GPIC-256 . . . . . . . . . ...
2022
-
[38]
Center crop along the longer edge to form a square image
-
[39]
Olympic Games
Bicubic downsampling to 256×256 from the Pillow library [36]. We note that popular Python image libraries use different bicubic interpolation kernels. Our choice of Pillow is consistent with prior work [20, 37]. B.2 Additional Oracle Reference Metrics GPIC Subset FD↓Precision↑Recall↑Density↑Coverage↑ Full 0.07 0.757 0.762 0.973 0.966 Lite 0.07 0.762 0.768...
2012
-
[40]
MAIN SUBJECTS: - Include the 1–3 most important people, animals, objects, or landmarks
-
[41]
red car",
ATTRIBUTES: - Include visible attributes (color, texture, condition) (example: "red car", "snowy road", "bare trees")
-
[42]
forest",
SETTING: - Include 1–2 coarse environment tags (example: "forest", "street", "kitchen")
-
[43]
running",
ACTION (RARE): - Only include if extremely obvious and short (e.g., "running", "sitting"). - Prefer nouns over verbs. COMPRESSION RULE: - Do NOT try to describe everything. - Include only key visual elements. - Missing details are acceptable. COUNTING: - Avoid exact numbers unless extremely obvious. - Prefer plural forms (e.g., "trees"). TEXT IN IMAGE: - ...
-
[44]
Output ONLY the tag list
-
[45]
No sentences, no explanations
-
[46]
No punctuation except commas
-
[47]
snowy road, forest, bare trees, winter, cloudy
All lowercase. EXAMPLES (style reference only): - "snowy road, forest, bare trees, winter, cloudy" - "white cat, sunlight, cozy" - "city street, night, race car, neon lights" USER MESSAGE Write a keyword-style caption (tag-style) for the image shown. Figure F.1: Prompt used to generate tag-styled captions for GPIC images. 21 VLM Captioning: Short MAIN INS...
-
[48]
MAIN SUBJECTS: - Mention the 1–3 most important people/animals/objects/landmarks
-
[49]
red tie",
SIMPLE DETAILS: - Add 1–3 simple visible details that help identify them (example: "red tie", "blue bottle", "white car")
-
[50]
street",
SETTING: - Add ONE short setting word (example: "street", "park", "kitchen")
-
[51]
walking",
MAIN ACTION (ONLY if clearly visible): - Use a simple verb (example: "walking", "sitting", "holding hands"). - If the action is not clearly visible, DO NOT include action(s). COUNTING (STRICT): - Use an exact number ONLY if it is very easy and unambiguous to count. - If an exact number is unclear, do NOT guess. - You may use "several" or "a group of" only...
-
[52]
Output 1–2 sentences (max 2)
-
[53]
Start immediately with the main subjects (no meta phrases)
-
[55]
photo",
Do NOT mention the words "photo", "image" or "picture"
-
[56]
Two cyclists ride on a paved road
Use neutral, literal language. LENGTH (STRICT): - Aim for ~12–25 words total. - Keep sentences short and easy to read. EXAMPLES (style reference only): - "Two cyclists ride on a paved road." - "A white cat lies on a bed near a window." - "A bowl of noodles sits on a table with chopsticks." USER MESSAGE Write a short caption (1–2 sentences) for the image s...
-
[57]
Two cyclists
Start immediately with the main visible entity/entities (no meta phrases). Example starts: "Two cyclists ...", "Close-up of ...", "Passengers ...", "A street ..."
-
[58]
Include the following (when clearly visible): - main objects/entities (people/animals/vehicles/objects/structures) - key visible attributes (color/material/clothing/object type) - scene context (indoor/outdoor + setting such as street/room/park/store/stadium) - grounded spatial layout (foreground/background/left/right/next to/in front of)
-
[59]
several" or
Count entities ONLY when clearly countable. If an exact number is unclear, do NOT guess. You may use "several" or "a group of" only when clearly correct
-
[60]
Otherwise describe a static configuration
Describe actions/poses ONLY when directly supported visually. Otherwise describe a static configuration
-
[61]
NOT VISIBLE
Do NOT write any text from the image. Exception: include visible text ONLY if it is large, clearly readable, and necessary to identify the main subject or scene. EDGE CASES: - If the image is blank OR the main content is not visible/understandable (for example, all black/white, too blurry, too dark, overexposed, or corrupted), output exactly: "NOT VISIBLE...
-
[62]
Output TWO sentences by default
-
[63]
Use THREE sentences ONLY when absolutely required to identify the scene clearly
-
[64]
Aim for ~25–60 words total
-
[67]
image",
Do NOT mention the words "image", "photo", or "picture"
-
[68]
Use neutral, literal language
-
[69]
- "Close-up of a white cat lying on a bed near a window. Soft daylight falls across the blanket, and a curtain is visible along the edge of the frame
Be informative but do NOT attempt exhaustive object listing. EXAMPLES (style reference only): - "Two cyclists ride on a paved road with dashed lane markings. An orange barrier lines the left side, with several people standing behind it on the sidewalk. Trees and buildings appear in the background." - "Close-up of a white cat lying on a bed near a window. ...
-
[70]
1.2) Cover important secondary elements, but do NOT attempt to list every small background object
OBJECTS / ENTITIES (nodes) 1.1) Identify the main visible entities in the scene (people, animals, vehicles, objects, structures). 1.2) Cover important secondary elements, but do NOT attempt to list every small background object. 1.3) Prefer describing entities in a grounded order such as foreground → background when possible. 1.4) If multiple similar enti...
-
[71]
2.2) Do NOT guess precise brands, logos, or fine details unless the text/marking is clearly readable
ATTRIBUTES (visible-only) 2.1) Describe visible attributes only when clearly observable: color, size, shape, material, texture, patterns. 2.2) Do NOT guess precise brands, logos, or fine details unless the text/marking is clearly readable
-
[72]
3.2) If a pose or action is not clearly verifiable, do NOT infer it
POSE + ACTIONS (confidence-gated) 3.1) Describe poses (standing, sitting, leaning, arms extended, head direction) and actions (riding, walking, holding) ONLY when directly supported by clearly visible body position and/or physical contact with an object. 3.2) If a pose or action is not clearly verifiable, do NOT infer it. Instead describe what the body lo...
-
[73]
side-by-side
RELATIONS / LAYOUT 4.1) Describe spatial layout using grounded relationships such as: left, right, top, bottom. 4.2) Do NOT overstate alignment or formation (e.g., do not say “side-by-side” unless clearly true). TEXT (OCR) REQUIREMENT:
-
[74]
If any text is visible anywhere (signs, labels, screens, posters, documents, packaging, subtitles, watermarks, UI elements, logos with words, etc.), you MUST try to transcribe it
-
[75]
Reproduce visible text exactly as written, preserving casing, punctuation, numbers, symbols, and spelling
-
[76]
Only transcribe text that is clearly legible
-
[77]
If text is present but not fully readable, do NOT guess; simply say that text is present
-
[78]
NOT VISIBLE
When including OCR text, place it naturally into the caption (prefer Sentences 3–5), so the caption remains coherent and readable. EDGE CASES: - If the image is blank OR the main content is not visible/understandable (for example, all black/white, too blurry, too dark, overexposed, or corrupted), output exactly: "NOT VISIBLE." OUTPUT RULES (STRICT):
-
[79]
Produce exactly 5–7 sentences
-
[80]
Use this sentence structure (STRICT): - Sentences 1–3: main subjects + key attributes + main actions + core setting - Sentences 4–6: layout + secondary elements + background context (include OCR here when possible) - Sentence 7 (optional): extra fine details that help reconstruction
-
[81]
The sentences must be information-dense rather than brief
-
[82]
Do NOT use bullet points, lists, headings, or JSON
-
[83]
Do NOT include disclaimers or meta commentary
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.