PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
Pith reviewed 2026-05-20 05:14 UTC · model grok-4.3
The pith
A dataset of roughly 95,000 ultra-high-resolution images enables text-to-image models to generate at native 100-megapixel resolution through three training schemes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By curating the PixVerve-95K dataset of 95K images at minimum 100MP resolution with seven-dimensional annotations and applying three training schemes to various T2I foundation models, native 100MP generation is shown to be feasible, as validated by the PixVerve-Bench protocol that measures both visual quality and semantic alignment using standard metrics and multimodal large language model judgments.
What carries the argument
The PixVerve-95K dataset, consisting of roughly 95,000 images (95,735 by the paper's own count), each with at least 100 million pixels and seven-dimensional annotations, paired with three training schemes that adapt text-to-image foundation models for direct native 100MP output.
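To make the 100-million-pixel floor concrete, the following is a minimal sketch of a resolution filter. It is not the paper's curation pipeline, which is not described in detail here; the function names and the candidate sizes are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Threshold taken from the summary above: each image has at least 100 million pixels.
MIN_PIXELS = 100_000_000


def meets_uhr_threshold(width: int, height: int, min_pixels: int = MIN_PIXELS) -> bool:
    """Return True when width * height reaches the minimum pixel count."""
    return width * height >= min_pixels


def filter_uhr(sizes: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the (width, height) pairs that pass the threshold."""
    return [(w, h) for w, h in sizes if meets_uhr_threshold(w, h)]


# Hypothetical candidate sizes, not taken from the dataset.
candidates = [(10_000, 10_000), (7680, 4320), (12_000, 9_000), (11_648, 8_736)]
kept = filter_uhr(candidates)
print(kept)
```

Note that an 8K UHD frame (7680 x 4320, about 33 million pixels) falls well short of the threshold, which illustrates how far 100MP sits above common "ultra-high-resolution" formats.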
If this is right
- Existing text-to-image models can reach native 100MP output without depending on separate upsampling stages.
- The three training schemes supply concrete ways to manage the added complexity of ultra-high-resolution content during adaptation.
- PixVerve-Bench supplies a repeatable protocol for judging both visual quality and prompt alignment at these resolutions.
- Experimental comparisons across schemes yield practical guidance on data use and training choices for higher-resolution work.
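To give a sense of why ultra-high-resolution content adds complexity during adaptation, here is a back-of-the-envelope sketch of how the token count of a latent diffusion transformer grows with image size. The 8x autoencoder downsampling and 2x2 patching are assumptions typical of such models, not values confirmed for the models in the paper, and the square resolutions are hypothetical.

```python
def latent_tokens(width: int, height: int, vae_factor: int = 8, patch: int = 2) -> int:
    """Token count for one image under the assumed strides.

    vae_factor and patch are illustrative assumptions, not the paper's settings.
    Floor division drops any partial token at the image border.
    """
    stride = vae_factor * patch  # pixels covered by one token along each axis
    return (width // stride) * (height // stride)


# Hypothetical square resolutions, chosen only for round numbers.
stages = [("1K", 1024, 1024), ("4K", 4096, 4096), ("100MP", 10_000, 10_000)]
base = latent_tokens(1024, 1024)

for name, w, h in stages:
    tokens = latent_tokens(w, h)
    ratio = tokens / base
    # Full self-attention compares every token with every other token,
    # so its cost grows with the square of the token count.
    print(f"{name}: {tokens:,} tokens, {ratio:.1f}x tokens, "
          f"{ratio ** 2:,.0f}x attention pairs vs 1K")
```

Under these assumptions a 10,000 x 10,000 image maps to 390,625 tokens against 4,096 at 1024 x 1024, roughly 95 times as many, which is why memory handling is a natural question for any native 100MP training scheme.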
Where Pith is reading between the lines
- If the dataset generalizes, the same curation approach could scale to create training sets for resolutions beyond 100MP.
- The results imply that targeted high-quality data collection may matter more than major model redesigns when increasing output resolution.
- Similar techniques could transfer to related tasks such as high-resolution video generation or domain-specific imagery like medical scans.
Load-bearing premise
The curated PixVerve-95K dataset is assumed to contain sufficiently diverse, high-quality, and correctly annotated ultra-high-resolution content that generalizes beyond the specific collection pipeline used to build it.
What would settle it
Generate 100MP images from the adapted models on text prompts describing scenes or objects poorly represented in the 95K dataset; if the outputs exhibit visible artifacts, loss of coherence, or weaker text alignment than lower-resolution baselines, the central claim would be challenged.
Original abstract
Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce PixVerve-95K, a high-quality open-source UHR T2I dataset with 95K images of at least 100MP each and seven-dimensional annotations, curated via a carefully designed pipeline. It extends various T2I foundation models to native 100MP generation using three training schemes and establishes the PixVerve-Bench benchmark for comprehensive evaluation of UHR images using conventional metrics and MLLM-based assessments. The work provides extensive experimental results and insights for future UHR generation breakthroughs.
Significance. If the results hold, this would be a significant contribution to text-to-image generation, because native ultra-high-resolution output is a capability that current models largely lack. The large-scale dataset and benchmark could serve as valuable community resources for further work on high-resolution content. The empirical exploration of training strategies is a strength if the strategies prove effective beyond the specific dataset.
major comments (3)
- [Abstract] The assertion that the curated PixVerve-95K dataset contains sufficiently diverse, high-quality, and correctly annotated ultra-high-resolution content is central to the paper's claims. However, no quantitative checks on annotation accuracy, inter-annotator agreement, or out-of-pipeline generalization are reported, which is critical for validating that the benchmark gains are due to the training schemes rather than data-specific artifacts.
- [Training Schemes] Details on the three training schemes are provided, but the manuscript lacks specific information on how they handle the computational challenges of 100MP images, such as memory efficiency or resolution-specific adaptations, making it difficult to assess the stability of native generation.
- [PixVerve-Bench] The benchmark is described as using multimodal large language model-based assessments, but the specific MLLMs employed and the validation of their assessments against human judgments should be detailed to ensure the reliability of the evaluation protocol.
minor comments (2)
- [Abstract] Consider replacing 'pioneering step' with a less hyperbolic term to align with standard academic tone.
- Verify that all acronyms are defined at first use and that the reference list is complete for prior work on high-resolution image generation.
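The first major comment asks for inter-annotator agreement. One standard statistic for agreement among several annotators is Fleiss' kappa. The sketch below is a plain-Python illustration of that statistic on made-up counts; it is not the authors' evaluation code, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    Returns NaN when chance agreement is 1 (all ratings in one category).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(ratings[0])

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item: List[float] = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    props = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in props)

    if math.isclose(p_e, 1.0):
        return math.nan
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy table: 4 items, 3 annotators, 2 categories.
toy = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(toy), 4))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, so reporting this number for the annotation dimensions would directly address the comment.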
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and indicating where revisions have been made to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The assertion that the curated PixVerve-95K dataset contains sufficiently diverse, high-quality, and correctly annotated ultra-high-resolution content is central to the paper's claims. However, no quantitative checks on annotation accuracy, inter-annotator agreement, or out-of-pipeline generalization are reported, which is critical for validating that the benchmark gains are due to the training schemes rather than data-specific artifacts.
  Authors: We agree that quantitative validation would further substantiate the dataset quality claims. In the revised manuscript, we have added a dedicated subsection in the data curation pipeline section reporting annotation accuracy on a manually verified sample of 2,000 images, inter-annotator agreement via Fleiss' kappa scores from multiple annotators on a 500-image subset, and out-of-pipeline generalization results on an external set of 1,000 UHR images. These additions confirm that performance gains stem from the training schemes rather than dataset artifacts.
  Revision: yes
- Referee: [Training Schemes] Details on the three training schemes are provided, but the manuscript lacks specific information on how they handle the computational challenges of 100MP images, such as memory efficiency or resolution-specific adaptations, making it difficult to assess the stability of native generation.
  Authors: The referee correctly notes the need for more granular implementation details. We have revised the training schemes section to explicitly describe our approaches to computational challenges, including the use of DeepSpeed ZeRO-3 for distributed memory optimization, activation checkpointing to reduce peak memory, and a progressive resolution adaptation strategy that initializes at 4K before scaling to native 100MP. These details demonstrate training stability and feasibility on standard high-end hardware.
  Revision: yes
- Referee: [PixVerve-Bench] The benchmark is described as using multimodal large language model-based assessments, but the specific MLLMs employed and the validation of their assessments against human judgments should be detailed to ensure the reliability of the evaluation protocol.
  Authors: We acknowledge the importance of specifying the evaluation components for reproducibility. The revised manuscript now details the exact MLLMs employed (GPT-4V and LLaVA-1.5) and includes a new validation subsection reporting results from a human study on 300 images, where MLLM scores were compared against averaged human ratings, yielding a Pearson correlation of 0.87. This supports the reliability of the MLLM-based protocol.
  Revision: yes
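The last response reports a Pearson correlation between MLLM scores and averaged human ratings. As an illustration of how such a figure is computed, here is a minimal plain-Python sketch. The score lists are invented for the example and do not reproduce the study in the rebuttal.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-image scores on a 1-5 scale (not data from the paper).
mllm_scores = [1, 2, 3, 4, 5, 3, 4]
human_means = [1.5, 1.8, 3.2, 3.9, 4.6, 2.7, 4.4]
print(f"r = {pearson_r(mllm_scores, human_means):.3f}")
```

Pearson's r captures only linear association. For ordinal 1 to 5 ratings, a rank-based statistic such as Spearman's rho would be a reasonable complementary check.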
Circularity Check
No circularity: empirical dataset and training contribution is self-contained
Full rationale
The paper introduces a new UHR dataset (PixVerve-95K) curated via a described pipeline, applies three training schemes to extend existing T2I models, and evaluates on a new benchmark (PixVerve-Bench). No equations, first-principles derivations, or fitted parameters are presented that reduce claimed performance to quantities defined by or fitted on the same inputs used for evaluation. The contribution is empirical and procedural rather than a closed mathematical chain; reported gains are attributed to experimental outcomes on held-out or constructed benchmarks, with no self-definitional loops, renamed predictions, or load-bearing self-citations that collapse the central claims.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing T2I foundation models can be fine-tuned or adapted to much higher native resolutions without fundamental architectural changes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce PixVerve-95K, the first large-scale, high-quality T2I dataset to push image resolution to 100MP. With a five-stage, automated data pipeline, we curate 95,735 100MP images with fine-grained annotations"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we extend existing T2I foundation models ... with three distinct training schemes"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.