Recognition: unknown
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
Pith reviewed 2026-05-09 22:17 UTC · model grok-4.3
The pith
A single framework unifies image generation and fake-image detection so each task strengthens the other.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By placing a generative network and a discriminative detector inside one model and connecting them with symbiotic multimodal self-attention plus detector-informed generative alignment, the generation task supplies richer features that improve the interpretability of authenticity judgments, while authenticity criteria in turn steer the generator toward higher-fidelity outputs. The authors state that this mutual guidance produces better results than training the two tasks separately.
What carries the argument
A symbiotic multimodal self-attention mechanism, combined with detector-informed generative alignment, lets information flow between the generative and discriminative branches without requiring separate architectures.
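The paper's architecture is not spelled out in this summary, but the core idea of a shared attention pass over both branches can be sketched. This is a minimal illustration, assuming single-head attention over concatenated token streams; all names, shapes, and projections are illustrative, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symbiotic_attention(gen_tokens, det_tokens, W_q, W_k, W_v):
    """One shared self-attention pass over concatenated generative and
    discriminative token streams, so each branch can attend to the other.
    Shapes: token arrays are (n, d); projection matrices are (d, d)."""
    x = np.concatenate([gen_tokens, det_tokens], axis=0)   # (n_g + n_d, d)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    out = softmax(scores) @ v
    n_g = gen_tokens.shape[0]
    return out[:n_g], out[n_g:]   # split back into the two branches

rng = np.random.default_rng(0)
d = 8
gen = rng.standard_normal((4, d))   # 4 generative tokens
det = rng.standard_normal((3, d))   # 3 detector tokens
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
g_out, d_out = symbiotic_attention(gen, det, *W)
print(g_out.shape, d_out.shape)  # (4, 8) (3, 8)
```

The point of the sketch is only that one attention matrix covers both token sets, so detector tokens can condition generation and vice versa in a single pass.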
If this is right
- Generation quality rises because authenticity signals from the detector guide synthesis toward more realistic outputs.
- Detection accuracy rises because the generator supplies features that make authenticity decisions more interpretable.
- The same model produces state-of-the-art numbers on both tasks across multiple public datasets.
- Seamless information exchange occurs between the two tasks through the shared attention and alignment components.
Where Pith is reading between the lines
- The same joint-training pattern could be tested on paired tasks such as text generation paired with AI-text detection.
- If the approach generalizes, future foundation models may need to include built-in verification heads rather than relying on external detectors.
- The framework could be evaluated on newer diffusion or transformer-based generators to check whether the co-evolution benefit persists beyond the models used in the original experiments.
Load-bearing premise
The symbiotic attention and alignment steps can bridge the architectural gap between generative and discriminative models so that neither task loses performance.
What would settle it
Train the same generator and detector both jointly under the unified framework and independently; if the independent versions outperform the unified model on generation quality or detection accuracy, the co-evolutionary benefit is falsified.
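That falsification protocol reduces to a simple comparison harness. The sketch below uses illustrative metric names (FID for generation, accuracy for detection); the choice of metrics and numbers is an assumption, not taken from the paper:

```python
def co_evolution_holds(joint, independent,
                       higher_is_better=("detection_acc",),
                       lower_is_better=("fid",)):
    """Return True only if the jointly trained model matches or beats the
    independently trained pair on every metric; a single regression on
    either task falsifies the co-evolutionary claim."""
    for m in higher_is_better:
        if joint[m] < independent[m]:
            return False
    for m in lower_is_better:
        if joint[m] > independent[m]:
            return False
    return True

# Illustrative numbers only.
joint = {"fid": 9.1, "detection_acc": 0.97}
independent = {"fid": 10.4, "detection_acc": 0.95}
print(co_evolution_holds(joint, independent))  # True
```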
Original abstract
In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/Zhangyr2022/UniGenDet
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes UniGenDet, a unified generative-discriminative framework for co-evolutionary image generation and generated-image detection. It introduces a symbiotic multimodal self-attention mechanism together with a detector-informed generative alignment and a unified fine-tuning algorithm to bridge the architectural divergence between generative and discriminative models, allowing each task to improve the other; extensive experiments on multiple datasets are reported to establish state-of-the-art performance, with code released.
Significance. If the empirical claims hold, the work is significant for demonstrating a concrete route to mutual improvement between two fields that have evolved largely independently. The explicit design of information exchange (symbiotic attention and detector-informed alignment) and the release of code are strengths that support reproducibility and further investigation. The approach could influence subsequent research on adversarial and multimodal vision models.
Major comments (2)
- [§4] §4 (Experiments) and associated tables: the SOTA claim is load-bearing on the quantitative results; the manuscript must supply per-dataset numerical comparisons against recent baselines, ablation studies isolating the contribution of the symbiotic attention and alignment modules, and evidence that neither task degrades when the other is active.
- [§3.3] §3.3 (unified fine-tuning algorithm): the description of how the two heads are jointly optimized must include the precise loss weighting schedule and any hyper-parameters that control the information exchange; without these, it is impossible to verify that the claimed synergy is not the result of task-specific tuning.
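A concrete form of what the referee is asking for in §3.3 is a stated loss-weighting schedule for the joint objective. The linear warm-up below is a placeholder assumption for illustration, not the authors' actual schedule or hyper-parameters:

```python
def loss_weight(step, warmup_steps=1000, lam_max=0.5):
    """Weight on the detection loss, ramped linearly over warmup_steps so
    that early training is dominated by the generative objective.
    Placeholder schedule; the paper would need to report its real one."""
    return lam_max * min(step / warmup_steps, 1.0)

def total_loss(gen_loss, det_loss, step):
    """Convex combination of the two task losses under the schedule above."""
    lam = loss_weight(step)
    return (1.0 - lam) * gen_loss + lam * det_loss

print(total_loss(gen_loss=2.0, det_loss=0.5, step=0))     # 2.0
print(total_loss(gen_loss=2.0, det_loss=0.5, step=1000))  # 1.25
```

Reporting the schedule in this explicit form (initial weight, ramp length, ceiling) would let readers check that the claimed synergy is not an artifact of task-specific tuning.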
Minor comments (2)
- [Figure 2] Figure 2 (architecture diagram): the flow of the detector-informed alignment signal is difficult to trace; adding explicit arrows or a step-by-step legend would improve clarity.
- [Abstract] The abstract states 'state-of-the-art performance' without any numerical anchors; a single sentence summarizing the largest reported gains would help readers assess the magnitude of the advance.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recognition of the potential impact of UniGenDet. We address each major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [§4] §4 (Experiments) and associated tables: the SOTA claim is load-bearing on the quantitative results; the manuscript must supply per-dataset numerical comparisons against recent baselines, ablation studies isolating the contribution of the symbiotic attention and alignment modules, and evidence that neither task degrades when the other is active.
  Authors: We agree that additional experimental details are necessary to robustly support the SOTA claims. In the revised manuscript, Section 4 and its tables will be expanded to include per-dataset numerical comparisons against recent baselines. We will add ablation studies that isolate the individual contributions of the symbiotic multimodal self-attention mechanism and the detector-informed generative alignment. We will also report performance metrics for both the generation and detection tasks under joint training versus isolated training to confirm that neither task degrades when the other is active. Revision: yes.
- Referee: [§3.3] §3.3 (unified fine-tuning algorithm): the description of how the two heads are jointly optimized must include the precise loss weighting schedule and any hyper-parameters that control the information exchange; without these, it is impossible to verify that the claimed synergy is not the result of task-specific tuning.
  Authors: We acknowledge that the current description in Section 3.3 requires more precise details for full reproducibility. In the revised manuscript, we will specify the exact loss weighting schedule for joint optimization of the two heads and list all hyper-parameters that govern information exchange, including coefficients and schedules for the symbiotic attention and alignment components. This will allow verification that the reported synergy stems from the unified framework. Revision: yes.
Circularity Check
No significant circularity
Full rationale
The paper proposes a new unified generative-discriminative framework (UniGenDet) with symbiotic multimodal self-attention, unified fine-tuning, and detector-informed generative alignment as design elements to enable co-evolution between generation and detection tasks. No equations, mathematical derivations, parameter fittings, or self-citations appear in the abstract or high-level description that reduce any claimed result to its own inputs by construction. The SOTA performance claims rest on empirical experiments rather than any self-definitional or fitted-input logic, making the derivation chain self-contained and independent of the patterns that would trigger circularity.