Recognition: unknown
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
Pith reviewed 2026-05-10 00:46 UTC · model grok-4.3
The pith
LLaDA2.0-Uni unifies multimodal understanding and image generation inside one discrete diffusion large language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA2.0-Uni combines a semantic discrete tokenizer based on SigLIP-VQ, an MoE-based discrete diffusion LLM backbone that applies block-level masked diffusion to both text and vision inputs, and a diffusion decoder that reconstructs visual tokens into images. Supported by large-scale curated data and multi-stage training, the model matches specialized VLMs on multimodal understanding benchmarks and delivers strong performance on image generation and editing while enabling native interleaved generation and reasoning.
What carries the argument
Block-level masked diffusion inside the MoE dLLM backbone that treats discretized visual tokens identically to text tokens for joint understanding and generation tasks.
Load-bearing premise
Discretizing continuous visual inputs via SigLIP-VQ preserves enough semantic and perceptual information for both high-fidelity reconstruction and competitive understanding performance without task-specific trade-offs.
What would settle it
Compare LLaDA2.0-Uni scores on standard multimodal understanding benchmarks such as VQAv2 or POPE against leading specialized VLMs, and measure FID or human preference on image generation and editing tasks against dedicated diffusion models; consistent gaps in either direction would falsify the unification claim.
Figures
read the original abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) for multimodal understanding and generation. It discretizes visual inputs using a SigLIP-VQ tokenizer, employs an MoE-based dLLM backbone for block-level masked diffusion on interleaved text and vision tokens, and uses a diffusion decoder to reconstruct high-fidelity images. Supported by large-scale curated data and multi-stage training, the model claims to match specialized VLMs on understanding benchmarks while achieving strong results on image generation and editing tasks, with native support for interleaved reasoning and generation. Code and models are released.
Significance. If the empirical results hold after verification, this work would advance unified multimodal foundation models by demonstrating that a single discrete diffusion framework can handle both dense understanding and high-fidelity generation without apparent task specialization, potentially simplifying architectures and enabling more flexible interleaved workflows. The open release of code and models strengthens reproducibility and follow-up research.
major comments (2)
- [Abstract and architecture description] The central claim that SigLIP-VQ discretization enables competitive understanding and high-fidelity generation without task trade-offs is load-bearing, yet the architecture description provides no quantitative bounds on reconstruction error, semantic preservation metrics, or ablation studies isolating the VQ step's impact on downstream performance (e.g., no comparison of continuous vs. discrete visual tokens on the same backbone).
- [Abstract] The abstract states that the model 'matches specialized VLMs in multimodal understanding' and delivers 'strong performance in image generation and editing,' but reports no specific benchmark scores, baselines, or error analysis; without these in the experiments section, the no-trade-off claim cannot be evaluated.
minor comments (2)
- [Method] Clarify the exact number of diffusion steps, distillation schedule, and MoE routing details in the methods to allow reproduction.
- [Training] Add a table summarizing key hyperparameters (e.g., expert count, training data mixtures) for the multi-stage pipeline.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on strengthening the presentation of our central claims. We address each major comment below and have made revisions to the manuscript to provide the requested quantitative details and clarifications.
read point-by-point responses
-
Referee: [Abstract and architecture description] The central claim that SigLIP-VQ discretization enables competitive understanding and high-fidelity generation without task trade-offs is load-bearing, yet the architecture description provides no quantitative bounds on reconstruction error, semantic preservation metrics, or ablation studies isolating the VQ step's impact on downstream performance (e.g., no comparison of continuous vs. discrete visual tokens on the same backbone).
Authors: We agree that additional quantitative support for the SigLIP-VQ tokenizer would strengthen the architecture description and better substantiate the no-trade-off claim. In the revised manuscript, we will add explicit metrics on reconstruction error (including FID and LPIPS for tokenized image reconstruction) and semantic preservation (such as downstream task accuracy using discrete tokens). We will also incorporate an ablation study comparing the discrete SigLIP-VQ approach against continuous visual token variants on the identical MoE dLLM backbone, evaluating impacts on both multimodal understanding and image generation/editing performance. revision: yes
-
Referee: [Abstract] The abstract states that the model 'matches specialized VLMs in multimodal understanding' and delivers 'strong performance in image generation and editing,' but reports no specific benchmark scores, baselines, or error analysis; without these in the experiments section, the no-trade-off claim cannot be evaluated.
Authors: We will revise the abstract to include representative quantitative results, such as key benchmark scores for understanding tasks (with main VLM baselines) and generation/editing metrics (e.g., FID scores), to allow immediate evaluation of the claims. We will also expand the experiments section with more detailed error analysis, explicit no-trade-off comparisons across tasks, and additional baseline results to ensure the supporting evidence is fully accessible and rigorous. revision: yes
Circularity Check
No circularity: empirical architecture and training results are self-contained
full rationale
The paper presents LLaDA2.0-Uni as a model architecture (SigLIP-VQ tokenizer + MoE dLLM backbone + diffusion decoder) trained via multi-stage pipeline on curated data. All performance claims (matching VLMs on understanding, strong generation/editing, interleaved support) are stated as outcomes of that training and evaluation, with no derivation chain, equations, or 'predictions' that reduce by construction to fitted inputs or self-citations. The discretization step is an explicit design choice whose information-preservation properties are left to empirical verification rather than assumed via prior self-referential results. This is a standard empirical model paper with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (3)
- MoE expert count and routing parameters
- Diffusion timestep schedule and distillation steps
- Multi-stage training data mixture ratios
axioms (2)
- domain assumption Discrete tokenization via SigLIP-VQ preserves both semantic and reconstructible visual information
- standard math Standard transformer/MoE scaling laws apply to the discrete diffusion setting
invented entities (2)
-
SigLIP-VQ tokenizer
no independent evidence
-
Diffusion decoder
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,
work page internal anchor Pith review arXiv
-
[2]
Inclusion AI, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Dandan Zheng, Fudong Wang, Furong Xu, et al. Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation.arXiv preprint arXiv:2510.24821,
-
[3]
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745,
work page internal anchor Pith review arXiv
-
[4]
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699,
work page internal anchor Pith review arXiv
-
[5]
Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, and Yihao Liu. Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding.arXiv preprint arXiv:2507.14533, 2025a. Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng,...
-
[6]
Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, et al. Interleaved scene graphs for interleaved text-and-image generation assessment.arXiv preprint arXiv:2411.17188, 2024a. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et a...
-
[7]
Weave: A benchmark for evaluating multimodal editing models.arXiv preprint arXiv:2511.15738,
Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, et al. Weave: A benchmark for evaluating multimodal editing models.arXiv preprint arXiv:2511.15738,
-
[8]
20 Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, et al. Editmgt: Unleashing potentials of masked generative transformers in image editing.arXiv preprint arXiv:2512.11715,
-
[9]
PaddleOCR 3.0 Technical Report
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025a. Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multim...
work page internal anchor Pith review arXiv
-
[10]
arXiv preprint arXiv:2503.23461 (2025)
Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461,
-
[11]
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, and Hongsheng Li. Flux-reason-6m & prism-bench: A million-scale text-to-image reasoning dataset and comprehensive benchmark.arXiv preprint arXiv:2509.09680,
-
[12]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394,
work page internal anchor Pith review arXiv
-
[13]
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346,
work page internal anchor Pith review arXiv
-
[14]
arXiv preprint arXiv:2507.22058 (2025)
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058,
-
[15]
Ui-venus technical report: Building high-performance ui agents with rft
Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft.arXiv preprint arXiv:2508.10833,
-
[16]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135,
work page internal anchor Pith review arXiv
-
[17]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas M¨uller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image...
work page internal anchor Pith review arXiv
-
[18]
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vl-rewardbench: A challenging benchmark for vision-language generative reward models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025a. Shufan Li, Konstantinos Kallidromitis, Hritik...
-
[19]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147,
work page internal anchor Pith review arXiv
-
[20]
Ling Team. Every activation boosted: Scaling general reasoner to 1 trillion open language foundation.arXiv preprint arXiv:2510.22115,
-
[21]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024a. Dongyang Liu, Yi Xin, Shitian Zhao, Le Zhuo, Weifeng Lin, Xinyue Li, Qi Qin, Guangtao Zhai, Xiaohong Liu, Hongsheng Li, et al. Lumina-mgpt: Flexible photore...
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing.arXiv pre...
work page internal anchor Pith review arXiv
-
[24]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InProceedings of the European Conference on Computer Vision (ECCV), 2024b. Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin ...
work page internal anchor Pith review arXiv
-
[25]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081,
work page internal anchor Pith review arXiv
-
[26]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255,
work page internal anchor Pith review arXiv
-
[27]
Veomni: Scaling any modality model training with model-centric distributed recipe zoo
Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, et al. Veomni: Scaling any modality model training with model-centric distributed recipe zoo.arXiv preprint arXiv:2508.02317,
-
[28]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992,
work page internal anchor Pith review arXiv
-
[29]
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Chaoran Feng, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,
-
[30]
Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808,
-
[31]
Qwen Team. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025a. Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action.Qwen Blog. Accessed, pp. 10–04, 2025b. Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion l...
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864,
work page internal anchor Pith review arXiv
-
[33]
Peng Sun, Yi Jiang, and Tao Lin. Unified continuous generative models.arXiv preprint arXiv:2505.07447,
-
[34]
Peng Sun, Xinyi Shang, Tao Lin, and Zhiqiang Shen. Duality models: An embarrassingly simple one-step generation paradigm.arXiv preprint arXiv:2602.17682,
-
[35]
Longcat-image technical report
23 Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584,
-
[36]
Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, et al. Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing.arXiv preprint arXiv:2603.09877,
-
[37]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786,
work page internal anchor Pith review arXiv
-
[38]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems (NeurIPS), 2024a. Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, ...
work page internal anchor Pith review arXiv
-
[39]
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li, Zeqing Wang, Jinrui Zhang, and Lei Zhang. Mico-150k: A comprehensive dataset advancing multi-image composition.arXiv preprint arXiv:2512.07348,
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025a. Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, et al. Janus: Decoupling visual encoding for unified multimodal understanding...
work page internal anchor Pith review arXiv
-
[41]
Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding.arXiv preprint arXiv:2510.06308, 2025a. 24 Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, ...
-
[42]
Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation.arXiv preprint arXiv:2508.09987, 2025a. Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao...
-
[43]
Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, and Leoweiliang. Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a. Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, et al. Nextflow: Uni...
-
[44]
Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, et al. Factuality matters: When image generation and editing meet structured visuals.arXiv preprint arXiv:2510.05091,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.