SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models
Pith reviewed 2026-07-03 23:13 UTC · model grok-4.3
The pith
A dual-stream tokenizer with internal self-alignment unifies semantic understanding and pixel generation in one multimodal model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPAR establishes a unified multimodal framework through semantic-pixel self-alignment and adaptive routing. Its asymmetric dual-stream unified tokenizer reconciles semantic perception with pixel-level reconstruction by mapping a lightweight semantic stream and a Transformer-augmented pixel stream into a single compact latent space. The self-aligned generation paradigm uses the optimized tokenizer as an internal alignment teacher for the diffusion model, with no external dependencies. Dynamic token routing lets each token adaptively aggregate multi-layer MLLM features according to its semantic demands. Extensive experiments show this yields state-of-the-art performance for unified archit
What carries the argument
The asymmetric dual-stream unified tokenizer, which anchors discriminative semantic features in a lightweight stream and recovers fine-grained pixel details in a Transformer-augmented stream, with both streams mapped into one compact latent space.
If this is right
- Unified multimodal models can perform both high-fidelity image generation and accurate visual understanding without separate specialized components.
- The diffusion model aligns to semantic spaces using only internal tokenizer feedback, removing the need for external teacher models.
- Dynamic token routing allows flexible aggregation of features from different MLLM layers for each token according to its distinct demands.
- The framework achieves state-of-the-art results among unified architectures while preserving core visual understanding capabilities.
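The routing idea in the list above can be illustrated with a minimal sketch. The page does not describe the actual router, so everything here is an assumption: a hypothetical per-layer gate vector scores each layer's feature for one token, a softmax turns the scores into mixing weights, and the token's output is the weighted sum across layers. The names `route_token` and `gate_weights` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(layer_features, gate_weights):
    """Aggregate one token's features across MLLM layers.

    layer_features: list of L vectors (length D), the token's hidden
        state at each layer.
    gate_weights: list of L vectors (length D), a hypothetical learned
        gate per layer (an assumption, not the paper's design).
    Returns (mixed_vector, per_layer_weights).
    """
    # Score each layer by the dot product of its gate and its feature.
    scores = [
        sum(g * f for g, f in zip(gate, feat))
        for gate, feat in zip(gate_weights, layer_features)
    ]
    weights = softmax(scores)
    dim = len(layer_features[0])
    # Weighted sum of the per-layer features, one dimension at a time.
    mixed = [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]
    return mixed, weights
```

With all-zero gates the weights are uniform and the result reduces to a plain average over layers, which is a convenient sanity check for the sketch.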
Where Pith is reading between the lines
- The internal self-alignment could lower training costs by avoiding separate teacher networks for alignment.
- The single latent space might support downstream tasks such as language-guided image editing within the same model without additional adapters.
- Applying the dual-stream design to video or audio modalities could test whether the same unification pattern holds beyond static images.
- Direct measurement of token routing decisions on held-out image categories would show whether the adaptive aggregation improves robustness on diverse inputs.
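One way the routing measurement suggested in the last item could be set up is sketched below. It assumes routing decisions are available as per-token weight vectors that sum to one, and it summarizes them with Shannon entropy averaged per image category. The record format and function names are hypothetical; the paper's page does not specify any such diagnostic.

```python
import math


def routing_entropy(weights):
    """Shannon entropy (in nats) of one token's layer-mixing weights."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_entropy_by_category(records):
    """Average routing entropy per category.

    records: iterable of (category, weights) pairs, one per token.
    Low entropy means a token relies on few layers; high entropy means
    it spreads its weight across many layers.
    """
    totals = {}
    for category, weights in records:
        running_sum, count = totals.get(category, (0.0, 0))
        totals[category] = (running_sum + routing_entropy(weights), count + 1)
    return {c: s / n for c, (s, n) in totals.items()}
```

Comparing these averages between in-distribution and held-out categories would indicate whether routing behaviour shifts on unfamiliar inputs, though it would not by itself show that the shift helps.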
Load-bearing premise
The asymmetric dual-stream unified tokenizer can anchor discriminative semantic features and recover fine-grained pixel details simultaneously in a single compact latent space without external supervision or loss of either capability.
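The premise can be made concrete with a generic two-term objective. This is only a sketch of the kind of loss the premise implies: the page gives no formulation, so the symbols below are assumptions. Here $x$ is the input image, $z$ the shared latent, $D$ a hypothetical decoder, $\mathcal{L}_{\text{sem}}$ an unspecified semantic anchoring term, and $\lambda_{\text{sem}}, \lambda_{\text{pix}}$ hypothetical weights.

```latex
\mathcal{L}_{\text{tok}}
  = \lambda_{\text{sem}} \, \mathcal{L}_{\text{sem}}(z)
  + \lambda_{\text{pix}} \, \lVert x - D(z) \rVert_2^2
```

The premise holds only if some setting of the two weights keeps both terms low at once; if lowering one term necessarily raises the other, the single latent space trades one capability for the other.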
What would settle it
A controlled comparison in which a model built with the SPAR tokenizer shows either substantially lower visual understanding accuracy than standard MLLMs or substantially lower generation and reconstruction quality than specialized diffusion models would disprove the central claim.
Original abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.
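The self-aligned generation paradigm described in the abstract can be sketched as a feature-matching loss. The abstract does not state the alignment objective, so this example assumes a cosine-similarity loss between the diffusion model's intermediate token features and the tokenizer's latent features, with the tokenizer acting as a fixed teacher. The function names and the choice of cosine similarity are assumptions, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_alignment_loss(diffusion_feats, tokenizer_feats):
    """Mean (1 - cosine similarity) over tokens.

    diffusion_feats: per-token feature vectors from the generator.
    tokenizer_feats: per-token targets from the tokenizer, treated as a
        fixed internal teacher (assumed; no gradient would flow to it).
    """
    losses = [1.0 - cosine(d, t) for d, t in zip(diffusion_feats, tokenizer_feats)]
    return sum(losses) / len(losses)
```

The loss is zero when every generator feature points in the same direction as its teacher feature and grows as the directions diverge; in a real training loop it would be added to the diffusion objective with some weight.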
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SPAR, a unified multimodal framework for MLLMs that addresses the discrepancy between semantic perception and pixel-level reconstruction. It proposes an asymmetric dual-stream unified tokenizer (lightweight semantic stream plus Transformer-augmented pixel stream) to create a compact latent space, a self-aligned generation paradigm that uses the tokenizer as an internal teacher for the diffusion model, and Dynamic Token Routing to aggregate multi-layer MLLM features adaptively. The abstract claims this yields SOTA performance on generation, reconstruction, and visual understanding without external supervision.
Significance. If the quantitative results and ablations hold, the work would be significant for enabling self-contained unified architectures that avoid external teachers while preserving both discriminative and generative capabilities, a key open challenge in multimodal modeling.
major comments (2)
- [Abstract] Abstract: the central claim that the asymmetric dual-stream tokenizer simultaneously anchors discriminative semantic features and recovers fine-grained pixel details 'without external supervision or loss of either capability' is load-bearing for all subsequent claims (self-alignment, SOTA performance, preservation of understanding), yet no derivation, loss formulation, or ablation is provided to demonstrate this is achievable.
- [Abstract] Abstract: the assertion of 'establishing the state-of-the-art for unified architectures' with 'exceptional generation and reconstruction quality' cannot be evaluated because the text supplies no tables, metrics, baselines, or experimental setup, leaving the performance claims unsupported.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address the two major comments on the abstract below, clarifying where the supporting material appears in the full manuscript while noting that abstracts are necessarily concise.
- Referee: [Abstract] Abstract: the central claim that the asymmetric dual-stream tokenizer simultaneously anchors discriminative semantic features and recovers fine-grained pixel details 'without external supervision or loss of either capability' is load-bearing for all subsequent claims (self-alignment, SOTA performance, preservation of understanding), yet no derivation, loss formulation, or ablation is provided to demonstrate this is achievable.
Authors: The loss formulation for the asymmetric dual-stream tokenizer (lightweight semantic stream + Transformer-augmented pixel stream) and the self-aligned generation paradigm (using the tokenizer as internal teacher) is derived in Sections 3.1 and 3.2. The objective combines semantic anchoring losses with pixel reconstruction terms without external models. Ablations confirming preservation of both capabilities appear in Section 4.3 (Table 4), showing performance retention on understanding benchmarks when generation is enabled. We can add a parenthetical reference to these sections in a revised abstract if the editor prefers. revision: partial
- Referee: [Abstract] Abstract: the assertion of 'establishing the state-of-the-art for unified architectures' with 'exceptional generation and reconstruction quality' cannot be evaluated because the text supplies no tables, metrics, baselines, or experimental setup, leaving the performance claims unsupported.
Authors: Quantitative results, metrics (FID, PSNR, CLIP score, VQA accuracy, etc.), baselines, and experimental setup are reported in Section 4, with direct comparisons in Tables 1 (generation), 2 (reconstruction), and 5 (understanding). The setup details (datasets, training protocol, evaluation protocols) are in Section 4.1. These tables support the SOTA claims for unified models. If the referee did not locate these sections, we can improve cross-referencing from the abstract. revision: no
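Of the metrics the rebuttal names, PSNR is simple enough to show directly. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), over flat lists of pixel values; the function name, the flat-list input, and the 8-bit default peak of 255 are choices made for this example and say nothing about the paper's evaluation protocol.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the reconstruction is closer to the reference; a perfect reconstruction gives infinity, which is why reported PSNR tables only make sense for lossy tokenizers.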
Circularity Check
No significant circularity identified
The paper proposes an asymmetric dual-stream tokenizer and self-aligned generation paradigm as novel architectural choices that reconcile semantic and pixel features without external supervision. No equations, fitted parameters, or self-citations are visible in the provided text that reduce any claimed prediction or result to an input by construction. The design elements (Dynamic Token Routing, internal alignment teacher) are presented as independent innovations rather than renamings or self-referential definitions. The derivation chain remains self-contained against external benchmarks, with the central assumption being an empirical design claim rather than a tautology.