pith. machine review for the scientific record.

arxiv: 2602.01554 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI · cs.CV

Recognition: 2 theorem links · Lean Theorem

InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 08:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords information bottleneck · visual tokenization · unified multimodal models · mutual information regularization · shared tokenizer · image understanding · image generation
0 comments

The pith

InfoTok imposes mutual-information constraints on shared visual tokens to improve both understanding and generation in unified MLLMs without extra data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing shared visual tokenizers in unified multimodal models lack an explicit rule for preserving the right information under limited token budgets. InfoTok applies the information bottleneck principle to enforce a trade-off that favors reusable structure over redundant detail, using differentiable estimators to make the constraints practical. A reader would care because the same token interface must serve both semantic reasoning and pixel-level synthesis, and a capacity-aware regularizer offers a direct way to allocate that budget more effectively. The approach is shown to lift performance on both tasks when dropped into existing unified architectures.

Core claim

InfoTok is an information-regularized tokenization mechanism grounded in the Information Bottleneck principle that explicitly controls information flow from images to shared tokens by imposing mutual-information constraints, instantiated via variational IB and HSIC estimators, thereby encouraging compression of task-irrelevant variation while preserving cross-modal consistency for both understanding and generation.
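
For orientation, the classical Information Bottleneck objective this claim invokes trades compression of the image against retention of task-relevant information. The paper's exact formulation, including its cross-modal consistency term, may differ from this textbook form, which is given only as a reference point:

    \min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)

Here X is the input image, Z the shared visual tokens, Y the multimodal targets, and β > 0 sets how strongly task relevance is weighted against compression.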

What carries the argument

InfoTok, the information-regularized tokenization mechanism that imposes mutual-information constraints on the shared visual tokenizer to enforce a compression-versus-relevance trade-off.
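
A minimal sketch of how such a constraint could be attached to a tokenizer's training objective, assuming the tokenizer emits a diagonal-Gaussian posterior over its tokens; the function names, the standard-normal prior, and the coefficient value are illustrative assumptions, not the paper's implementation.

    import torch

    def vib_penalty(mu, logvar):
        # Closed-form KL(q(z|x) || N(0, I)) summed over token dimensions and
        # averaged over the batch: a tractable surrogate that upper-bounds I(X; Z)
        # under the assumed Gaussian posterior and standard-normal prior.
        return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()

    def infotok_style_loss(understanding_loss, generation_loss, mu, logvar, beta=1e-3):
        # Hypothetical composite objective: the usual task losses plus a compression
        # term weighted by beta, mirroring the IB compression-relevance trade-off.
        return understanding_loss + generation_loss + beta * vib_penalty(mu, logvar)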

If this is right

  • Shared visual tokens become more reusable across understanding and generation tasks under explicit information constraints.
  • No additional training data is required to obtain consistent gains in both modalities.
  • The same capacity-constrained perspective can be applied to other tokenization stages inside unified models.
  • Practical MI estimators such as variational IB and HSIC suffice to realize the regularization in high-dimensional visual settings.
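
To make the last bullet concrete, here is a minimal sketch of a biased empirical HSIC estimate with Gaussian kernels, the kind of differentiable dependence measure named above; the kernel bandwidths and batch-level usage are assumptions, not the paper's recipe.

    import torch

    def rbf_kernel(x, sigma):
        # Pairwise squared Euclidean distances mapped through a Gaussian kernel,
        # giving an (n, n) similarity matrix.
        d2 = torch.cdist(x, x).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))

    def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
        # Biased empirical HSIC between batches x (n, d_x) and y (n, d_y):
        # HSIC = trace(K H L H) / (n - 1)^2, where H centers the kernel matrices.
        # Larger values indicate stronger statistical dependence between x and y.
        n = x.size(0)
        K = rbf_kernel(x, sigma_x)
        L = rbf_kernel(y, sigma_y)
        H = torch.eye(n, device=x.device) - torch.full((n, n), 1.0 / n, device=x.device)
        return torch.trace(K @ H @ L @ H) / (n - 1) ** 2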

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization idea could be tested on tokenizers for video or audio to check whether capacity-aware compression generalizes across modalities.
  • If the MI constraints prove stable, they might reduce the need for hand-tuned architecture choices in future unified models.
  • Extending the approach to later layers of the language model itself could further tighten the overall information budget.

Load-bearing premise

The chosen differentiable estimators for mutual information sufficiently enforce the intended constraints and produce tokens that simultaneously support semantic abstraction and visual detail.

What would settle it

Applying InfoTok to any of the three tested unified MLLMs and observing no gain or a net loss on standard image-understanding and image-generation benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.01554 by Bo Li, Lv Tang, Tianyi Zheng, Xingyu Li.

Figure 1: Performance comparison of three representative unified MLLMs (Show-o2 [10], OpenUni [8] and Harmon [4]) before …
Figure 2: Illustration of our information-regularized tokenization (InfoTok). (a) depicts a standard unified MLLM with shared …
Figure 3: We apply InfoTok to three representative shared-token unified MLLMs and observe consistent improvements in generation …
read the original abstract

Unified multimodal large language models (MLLMs) aim to unify image understanding and image generation within a single framework, where a shared visual tokenizer serves as the sole interface that maps high-dimensional images into a limited token budget for downstream multimodal reasoning and synthesis. However, existing shared-token designs are largely architecture-driven and lack an explicit criterion for what information should be preserved to simultaneously support semantic abstraction and visual detail. In this paper, we adopt a capacity-constrained perspective, viewing the shared tokenizer as a compute-bounded learner whose finite representational budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this view, we propose InfoTok, an information-regularized tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok explicitly controls information flow from images to shared tokens to multimodal outputs by imposing mutual-information (MI) constraints that enforce a principled trade-off between compression and task relevance, while also encouraging cross-modal consistency. Because MI is intractable for high-dimensional visual representations, we instantiate InfoTok with practical, differentiable dependence estimators, including a variational IB formulation and a Hilbert-Schmidt Independence Criterion (HSIC) based alternative. Integrated into three representative unified MLLMs without introducing any additional training data, InfoTok consistently improves both image understanding and generation performance. These results support information-regularized visual tokenization as a sound basis for token learning in unified MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes InfoTok, an information-regularized tokenization mechanism for shared visual tokenizers in unified MLLMs. Grounded in the Information Bottleneck principle, it imposes mutual-information constraints via differentiable estimators (variational IB and HSIC) to enforce a compression-relevance trade-off that supports both semantic abstraction and visual detail. The approach is integrated into three representative unified MLLMs without extra training data and is reported to yield consistent gains in image understanding and generation performance.

Significance. If the central claim holds, InfoTok offers a principled, architecture-agnostic way to regularize capacity-constrained tokenizers using information theory, addressing a gap in existing shared-token designs. The no-extra-data integration and dual-task improvements would be practically valuable for unified MLLMs. However, significance hinges on whether the surrogate MI estimators reliably enforce the intended constraints or merely provide generic regularization; without that verification the contribution reduces to an empirical regularization trick.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method): The central claim that variational IB and HSIC estimators enforce explicit MI constraints for a principled compression-relevance trade-off is load-bearing, yet the manuscript provides no quantitative verification (e.g., estimated I(image; tokens) bounds, ablation on estimator tightness, or comparison against true MI proxies) that these surrogates actually achieve the intended capacity control in high-dimensional visual feature regimes. If the estimators exhibit the known high variance/bias documented for HSIC and variational bounds on visual data, observed gains may not stem from the information-theoretic mechanism.
  2. [§4] §4 (experiments): The statement that InfoTok 'consistently improves' both understanding and generation across three models lacks reported effect sizes, statistical significance, or controls that isolate the MI regularization from other training differences. Without these, the cross-model claim cannot be evaluated as evidence for the IB-based design.
minor comments (2)
  1. [§3] Notation for the two MI estimators should be unified and clearly distinguished from the true mutual information I(·;·) to avoid implying exact enforcement.
  2. [Abstract] The abstract claims 'no additional training data' but does not specify whether the original MLLM training recipes were held exactly constant or whether hyper-parameters were re-tuned for the new regularizers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the need for stronger validation of the information-theoretic claims and more rigorous experimental reporting. We address each point below and will revise the manuscript to incorporate additional analyses and controls.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): The central claim that variational IB and HSIC estimators enforce explicit MI constraints for a principled compression-relevance trade-off is load-bearing, yet the manuscript provides no quantitative verification (e.g., estimated I(image; tokens) bounds, ablation on estimator tightness, or comparison against true MI proxies) that these surrogates actually achieve the intended capacity control in high-dimensional visual feature regimes. If the estimators exhibit the known high variance/bias documented for HSIC and variational bounds on visual data, observed gains may not stem from the information-theoretic mechanism.

    Authors: We agree that explicit verification of the MI constraints would strengthen the central claim. The variational IB formulation provides a tractable surrogate lower bound on the relevant mutual information terms by design, and HSIC serves as a kernel-based dependence measure whose consistency properties are established in the literature. However, we acknowledge the absence of direct quantitative checks such as reported I(image; tokens) estimates or tightness ablations in the current manuscript. In the revision we will add these: (i) estimated MI values computed via the variational bounds and HSIC on held-out visual features before and after regularization, (ii) sensitivity analysis varying the regularization coefficients to demonstrate the compression-relevance trade-off, and (iii) discussion of known estimator biases with empirical evidence that performance gains scale with the strength of the MI terms rather than generic regularization. revision: yes

  2. Referee: [§4] §4 (experiments): The statement that InfoTok 'consistently improves' both understanding and generation across three models lacks reported effect sizes, statistical significance, or controls that isolate the MI regularization from other training differences. Without these, the cross-model claim cannot be evaluated as evidence for the IB-based design.

    Authors: We accept that the current experimental section would benefit from greater statistical rigor. The reported improvements are based on standard benchmark metrics across three distinct unified MLLM architectures, but we did not include effect sizes, multiple-run statistics, or explicit ablations that remove only the IB/HSIC terms while keeping all other training elements fixed. In the revised version we will: (i) report mean improvements with standard deviations over at least three random seeds, (ii) include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the key metrics, (iii) add a dedicated ablation table isolating the contribution of the mutual-information regularizers, and (iv) provide controls that compare against equivalent-capacity models trained with non-information-theoretic regularizers to better attribute gains to the IB mechanism. revision: yes
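
As one concrete way to report the promised statistics, a minimal sketch of paired tests over per-seed benchmark scores using SciPy; the score arrays below are placeholders, not results from the paper.

    import numpy as np
    from scipy import stats

    # Placeholder per-seed scores for a baseline and its InfoTok-regularized variant.
    baseline = np.array([61.2, 60.8, 61.5, 61.0, 61.3])
    infotok = np.array([62.4, 62.1, 62.9, 62.0, 62.6])

    diff = infotok - baseline
    print(f"mean improvement: {diff.mean():.2f} +/- {diff.std(ddof=1):.2f}")

    # Paired t-test and Wilcoxon signed-rank test on matched per-seed pairs.
    t_stat, t_p = stats.ttest_rel(infotok, baseline)
    w_stat, w_p = stats.wilcoxon(infotok, baseline)
    print(f"paired t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")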

Circularity Check

0 steps flagged

No circularity; derivation adopts external IB principle with standard estimators

full rationale

The paper grounds InfoTok in the established Information Bottleneck principle and instantiates it via known differentiable estimators (variational IB and HSIC) because exact MI is intractable. No equations or steps are shown that reduce claimed performance gains to a fitted parameter renamed as prediction, a self-defined quantity, or a self-citation chain. The capacity constraint is enforced through external regularization objectives whose validity rests on prior literature rather than the present work's outputs. Empirical improvements on three unified MLLMs without extra data constitute independent evidence, not a tautology. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is visible in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that mutual information between high-dimensional images and tokens can be tractably approximated by variational IB and HSIC estimators, and that enforcing these approximations yields tokens that are simultaneously useful for understanding and generation.

axioms (1)
  • domain assumption: Mutual information for high-dimensional visual data can be reliably estimated via variational IB and HSIC without introducing bias that harms downstream task performance.
    Abstract states these as practical instantiations because exact MI is intractable.

pith-pipeline@v0.9.0 · 5572 in / 1310 out tokens · 26289 ms · 2026-05-16T08:06:57.711582+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 15 internal anchors

  1. [1]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation,

    C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo, “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” in CVPR. IEEE, 2025, pp. 12966–12977

  2. [2]

    Emerging Properties in Unified Multimodal Pretraining

    C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, S. Guang, and H. Fan, “Emerging properties in unified multimodal pretraining,” CoRR, vol. abs/2505.14683, 2025

  3. [3]

    VILA-U: a unified foundation model integrating visual understanding and generation,

    Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, S. Han, and Y. Lu, “VILA-U: a unified foundation model integrating visual understanding and generation,” in ICLR. OpenReview.net, 2025

  4. [4]

    Harmonizing visual representations for unified multimodal understanding and generation,

    S. Wu, W. Zhang, L. Xu, S. Jin, Z. Wu, Q. Tao, W. Liu, W. Li, and C. C. Loy, “Harmonizing visual representations for unified multimodal understanding and generation,” in ICCV. IEEE, 2025

  5. [5]

    Ming-univision: Joint image understanding and generation with a unified continuous tokenizer,

    Z. Huang, D. Zheng, C. Zou, R. Liu, X. Wang, K. Ji, W. Chai, J. Sun, L. Wang, Y. Lv et al., “Ming-univision: Joint image understanding and generation with a unified continuous tokenizer,” CoRR, vol. abs/2510.06590, 2025

  6. [6]

    Unitok: A unified tokenizer for visual generation and understanding,

    C. Ma, Y. Jiang, J. Wu, J. Yang, X. Yu, Z. Yuan, B. Peng, and X. Qi, “Unitok: A unified tokenizer for visual generation and understanding,” CoRR, vol. abs/2502.20321, 2025

  7. [7]

    Vision as a dialect: Unifying visual understanding and generation via text-aligned representations,

    J. Han, H. Chen, Y. Zhao, H. Wang, Q. Zhao, Z. Yang, H. He, X. Yue, and L. Jiang, “Vision as a dialect: Unifying visual understanding and generation via text-aligned representations,” CoRR, vol. abs/2506.18898, 2025

  8. [8]

    Openuni: A simple baseline for unified multimodal understanding and generation,

    S. Wu, Z. Wu, Z. Gong, Q. Tao, S. Jin, Q. Li, W. Li, and C. C. Loy, “Openuni: A simple baseline for unified multimodal understanding and generation,” CoRR, vol. abs/2505.23661, 2025

  9. [9]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    B. Lin, Z. Li, X. Cheng, Y. Niu, Y. Ye, X. He, S. Yuan, W. Yu, S. Wang, Y. Ge, Y. Pang, and L. Yuan, “Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation,” CoRR, vol. abs/2506.03147, 2025

  10. [10]

    Show-o2: Improved native unified multimodal models,

    J. Xie, Z. Yang, and M. Z. Shou, “Show-o2: Improved native unified multimodal models,” in NeurIPS, 2025

  11. [11]

    Unieval: Unified holistic evaluation for unified multimodal understanding and generation,

    Y. Li, H. Wang, Q. Zhang, B. Xiao, C. Hu, H. Wang, and X. Li, “Unieval: Unified holistic evaluation for unified multimodal understanding and generation,” CoRR, vol. abs/2505.10483, 2025

  12. [12]

    GQA: A new dataset for real-world visual reasoning and compositional question answering,

    D. A. Hudson and C. D. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR. IEEE, 2019, pp. 6700–6709

  13. [13]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” CoRR, vol. abs/2307.16125, 2023

  14. [14]

    Evaluating object hallucination in large vision-language models,

    Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J. Wen, “Evaluating object hallucination in large vision-language models,” in EMNLP. Association for Computational Linguistics, 2023, pp. 292–305

  15. [15]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and R. Ji, “MME: A comprehensive evaluation benchmark for multimodal large language models,” CoRR, vol. abs/2306.13394, 2023

  16. [16]

    Mm-vet: Evaluating large multimodal models for integrated capabilities,

    W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang, “Mm-vet: Evaluating large multimodal models for integrated capabilities,” in ICML. OpenReview.net, 2024

  17. [17]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

    X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in CVPR, 2024, pp. 9556–9567

  18. [18]

    Geneval: An object-focused framework for evaluating text-to-image alignment,

    D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object-focused framework for evaluating text-to-image alignment,” in NeurIPS, 2023

  19. [19]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation,

    J. Ye, D. Jiang, Z. Wang, L. Zhu, Z. Hu, Z. Huang, J. He, Z. Yan, J. Yu, H. Li, C. He, and W. Li, “Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation,” CoRR, vol. abs/2508.09987, 2025

  20. [20]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Y. Niu, M. Ning, M. Zheng, B. Lin, P. Jin, J. Liao, K. Ning, B. Zhu, and L. Yuan, “WISE: A world knowledge-informed semantic evaluation for text-to-image generation,” CoRR, vol. abs/2503.07265, 2025

  21. [21]

    The information bottleneck method

    N. Tishby, F. C. N. Pereira, and W. Bialek, “The information bottleneck method,” CoRR, vol. physics/0004057, 2000

  22. [22]

    Deep learning and the information bottleneck principle,

    N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in ITW. IEEE, 2015, pp. 1–5

  23. [23]

    Deep variational information bottleneck,

    A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in ICLR. OpenReview.net, 2017

  24. [24]

    Revisiting hilbert-schmidt information bottleneck for adversarial robustness,

    Z. Wang, T. Jian, A. Masoomi, S. Ioannidis, and J. G. Dy, “Revisiting hilbert-schmidt information bottleneck for adversarial robustness,” in NeurIPS, 2021, pp. 586–597

  25. [25]

    A survey on multimodal large language models,

    S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” National Science Review, vol. 11, no. 12, p. nwae403, 2024

  26. [26]

    A survey of multimodal learning: Methods, applications, and future,

    Y. Yuan, Z. Li, and B. Zhao, “A survey of multimodal learning: Methods, applications, and future,” ACM Comput. Surv., vol. 57, no. 7, pp. 167:1–167:34, 2025

  27. [27]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023

  28. [28]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models,

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” in ICLR. OpenReview.net, 2024

  29. [29]

    Instructblip: Towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” in NeurIPS, 2023

  30. [30]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 19730–19742

  31. [31]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” in NeurIPS, 2022, pp. 23716–23736

  32. [32]

    Querying as prompt: Parameter-efficient learning for multimodal language model,

    T. Liang, J. Huang, M. Kong, L. Chen, and Q. Zhu, “Querying as prompt: Parameter-efficient learning for multimodal language model,” in CVPR, 2024, pp. 26855–26865

  33. [33]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR. IEEE, 2022, pp. 10674–10685

  34. [34]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in ICCV. IEEE, 2023, pp. 4172–4182

  35. [35]

    Text-to-image diffusion models in generative AI: A survey,

    C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to-image diffusion models in generative AI: A survey,” CoRR, vol. abs/2303.07909, 2023

  36. [36]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models,

    J. Zhang, Q. Huang, J. Liu, X. Guo, and D. Huang, “Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models,” in CVPR. IEEE, 2025, pp. 23464–23473

  37. [37]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” CoRR, vol. abs/2204.06125, 2022

  38. [38]

    Editar: Unified conditional generation with autoregressive models,

    J. Mu, N. Vasconcelos, and X. Wang, “Editar: Unified conditional generation with autoregressive models,” in CVPR. IEEE, 2025, pp. 7899–7909

  39. [39]

    Dreamllm: Synergistic multimodal comprehension and creation,

    R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, X. Kong, X. Zhang, K. Ma, and L. Yi, “Dreamllm: Synergistic multimodal comprehension and creation,” in ICLR. OpenReview.net, 2024

  40. [40]

    Making llama SEE and draw with SEED tokenizer,

    Y. Ge, S. Zhao, Z. Zeng, Y. Ge, C. Li, X. Wang, and Y. Shan, “Making llama SEE and draw with SEED tokenizer,” in ICLR. OpenReview.net, 2024

  41. [41]

    Generative multimodal models are in-context learners,

    Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, and X. Wang, “Generative multimodal models are in-context learners,” in CVPR. IEEE, 2024, pp. 14398–14409

  42. [42]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” CoRR, vol. abs/2405.09818, 2024

  43. [43]

    Fast autoregressive models for continuous latent generation,

    T. Hang, J. Bao, F. Wei, and D. Chen, “Fast autoregressive models for continuous latent generation,” CoRR, vol. abs/2504.18391, 2025

  44. [44]

    Emu3: Next-Token Prediction is All You Need

    X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, Y. Zhao, Y. Ao, X. Min, T. Li, B. Wu, B. Zhao, B. Zhang, L. Wang, G. Liu, Z. He, X. Yang, J. Liu, Y. Lin, T. Huang, and Z. Wang, “Emu3: Next-token prediction is all you need,” CoRR, vol. abs/2409.18869, 2024

  45. [45]

    MUSE-VL: modeling unified VLM through semantic discrete encoding,

    R. Xie, C. Du, P. Song, and C. Liu, “MUSE-VL: modeling unified VLM through semantic discrete encoding,” CoRR, vol. abs/2411.17762, 2024

  46. [46]

    Growing visual generative capacity for pre-trained mllms,

    H. Wang, J. Han, Z. Yang, Q. Zhao, S. Lin, X. Yue, A. Shrivastava, Z. Yang, and H. Chen, “Growing visual generative capacity for pre-trained mllms,” CoRR, vol. abs/2510.01546, 2025

  47. [47]

    Toklip: Marry visual tokens to CLIP for multimodal comprehension and generation,

    H. Lin, T. Wang, Y. Ge, Y. Ge, Z. Lu, Y. Wei, Q. Zhang, Z. Sun, and Y. Shan, “Toklip: Marry visual tokens to CLIP for multimodal comprehension and generation,” CoRR, vol. abs/2505.05422, 2025

  48. [48]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    J. Chen, Z. Xu, X. Pan, Y. Hu, C. Qin, T. Goldstein, L. Huang, T. Zhou, S. Xie, S. Savarese, L. Xue, C. Xiong, and R. Xu, “Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset,” CoRR, vol. abs/2505.09568, 2025

  49. [49]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu, “Qwen2...

  50. [50]

    Every FLOP counts: Scaling a 300b mixture-of-experts LING LLM without premium gpus,

    L. Team, B. Zeng, C. Huang, C. Zhang, C. Tian, C. Chen, D. Jin, F. Yu, F. Zhu, F. Yuan, F. Wang, G. Wang, G. Zhai, H. Zhang, H. Li, J. Zhou, J. Liu, J. Fang, J. Ou, J. Hu, J. Luo, J. Zhang, J. Liu, J. Sha, J. Qian, J. Wu, J. Zhao, J. Li, J. Feng, J. Di, J. Xu, J. Yao, K. Xu, K. Du, L. Li, L. Liang, L. Yu, L. Tang, L. Ju, P. Xu, Q. Cui, S. Liu, S. Li, S. S...

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  52. [52]

    Show-o: One single transformer to unify multimodal understanding and generation,

    J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer to unify multimodal understanding and generation,” in ICLR. OpenReview.net, 2025

  53. [53]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-pro: Unified multimodal understanding and generation with data and model scaling,” CoRR, vol. abs/2501.17811, 2025

  54. [54]

    Transfusion: Predict the next token and diffuse images with one multi-modal model,

    C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy, “Transfusion: Predict the next token and diffuse images with one multi-modal model,” in ICLR. OpenReview.net, 2025

  55. [55]

    On variational bounds of mutual information,

    B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker, “On variational bounds of mutual information,” in ICML, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 5171–5180

  56. [56]

    Auto-encoding variational bayes,

    D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, Y. Bengio and Y. LeCun, Eds., 2014

  57. [57]

    Representation Learning with Contrastive Predictive Coding

    A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018

  58. [58]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE Computer Society, 2009, pp. 248–255

  59. [59]

    Densefusion-1m: Merging vision experts for comprehensive multimodal perception,

    X. Li, F. Zhang, H. Diao, Y. Wang, X. Wang, and L. Duan, “Densefusion-1m: Merging vision experts for comprehensive multimodal perception,” NeurIPS, pp. 18535–18556, 2024

  60. [60]

    Ovis-u1 technical report,

    G. Wang, S. Zhao, X. Zhang, L. Cao, P. Zhan, L. Duan, S. Lu, M. Fu, X. Chen, J. Zhao, Y. Li, and Q. Chen, “Ovis-u1 technical report,” CoRR, vol. abs/2506.23044, 2025

  61. [61]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...