pith. machine review for the scientific record.

arxiv: 2602.01554 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI · cs.CV

Recognition: 2 theorem links · Lean Theorem

InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 08:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords information bottleneck · visual tokenization · unified multimodal models · mutual information regularization · shared tokenizer · image understanding · image generation
0 comments

The pith

InfoTok imposes mutual-information constraints on shared visual tokens to improve both understanding and generation in unified MLLMs without extra data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing shared visual tokenizers in unified multimodal models lack an explicit rule for preserving the right information under limited token budgets. InfoTok applies the information bottleneck principle to enforce a trade-off that favors reusable structure over redundant detail, using differentiable estimators to make the constraints practical. A reader would care because the same token interface must serve both semantic reasoning and pixel-level synthesis, and a capacity-aware regularizer offers a direct way to allocate that budget more effectively. The approach is shown to lift performance on both tasks when dropped into existing unified architectures.

Core claim

InfoTok is an information-regularized tokenization mechanism grounded in the Information Bottleneck principle that explicitly controls information flow from images to shared tokens by imposing mutual-information constraints, instantiated via variational IB and HSIC estimators, thereby encouraging compression of task-irrelevant variation while preserving cross-modal consistency for both understanding and generation.
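
For orientation, the classical Information Bottleneck objective this claim invokes trades compression of the image against retention of task-relevant information. The paper's exact formulation, including its cross-modal consistency term, may differ from this textbook form, which is given only as a reference point:

    \min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)

Here X is the input image, Z the shared visual tokens, Y the multimodal targets, and β > 0 sets how strongly task relevance is weighted against compression.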

What carries the argument

InfoTok, the information-regularized tokenization mechanism that imposes mutual-information constraints on the shared visual tokenizer to enforce a compression-versus-relevance trade-off.
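
A minimal sketch of how such a constraint could be attached to a tokenizer's training objective, assuming the tokenizer emits a diagonal-Gaussian posterior over its tokens; the function names, the standard-normal prior, and the coefficient value are illustrative assumptions, not the paper's implementation.

    import torch

    def vib_penalty(mu, logvar):
        # Closed-form KL(q(z|x) || N(0, I)) summed over token dimensions and
        # averaged over the batch: a tractable surrogate that upper-bounds I(X; Z)
        # under the assumed Gaussian posterior and standard-normal prior.
        return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()

    def infotok_style_loss(understanding_loss, generation_loss, mu, logvar, beta=1e-3):
        # Hypothetical composite objective: the usual task losses plus a compression
        # term weighted by beta, mirroring the IB compression-relevance trade-off.
        return understanding_loss + generation_loss + beta * vib_penalty(mu, logvar)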

If this is right

  • Shared visual tokens become more reusable across understanding and generation tasks under explicit information constraints.
  • No additional training data is required to obtain consistent gains in both modalities.
  • The same capacity-constrained perspective can be applied to other tokenization stages inside unified models.
  • Practical MI estimators such as variational IB and HSIC suffice to realize the regularization in high-dimensional visual settings.
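
To make the last bullet concrete, here is a minimal sketch of a biased empirical HSIC estimate with Gaussian kernels, the kind of differentiable dependence measure named above; the kernel bandwidths and batch-level usage are assumptions, not the paper's recipe.

    import torch

    def rbf_kernel(x, sigma):
        # Pairwise squared Euclidean distances mapped through a Gaussian kernel,
        # giving an (n, n) similarity matrix.
        d2 = torch.cdist(x, x).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))

    def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
        # Biased empirical HSIC between batches x (n, d_x) and y (n, d_y):
        # HSIC = trace(K H L H) / (n - 1)^2, where H centers the kernel matrices.
        # Larger values indicate stronger statistical dependence between x and y.
        n = x.size(0)
        K = rbf_kernel(x, sigma_x)
        L = rbf_kernel(y, sigma_y)
        H = torch.eye(n, device=x.device) - torch.full((n, n), 1.0 / n, device=x.device)
        return torch.trace(K @ H @ L @ H) / (n - 1) ** 2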

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization idea could be tested on tokenizers for video or audio to check whether capacity-aware compression generalizes across modalities.
  • If the MI constraints prove stable, they might reduce the need for hand-tuned architecture choices in future unified models.
  • Extending the approach to later layers of the language model itself could further tighten the overall information budget.

Load-bearing premise

The chosen differentiable estimators for mutual information sufficiently enforce the intended constraints and produce tokens that simultaneously support semantic abstraction and visual detail.

What would settle it

Applying InfoTok to any of the three tested unified MLLMs and observing no gain or a net loss on standard image-understanding and image-generation benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.01554 by Bo Li, Lv Tang, Tianyi Zheng, Xingyu Li.

Figure 1: Performance comparison of three representative unified MLLMs (Show-o2 [10], OpenUni [8] and Harmon [4]) before …
Figure 2: Illustration of our information-regularized tokenization (InfoTok). (a) depicts a standard unified MLLM with shared …
Figure 3: We apply InfoTok to three representative shared-token unified MLLMs and observe consistent improvements in generation …
read the original abstract

Unified multimodal large language models (MLLMs) aim to unify image understanding and image generation within a single framework, where a shared visual tokenizer serves as the sole interface that maps high-dimensional images into a limited token budget for downstream multimodal reasoning and synthesis. However, existing shared-token designs are largely architecture-driven and lack an explicit criterion for what information should be preserved to simultaneously support semantic abstraction and visual detail. In this paper, we adopt a capacity-constrained perspective, viewing the shared tokenizer as a compute-bounded learner whose finite representational budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this view, we propose InfoTok, an information-regularized tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok explicitly controls information flow from images to shared tokens to multimodal outputs by imposing mutual-information (MI) constraints that enforce a principled trade-off between compression and task relevance, while also encouraging cross-modal consistency. Because MI is intractable for high-dimensional visual representations, we instantiate InfoTok with practical, differentiable dependence estimators, including a variational IB formulation and a Hilbert-Schmidt Independence Criterion (HSIC) based alternative. Integrated into three representative unified MLLMs without introducing any additional training data, InfoTok consistently improves both image understanding and generation performance. These results support information-regularized visual tokenization as a sound basis for token learning in unified MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes InfoTok, an information-regularized tokenization mechanism for shared visual tokenizers in unified MLLMs. Grounded in the Information Bottleneck principle, it imposes mutual-information constraints via differentiable estimators (variational IB and HSIC) to enforce a compression-relevance trade-off that supports both semantic abstraction and visual detail. The approach is integrated into three representative unified MLLMs without extra training data and is reported to yield consistent gains in image understanding and generation performance.

Significance. If the central claim holds, InfoTok offers a principled, architecture-agnostic way to regularize capacity-constrained tokenizers using information theory, addressing a gap in existing shared-token designs. The no-extra-data integration and dual-task improvements would be practically valuable for unified MLLMs. However, significance hinges on whether the surrogate MI estimators reliably enforce the intended constraints or merely provide generic regularization; without that verification the contribution reduces to an empirical regularization trick.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method): The central claim that variational IB and HSIC estimators enforce explicit MI constraints for a principled compression-relevance trade-off is load-bearing, yet the manuscript provides no quantitative verification (e.g., estimated I(image; tokens) bounds, ablation on estimator tightness, or comparison against true MI proxies) that these surrogates actually achieve the intended capacity control in high-dimensional visual feature regimes. If the estimators exhibit the known high variance/bias documented for HSIC and variational bounds on visual data, observed gains may not stem from the information-theoretic mechanism.
  2. [§4] §4 (experiments): The statement that InfoTok 'consistently improves' both understanding and generation across three models lacks reported effect sizes, statistical significance, or controls that isolate the MI regularization from other training differences. Without these, the cross-model claim cannot be evaluated as evidence for the IB-based design.
minor comments (2)
  1. [§3] Notation for the two MI estimators should be unified and clearly distinguished from the true mutual information I(·;·) to avoid implying exact enforcement.
  2. [Abstract] The abstract claims 'no additional training data' but does not specify whether the original MLLM training recipes were held exactly constant or whether hyper-parameters were re-tuned for the new regularizers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the need for stronger validation of the information-theoretic claims and more rigorous experimental reporting. We address each point below and will revise the manuscript to incorporate additional analyses and controls.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): The central claim that variational IB and HSIC estimators enforce explicit MI constraints for a principled compression-relevance trade-off is load-bearing, yet the manuscript provides no quantitative verification (e.g., estimated I(image; tokens) bounds, ablation on estimator tightness, or comparison against true MI proxies) that these surrogates actually achieve the intended capacity control in high-dimensional visual feature regimes. If the estimators exhibit the known high variance/bias documented for HSIC and variational bounds on visual data, observed gains may not stem from the information-theoretic mechanism.

    Authors: We agree that explicit verification of the MI constraints would strengthen the central claim. The variational IB formulation provides a tractable surrogate lower bound on the relevant mutual information terms by design, and HSIC serves as a kernel-based dependence measure whose consistency properties are established in the literature. However, we acknowledge the absence of direct quantitative checks such as reported I(image; tokens) estimates or tightness ablations in the current manuscript. In the revision we will add these: (i) estimated MI values computed via the variational bounds and HSIC on held-out visual features before and after regularization, (ii) sensitivity analysis varying the regularization coefficients to demonstrate the compression-relevance trade-off, and (iii) discussion of known estimator biases with empirical evidence that performance gains scale with the strength of the MI terms rather than generic regularization. revision: yes

  2. Referee: [§4] §4 (experiments): The statement that InfoTok 'consistently improves' both understanding and generation across three models lacks reported effect sizes, statistical significance, or controls that isolate the MI regularization from other training differences. Without these, the cross-model claim cannot be evaluated as evidence for the IB-based design.

    Authors: We accept that the current experimental section would benefit from greater statistical rigor. The reported improvements are based on standard benchmark metrics across three distinct unified MLLM architectures, but we did not include effect sizes, multiple-run statistics, or explicit ablations that remove only the IB/HSIC terms while keeping all other training elements fixed. In the revised version we will: (i) report mean improvements with standard deviations over at least three random seeds, (ii) include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the key metrics, (iii) add a dedicated ablation table isolating the contribution of the mutual-information regularizers, and (iv) provide controls that compare against equivalent-capacity models trained with non-information-theoretic regularizers to better attribute gains to the IB mechanism. revision: yes
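
As one concrete way to report the promised statistics, a minimal sketch of paired tests over per-seed benchmark scores using SciPy; the score arrays below are placeholders, not results from the paper.

    import numpy as np
    from scipy import stats

    # Placeholder per-seed scores for a baseline and its InfoTok-regularized variant.
    baseline = np.array([61.2, 60.8, 61.5, 61.0, 61.3])
    infotok = np.array([62.4, 62.1, 62.9, 62.0, 62.6])

    diff = infotok - baseline
    print(f"mean improvement: {diff.mean():.2f} +/- {diff.std(ddof=1):.2f}")

    # Paired t-test and Wilcoxon signed-rank test on matched per-seed pairs.
    t_stat, t_p = stats.ttest_rel(infotok, baseline)
    w_stat, w_p = stats.wilcoxon(infotok, baseline)
    print(f"paired t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")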

Circularity Check

0 steps flagged

No circularity; derivation adopts external IB principle with standard estimators

full rationale

The paper grounds InfoTok in the established Information Bottleneck principle and instantiates it via known differentiable estimators (variational IB and HSIC) because exact MI is intractable. No equations or steps are shown that reduce claimed performance gains to a fitted parameter renamed as prediction, a self-defined quantity, or a self-citation chain. The capacity constraint is enforced through external regularization objectives whose validity rests on prior literature rather than the present work's outputs. Empirical improvements on three unified MLLMs without extra data constitute independent evidence, not a tautology. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is visible in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that mutual information between high-dimensional images and tokens can be tractably approximated by variational IB and HSIC estimators, and that enforcing these approximations yields tokens that are simultaneously useful for understanding and generation.

axioms (1)
  • domain assumption: Mutual information for high-dimensional visual data can be reliably estimated via variational IB and HSIC without introducing bias that harms downstream task performance.
    Abstract states these as practical instantiations because exact MI is intractable.

pith-pipeline@v0.9.0 · 5572 in / 1310 out tokens · 26289 ms · 2026-05-16T08:06:57.711582+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 15 internal anchors

  1. [1]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation,

    C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo, “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” in CVPR. IEEE, 2025, pp. 12966–12977

  2. [2]

    Emerging Properties in Unified Multimodal Pretraining

    C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, S. Guang, and H. Fan, “Emerging properties in unified multimodal pretraining,” CoRR, vol. abs/2505.14683, 2025

  3. [3]

    VILA-U: a unified foundation model integrating visual understanding and generation,

    Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, S. Han, and Y. Lu, “VILA-U: a unified foundation model integrating visual understanding and generation,” in ICLR. OpenReview.net, 2025

  4. [4]

    Harmonizing visual representations for unified multimodal understanding and generation,

    S. Wu, W. Zhang, L. Xu, S. Jin, Z. Wu, Q. Tao, W. Liu, W. Li, and C. C. Loy, “Harmonizing visual representations for unified multimodal understanding and generation,” in ICCV. IEEE, 2025

  5. [5]

    Ming-univision: Joint image understanding and generation with a unified continuous tokenizer,

    Z. Huang, D. Zheng, C. Zou, R. Liu, X. Wang, K. Ji, W. Chai, J. Sun, L. Wang, Y. Lv et al., “Ming-univision: Joint image understanding and generation with a unified continuous tokenizer,” CoRR, vol. abs/2510.06590, 2025

  6. [6]

    Unitok: A unified tokenizer for visual generation and understanding,

    C. Ma, Y. Jiang, J. Wu, J. Yang, X. Yu, Z. Yuan, B. Peng, and X. Qi, “Unitok: A unified tokenizer for visual generation and understanding,” CoRR, vol. abs/2502.20321, 2025

  7. [7]

    Vision as a dialect: Unifying visual understanding and generation via text-aligned representations,

    J. Han, H. Chen, Y. Zhao, H. Wang, Q. Zhao, Z. Yang, H. He, X. Yue, and L. Jiang, “Vision as a dialect: Unifying visual understanding and generation via text-aligned representations,” CoRR, vol. abs/2506.18898, 2025

  8. [8]

    Openuni: A simple baseline for unified multimodal understanding and generation,

    S. Wu, Z. Wu, Z. Gong, Q. Tao, S. Jin, Q. Li, W. Li, and C. C. Loy, “Openuni: A simple baseline for unified multimodal understanding and generation,” CoRR, vol. abs/2505.23661, 2025

  9. [9]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    B. Lin, Z. Li, X. Cheng, Y. Niu, Y. Ye, X. He, S. Yuan, W. Yu, S. Wang, Y. Ge, Y. Pang, and L. Yuan, “Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation,” CoRR, vol. abs/2506.03147, 2025

  10. [10]

    Show-o2: Improved native unified multimodal models,

    J. Xie, Z. Yang, and M. Z. Shou, “Show-o2: Improved native unified multimodal models,” in NeurIPS, 2025

  11. [11]

    Unieval: Unified holistic evaluation for unified multimodal understanding and generation,

    Y. Li, H. Wang, Q. Zhang, B. Xiao, C. Hu, H. Wang, and X. Li, “Unieval: Unified holistic evaluation for unified multimodal understanding and generation,” CoRR, vol. abs/2505.10483, 2025

  12. [12]

    GQA: A new dataset for real-world visual reasoning and compositional question answering,

    D. A. Hudson and C. D. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR. IEEE, 2019, pp. 6700–6709

  13. [13]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” CoRR, vol. abs/2307.16125, 2023

  14. [14]

    Evaluating object hallucination in large vision-language models,

    Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J. Wen, “Evaluating object hallucination in large vision-language models,” in EMNLP. Association for Computational Linguistics, 2023, pp. 292–305

  15. [15]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and R. Ji, “MME: A comprehensive evaluation benchmark for multimodal large language models,” CoRR, vol. abs/2306.13394, 2023

  16. [16]

    Mm-vet: Evaluating large multimodal models for integrated capabilities,

    W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang, “Mm-vet: Evaluating large multimodal models for integrated capabilities,” in ICML. OpenReview.net, 2024

  17. [17]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

    X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in CVPR, 2024, pp. 9556–9567

  18. [18]

    Geneval: An object-focused framework for evaluating text-to-image alignment,

    D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object-focused framework for evaluating text-to-image alignment,” in NeurIPS, 2023

  19. [19]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation,

    J. Ye, D. Jiang, Z. Wang, L. Zhu, Z. Hu, Z. Huang, J. He, Z. Yan, J. Yu, H. Li, C. He, and W. Li, “Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation,” CoRR, vol. abs/2508.09987, 2025

  20. [20]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Y. Niu, M. Ning, M. Zheng, B. Lin, P. Jin, J. Liao, K. Ning, B. Zhu, and L. Yuan, “WISE: A world knowledge-informed semantic evaluation for text-to-image generation,” CoRR, vol. abs/2503.07265, 2025

  21. [21]

    The information bottleneck method

    N. Tishby, F. C. N. Pereira, and W. Bialek, “The information bottleneck method,” CoRR, vol. physics/0004057, 2000

  22. [22]

    Deep learning and the information bottleneck principle,

    N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in ITW. IEEE, 2015, pp. 1–5

  23. [23]

    Deep variational information bottleneck,

    A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in ICLR. OpenReview.net, 2017

  24. [24]

    Revisiting hilbert-schmidt information bottleneck for adversarial robustness,

    Z. Wang, T. Jian, A. Masoomi, S. Ioannidis, and J. G. Dy, “Revisiting hilbert-schmidt information bottleneck for adversarial robustness,” in NeurIPS, 2021, pp. 586–597

  25. [25]

    A survey on multimodal large language models,

    S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” National Science Review, vol. 11, no. 12, p. nwae403, 2024

  26. [26]

    A survey of multimodal learning: Methods, applications, and future,

    Y. Yuan, Z. Li, and B. Zhao, “A survey of multimodal learning: Methods, applications, and future,” ACM Comput. Surv., vol. 57, no. 7, pp. 167:1–167:34, 2025

  27. [27]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023

  28. [28]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models,

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” in ICLR. OpenReview.net, 2024

  29. [29]

    Instructblip: Towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” in NeurIPS, 2023

  30. [30]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 19730–19742

  31. [31]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” in NeurIPS, 2022, pp. 23716–23736

  32. [32]

    Querying as prompt: Parameter-efficient learning for multimodal language model,

    T. Liang, J. Huang, M. Kong, L. Chen, and Q. Zhu, “Querying as prompt: Parameter-efficient learning for multimodal language model,” in CVPR, 2024, pp. 26855–26865

  33. [33]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR. IEEE, 2022, pp. 10674–10685

  34. [34]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in ICCV. IEEE, 2023, pp. 4172–4182

  35. [35]

    Text-to-image diffusion models in generative AI: A survey,

    C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to-image diffusion models in generative AI: A survey,” CoRR, vol. abs/2303.07909, 2023

  36. [36]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models,

    J. Zhang, Q. Huang, J. Liu, X. Guo, and D. Huang, “Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models,” in CVPR. IEEE, 2025, pp. 23464–23473

  37. [37]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” CoRR, vol. abs/2204.06125, 2022

  38. [38]

    Editar: Unified conditional generation with autoregressive models,

    J. Mu, N. Vasconcelos, and X. Wang, “Editar: Unified conditional generation with autoregressive models,” in CVPR. IEEE, 2025, pp. 7899–7909

  39. [39]

    Dreamllm: Synergistic multimodal comprehension and creation,

    R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, X. Kong, X. Zhang, K. Ma, and L. Yi, “Dreamllm: Synergistic multimodal comprehension and creation,” in ICLR. OpenReview.net, 2024

  40. [40]

    Making llama SEE and draw with SEED tokenizer,

    Y. Ge, S. Zhao, Z. Zeng, Y. Ge, C. Li, X. Wang, and Y. Shan, “Making llama SEE and draw with SEED tokenizer,” in ICLR. OpenReview.net, 2024

  41. [41]

    Generative multimodal models are in-context learners,

    Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, and X. Wang, “Generative multimodal models are in-context learners,” in CVPR. IEEE, 2024, pp. 14398–14409

  42. [42]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” CoRR, vol. abs/2405.09818, 2024

  43. [43]

    Fast autoregressive models for continuous latent generation,

    T. Hang, J. Bao, F. Wei, and D. Chen, “Fast autoregressive models for continuous latent generation,” CoRR, vol. abs/2504.18391, 2025

  44. [44]

    Emu3: Next-Token Prediction is All You Need

    X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, Y. Zhao, Y. Ao, X. Min, T. Li, B. Wu, B. Zhao, B. Zhang, L. Wang, G. Liu, Z. He, X. Yang, J. Liu, Y. Lin, T. Huang, and Z. Wang, “Emu3: Next-token prediction is all you need,” CoRR, vol. abs/2409.18869, 2024

  45. [45]

    MUSE-VL: modeling unified VLM through semantic discrete encoding,

    R. Xie, C. Du, P. Song, and C. Liu, “MUSE-VL: modeling unified VLM through semantic discrete encoding,” CoRR, vol. abs/2411.17762, 2024

  46. [46]

    Growing visual generative capacity for pre-trained mllms,

    H. Wang, J. Han, Z. Yang, Q. Zhao, S. Lin, X. Yue, A. Shrivastava, Z. Yang, and H. Chen, “Growing visual generative capacity for pre-trained mllms,” CoRR, vol. abs/2510.01546, 2025

  47. [47]

    Toklip: Marry visual tokens to CLIP for multimodal comprehension and generation,

    H. Lin, T. Wang, Y. Ge, Y. Ge, Z. Lu, Y. Wei, Q. Zhang, Z. Sun, and Y. Shan, “Toklip: Marry visual tokens to CLIP for multimodal comprehension and generation,” CoRR, vol. abs/2505.05422, 2025

  48. [48]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    J. Chen, Z. Xu, X. Pan, Y. Hu, C. Qin, T. Goldstein, L. Huang, T. Zhou, S. Xie, S. Savarese, L. Xue, C. Xiong, and R. Xu, “Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset,” CoRR, vol. abs/2505.09568, 2025

  49. [49]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu, “Qwen2...

  50. [50]

    Every FLOP counts: Scaling a 300b mixture-of-experts LING LLM without premium gpus,

    L. Team, B. Zeng, C. Huang, C. Zhang, C. Tian, C. Chen, D. Jin, F. Yu, F. Zhu, F. Yuan, F. Wang, G. Wang, G. Zhai, H. Zhang, H. Li, J. Zhou, J. Liu, J. Fang, J. Ou, J. Hu, J. Luo, J. Zhang, J. Liu, J. Sha, J. Qian, J. Wu, J. Zhao, J. Li, J. Feng, J. Di, J. Xu, J. Yao, K. Xu, K. Du, L. Li, L. Liang, L. Yu, L. Tang, L. Ju, P. Xu, Q. Cui, S. Liu, S. Li, S. S...

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  52. [52]

    Show-o: One single transformer to unify multimodal understanding and generation,

    J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer to unify multimodal understanding and generation,” in ICLR. OpenReview.net, 2025

  53. [53]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-pro: Unified multimodal understanding and generation with data and model scaling,” CoRR, vol. abs/2501.17811, 2025

  54. [54]

    Transfusion: Predict the next token and diffuse images with one multi-modal model,

    C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy, “Transfusion: Predict the next token and diffuse images with one multi-modal model,” in ICLR. OpenReview.net, 2025

  55. [55]

    On variational bounds of mutual information,

    B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker, “On variational bounds of mutual information,” in ICML, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 5171–5180

  56. [56]

    Auto-encoding variational bayes,

    D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, Y. Bengio and Y. LeCun, Eds., 2014

  57. [57]

    Representation Learning with Contrastive Predictive Coding

    A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018

  58. [58]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE Computer Society, 2009, pp. 248–255

  59. [59]

    Densefusion-1m: Merging vision experts for comprehensive multimodal perception,

    X. Li, F. Zhang, H. Diao, Y. Wang, X. Wang, and L. Duan, “Densefusion-1m: Merging vision experts for comprehensive multimodal perception,” NeurIPS, pp. 18535–18556, 2024

  60. [60]

    Ovis-u1 technical report,

    G. Wang, S. Zhao, X. Zhang, L. Cao, P. Zhan, L. Duan, S. Lu, M. Fu, X. Chen, J. Zhao, Y. Li, and Q. Chen, “Ovis-u1 technical report,” CoRR, vol. abs/2506.23044, 2025

  61. [61]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...