Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a

Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, et al · arXiv 2603.06569

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

citing papers explorer

Showing 2 of 2 citing papers.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation cs.CV · 2026-04-27 · unverdicted · none · ref 55 · 2 links
Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model cs.CV · 2026-04-22 · unverdicted · none · ref 43
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

Penguin-vl: Exploring the efficiency limits of vlm with llm-based vision encoders.arXiv preprint arXiv:2603.06569, 2026a

fields

years

verdicts

representative citing papers

citing papers explorer