VARGPT: unified understanding and generation in a visual autoregressive multimodal large language model

Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou · 2025 · arXiv 2501.12327

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

HACK++ is a head-aware KV cache compression framework for VAR models that decouples current-scale attention from historical cache under adaptive per-head budgets to achieve near-lossless generation at 30% attention and 10% cache budgets.

Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

A self-evolving framework with proposer-solver-generator roles, Solver Token Entropy, and multi-scale internal evaluation improves unified LMMs on understanding and generation tasks using only self-derived consistency signals.

MMaDA: Multimodal Large Diffusion Language Models

cs.CV · 2025-05-21 · unverdicted · novelty 6.0

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.

A Survey on Vision-Language-Action Models for Embodied AI

cs.RO · 2024-05-23 · unverdicted · novelty 6.0

This is the first survey on vision-language-action models, providing a taxonomy organized along three lines of research, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.

MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

MEPA adds token-routed MoE and residual self-supervised feature alignment to VAR models, reporting better FID on ImageNet 256x256 with half the training epochs and fewer parameters than dense baselines.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

Semantic Generative Tuning applies segmentation-based generative proxies during post-training to align and improve both understanding and generation in unified multimodal models.

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

cs.CL · 2026-03-04 · unverdicted · novelty 5.0

The paper supplies a unified definition based on data flow and dynamic interaction plus a systematic taxonomy to organize fragmented work on streaming large language models.

citing papers explorer

Showing 1 of 1 citing paper after filters.

MMaDA: Multimodal Large Diffusion Language Models

cs.CV · 2025-05-21 · unverdicted · none · ref 59

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.
