VARGPT: unified understanding and generation in a visual autoregressive multimodal large language model

2025 · arXiv 2501.12327

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

HACK++ is a head-aware KV cache compression framework for VAR models that decouples current-scale attention from historical cache under adaptive per-head budgets to achieve near-lossless generation at 30% attention and 10% cache budgets.

MMaDA: Multimodal Large Diffusion Language Models

cs.CV · 2025-05-21 · unverdicted · novelty 6.0

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.

A Survey on Vision-Language-Action Models for Embodied AI

cs.RO · 2024-05-23 · unverdicted · novelty 6.0

This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.

MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

MEPA adds token-routed MoE and residual self-supervised feature alignment to VAR models, reporting better FID on ImageNet 256x256 with half the training epochs and fewer parameters than dense baselines.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

Semantic Generative Tuning applies segmentation-based generative proxies during post-training to align and improve both understanding and generation in unified multimodal models.

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

cs.CL · 2026-03-04 · unverdicted · novelty 5.0

The paper supplies a unified definition based on data flow and dynamic interaction plus a systematic taxonomy to organize fragmented work on streaming large language models.

citing papers explorer

Showing 5 of 5 citing papers after filters.

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling
cs.CV · 2026-06-06 · unverdicted · none · ref 33
MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts
cs.CV · 2026-07-01 · unverdicted · none · ref 65
Semantic Generative Tuning for Unified Multimodal Models
cs.CV · 2026-05-18 · unverdicted · none · ref 88 · 2 links
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
cs.CV · 2026-05-18 · unverdicted · none · ref 112
From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models
cs.CL · 2026-03-04 · unverdicted · none · ref 15
