Factor- ized visual tokenization and generation

Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou · 2024 · arXiv 2411.16681

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

cs.CV · 2025-03-18 · unverdicted · novelty 7.0

DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

Hallucination of Multimodal Large Language Models: A Survey

cs.CV · 2024-04-29 · accept · novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

citing papers explorer

Showing 3 of 3 citing papers.

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies cs.CV · 2025-03-18 · unverdicted · none · ref 3
DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens cs.CV · 2026-05-18 · unverdicted · none · ref 5
WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 5
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

Factor- ized visual tokenization and generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer