Canonical reference

Monoformer: One transformer for both diffusion and autoregression

Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang · 2024 · arXiv 2409.16280

Canonical reference. 80% of citing Pith papers cite this work as background.

7 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 7 citing papers

citation-role summary

background 4 baseline 1

citation-polarity summary

background 4 baseline 1

representative citing papers

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

cs.CV · 2026-02-12 · unverdicted · novelty 6.0

LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

cs.LG · 2025-05-29 · unverdicted · novelty 6.0

Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications

eess.SP · 2025-11-11 · unverdicted · novelty 3.0

The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

cs.AI · 2025-01-29 · conditional · novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

citing papers explorer

Showing 7 of 7 citing papers.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens cs.CV · 2026-02-12 · unverdicted · none · ref 92
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model cs.LG · 2025-05-29 · unverdicted · none · ref 33
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model cs.CV · 2025-03-13 · unverdicted · none · ref 68
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 110
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 146
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications eess.SP · 2025-11-11 · unverdicted · none · ref 112
The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling cs.AI · 2025-01-29 · conditional · none · ref 54
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Monoformer: One transformer for both diffusion and autoregression

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer