LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
Canonical reference
Monoformer: One transformer for both diffusion and autoregression
Canonical reference. 80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
citing papers explorer
-
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
-
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
-
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
-
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications
The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.