Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chong Ruan; Wen Liu; Xiaokang Chen; Xingchao Liu; Xingkai Yu; Zhenda Xie; Zhiyu Wu; Zizheng Pan

arxiv: 2501.17811 · v1 · submitted 2025-01-29 · 💻 cs.AI · cs.CL· cs.CV

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen , Zhiyu Wu , Xingchao Liu , Zizheng Pan , Wen Liu , Zhenda Xie , Xingkai Yu , Chong Ruan This is my paper

Pith reviewed 2026-05-11 08:09 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CV

keywords multimodal understandingtext-to-image generationmodel scalingdata scalingunified multimodal modelsinstruction followingtraining optimization

0 comments

The pith

Janus-Pro improves multimodal understanding and text-to-image instruction following by optimizing training, expanding data, and scaling model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Janus-Pro as an updated version of the prior Janus model for handling both understanding of image-text inputs and generation of images from text. It applies three changes—an optimized training strategy, larger volumes of training data, and a bigger overall model—to produce stronger results on understanding benchmarks and on tasks where generated images must match detailed instructions. The work also reports more consistent image outputs without as many artifacts or variations. A reader would care because the improvements come from straightforward extensions rather than new architectural inventions, showing a direct path to stronger single-model systems that both interpret and create visual content.

Core claim

Janus-Pro incorporates an optimized training strategy, expanded training data, and scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation.

What carries the argument

The unified Janus-Pro architecture that performs both multimodal understanding and text-to-image generation within one model, advanced through optimized training, data expansion, and increased scale.

If this is right

Unified models can reach higher capability on both comprehension and generation tasks without separate specialized systems.
Training data volume and model size continue to drive gains even in architectures that already combine vision and language.
More stable text-to-image outputs reduce the need for post-processing or multiple sampling attempts.
Public release of code and models allows direct testing and extension by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pattern suggests scaling laws observed in language models may transfer to joint understanding-plus-generation systems.
Similar gains could appear if the same three changes were applied to other base multimodal models.
Longer-term, this points toward simpler AI pipelines where one model handles visual input and output without task-specific retraining.

Load-bearing premise

The reported performance gains come from the three specific changes of optimized training, expanded data, and larger model size rather than from differences in evaluation protocols, data details, or other unmentioned choices.

What would settle it

A controlled experiment that applies the three changes one at a time to the original Janus model and finds no meaningful gains on the same benchmarks would show the combined improvements are not responsible for the results.

read the original abstract

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Janus-Pro as an advancement over the prior Janus model by incorporating three changes: an optimized training strategy, expanded training data, and scaling to larger model size. It claims these yield significant improvements in multimodal understanding, text-to-image instruction following, and generation stability, with code and models released publicly.

Significance. If the gains are causally attributable to the three factors, the work provides empirical support for scaling benefits in unified multimodal models handling both understanding and generation. The public code release is a notable strength enabling reproducibility and community verification.

major comments (2)

[Abstract] Abstract: The central claim attributes performance advancements directly to the three listed changes (optimized training, expanded data, larger model), yet no controlled ablations are described that isolate each factor while holding the others and the evaluation protocol fixed. This undermines causal attribution, as differences in data curation, prompt formatting, or inference details could account for the deltas instead.
[Experiments] Experiments section (inferred from standard structure and abstract claims): Without within-paper ablation tables or results showing incremental gains from each change individually (e.g., base model with only expanded data), the magnitude of reported improvements cannot be confidently linked to the stated scaling factors rather than unmentioned implementation choices.

minor comments (1)

Ensure all reported benchmark results include standard deviations or multiple-run statistics to support the 'significant advancements' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the attribution of improvements in Janus-Pro. We address the major comments point by point below, with planned revisions to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim attributes performance advancements directly to the three listed changes (optimized training, expanded data, larger model), yet no controlled ablations are described that isolate each factor while holding the others and the evaluation protocol fixed. This undermines causal attribution, as differences in data curation, prompt formatting, or inference details could account for the deltas instead.

Authors: We agree that the abstract phrasing could be interpreted as implying direct causal effects for each factor individually. The manuscript presents Janus-Pro as the result of applying all three changes together and reports performance relative to the original Janus and other baselines. No isolated ablations holding all other variables fixed are included. In revision we will rephrase the abstract to describe the improvements as resulting from the collective incorporation of the three changes, and we will add a brief discussion of this limitation in the Experiments section. revision: yes
Referee: [Experiments] Experiments section (inferred from standard structure and abstract claims): Without within-paper ablation tables or results showing incremental gains from each change individually (e.g., base model with only expanded data), the magnitude of reported improvements cannot be confidently linked to the stated scaling factors rather than unmentioned implementation choices.

Authors: The current Experiments section focuses on the final Janus-Pro model and its comparisons to prior work rather than incremental ablations of each scaling factor. We acknowledge that this leaves open the possibility that unmentioned implementation details contribute to the observed gains. We will expand the Experiments section with additional discussion of the cumulative nature of the changes and the practical constraints on running fully controlled large-scale ablations. We will also note that the public code and model release enables the community to perform further targeted experiments. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical gains rest on external benchmarks

full rationale

The paper presents an empirical scaling study: it applies three engineering changes (optimized training, more data, larger model) to a prior architecture and measures performance on public multimodal understanding and generation benchmarks. No equations, first-principles derivations, or internal predictions are defined; the reported deltas are direct comparisons against external test sets and prior models. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear. The argument is therefore self-contained against reproducible external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard machine-learning scaling practices and publicly available benchmarks; no new free parameters, axioms, or invented entities are introduced beyond the model architecture inherited from prior work.

pith-pipeline@v0.9.0 · 5399 in / 1117 out tokens · 87155 ms · 2026-05-11T08:09:39.433361+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities
IndisputableMonolith.Foundation.PhiForcing phi_equation unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

decoupling visual encoding for multimodal understanding and generation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MolSight: Molecular Property Prediction with Images
cs.CV 2026-05 unverdicted novelty 8.0

Vision encoders on single 2D molecular images with a chemistry-informed curriculum achieve top or near-top results on 10 property prediction tasks at 80x lower FLOPs than multi-modal competitors.
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
cs.LG 2025-12 conditional novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
cs.CV 2026-05 unverdicted novelty 7.0

VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...
AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture
cs.CV 2026-05 unverdicted novelty 7.0

AgroTools is a new benchmark for tool-augmented multimodal agents in agriculture featuring 539 QA pairs, 1,097 images, five task families, and 14 tools, with evaluations showing major limitations in current models' to...
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV 2026-05 unverdicted novelty 7.0

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
cs.CV 2026-05 conditional novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
cs.CV 2026-05 conditional novelty 7.0

HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
cs.CR 2026-05 conditional novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
ImageAttributionBench: How Far Are We from Generalizable Attribution?
cs.CV 2026-05 unverdicted novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 7.0

G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
cs.MM 2026-05 unverdicted novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
Normalizing Trajectory Models
cs.CV 2026-05 unverdicted novelty 7.0

NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
Normalizing Trajectory Models
cs.CV 2026-05 unverdicted novelty 7.0

NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
Probing Visual Planning in Image Editing Models
cs.CV 2026-04 unverdicted novelty 7.0

Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
Exploring Spatial Intelligence from a Generative Perspective
cs.CV 2026-04 unverdicted novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
cs.CV 2026-04 unverdicted novelty 7.0

StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
cs.CV 2026-04 unverdicted novelty 7.0

3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
cs.CV 2026-02 unverdicted novelty 7.0

PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
cs.CV 2026-01 unverdicted novelty 7.0

LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
cs.CV 2026-01 unverdicted novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
cs.CV 2025-12 conditional novelty 7.0

dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
cs.CV 2025-11 unverdicted novelty 7.0

AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
cs.LG 2025-09 conditional novelty 7.0

Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
Transfer between Modalities with MetaQueries
cs.CV 2025-04 unverdicted novelty 7.0

MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
cs.CV 2025-03 unverdicted novelty 7.0

Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack
cs.CR 2024-10 unverdicted novelty 7.0

S⁴ST shows that dimensionally consistent scaling with low-redundancy complementary transforms achieves state-of-the-art data-free transferable targeted attacks by exploiting visual data's multi-scale nature.
B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
cs.CV 2026-05 conditional novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV 2026-05 unverdicted novelty 6.0

Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation
cs.CV 2026-05 unverdicted novelty 6.0

FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.
Semantic Generative Tuning for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
Lance: Unified Multimodal Modeling by Multi-Task Synergy
cs.CV 2026-05 unverdicted novelty 6.0

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keepin...
LatentUMM: Dual Latent Alignment for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.
Medical Context Distorts Decisions in Clinical Vision Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Clinical VLMs over-rely on text modality, irrelevant clinical history, and prompt wording when making chest x-ray decisions on MIMIC-CXR data.
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
cs.CV 2026-05 unverdicted novelty 6.0

HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.
Latent Action Control for Reasoning-Guided Unified Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
cs.AI 2026-05 unverdicted novelty 6.0

Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.
Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV 2026-05 unverdicted novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
cs.CR 2026-05 unverdicted novelty 6.0

DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
UAM: A Dual-Stream Perspective on Forgetting in VLA Training
cs.CV 2026-05 unverdicted novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation t...
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

CLVR couples verified logical planning with pixel diffusion, uses proxy reinforcement learning on distilled histories, and merges weights to cut inference to 4 NFEs while outperforming open-source T2I models on comple...
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
cs.CV 2026-05 conditional novelty 6.0

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
cs.CV 2026-05 conditional novelty 6.0

G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and e...
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
cs.CV 2026-05 unverdicted novelty 6.0

A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
cs.AI 2026-05 unverdicted novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
cs.CV 2026-05 unverdicted novelty 6.0

STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 120 Pith papers · 29 internal anchors

[1]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P . Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A fron- tier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Betker, G

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

work page 2023
[3]

X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P . Luo, H. Lu, et al. Pixart- 𝑎𝑙 𝑝ℎ𝑎: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review arXiv 2023
[5]

J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P . Luo, H. Lu, and Z. Li. PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024

work page arXiv 2024
[6]

X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

work page internal anchor Pith review arXiv 2023
[7]

X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024

work page internal anchor Pith review arXiv 2024
[8]

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P . Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

work page 2023
[9]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchi- cal image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[10]

R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dream- llm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023. 9

work page arXiv 2023
[11]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

P . Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rom- bach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206

work page internal anchor Pith review arXiv 2024
[12]

C. Fu, P . Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page internal anchor Pith review arXiv 2024
[14]

Ghosh, H

D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[15]

Hai-llm: Efficient and lightweight training tool for large models, 2023

High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

work page 2023
[16]

X. Hu, R. Wang, Y. Fang, B. Fu, P . Cheng, and G. Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review arXiv 2024
[17]

D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

work page 2019
[18]

Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, B. Chen, C. Lei, A. Liu, C. Song, X. Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023

work page arXiv 2023
[19]

Laurençon, D

H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, and et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/id efics

work page 2023
[20]

Laurençon, A

H. Laurençon, A. Marafioti, V . Sanh, and L. Tronchon. Building and better understanding vision-language models: insights and future directions., 2024

work page 2024
[21]

B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

work page internal anchor Pith review arXiv 2023
[22]

D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

work page internal anchor Pith review arXiv 2024
[23]

Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

work page internal anchor Pith review arXiv 2023
[24]

Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P . Wang. Dual diffusion for unified image generation and understanding. arXiv preprint arXiv:2501.00289, 2024

work page arXiv 2024
[25]

Z. Li, J. Zhang, Q. Lin, J. Xiong, Y. Long, X. Deng, Y. Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024. 10

work page internal anchor Pith review arXiv 2024
[26]

H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[27]

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

work page 2024
[28]

H. Liu, W. Yan, M. Zaharia, and P . Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

work page internal anchor Pith review arXiv 2024
[29]

Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mm- bench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

work page internal anchor Pith review arXiv 2023
[30]

Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, X. yu, L. Zhao, Y. Wang, J. Liu, and C. Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024

work page 2024
[31]

Yfcc-huggingface

mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15 m, 2024

work page 2024
[32]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rom- bach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Podell, Z

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rom- bach. SDXL: Improving latent diffusion models for high-resolution image synthesis. 2024

work page 2024
[34]

L. Qu, H. Zhang, Y. Liu, X. Wang, Y. Jiang, Y. Gao, H. Ye, D. K. Du, Z. Yuan, and X. Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

work page arXiv 2024
[35]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P . Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. 2022

work page 2022
[37]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[38]

P . Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P . Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review arXiv 2024
[39]

Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023

work page internal anchor Pith review arXiv 2023
[40]

C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review arXiv 2024
[41]

G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

S. Tong, D. Fan, J. Zhu, Y. Xiong, X. Chen, K. Sinha, M. Rabbat, Y. LeCun, S. Xie, and Z. Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

work page internal anchor Pith review arXiv 2024
[43]

Midjourney prompts dataset

Vivym. Midjourney prompts dataset. https://huggingface.co/datasets/vivym/ midjourney-prompts, 2023. Accessed: [Insert Date of Access, e.g., 2023-10-15]

work page 2023
[44]

C. Wang, G. Lu, J. Yang, R. Huang, J. Han, L. Hou, W. Zhang, and H. Xu. Illume: Il- luminating your llms to see, draw, and self-enhance. arXiv preprint arXiv:2412.06673, 2024

work page arXiv 2024
[45]

X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review arXiv 2024
[46]

C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

work page internal anchor Pith review arXiv 2024
[47]

S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023

work page arXiv 2023
[48]

Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

work page internal anchor Pith review arXiv 2024
[49]

Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal un- derstanding. arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review arXiv 2024
[50]

J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review arXiv 2024
[51]

W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review arXiv 2023
[52]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

work page 2024
[53]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

work page 2023
[54]

C. Zhao, Y. Song, W. Wang, H. Feng, E. Ding, Y. Sun, X. Xiao, and J. Wang. Monoformer: One transformer for both diffusion and autoregression. arXiv preprint arXiv:2409.16280, 2024

work page arXiv 2024
[55]

C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettle- moyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

work page internal anchor Pith review arXiv 2024
[56]

Y. Zhu, M. Zhu, N. Liu, Z. Ou, X. Mou, and J. Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024. 12

work page arXiv 2024
[57]

L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y. Wang, Z. Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. arXiv preprint arXiv:2406.18583, 2024. 13

work page arXiv 2024

[1] [1]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P . Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A fron- tier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Betker, G

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

work page 2023

[3] [3]

X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P . Luo, H. Lu, et al. Pixart- 𝑎𝑙 𝑝ℎ𝑎: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review arXiv 2023

[5] [5]

J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P . Luo, H. Lu, and Z. Li. PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024

work page arXiv 2024

[6] [6]

X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

work page internal anchor Pith review arXiv 2023

[7] [7]

X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024

work page internal anchor Pith review arXiv 2024

[8] [8]

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P . Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

work page 2023

[9] [9]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchi- cal image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009

[10] [10]

R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dream- llm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023. 9

work page arXiv 2023

[11] [11]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

P . Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rom- bach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206

work page internal anchor Pith review arXiv 2024

[12] [12]

C. Fu, P . Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page internal anchor Pith review arXiv 2024

[14] [14]

Ghosh, H

D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[15] [15]

Hai-llm: Efficient and lightweight training tool for large models, 2023

High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

work page 2023

[16] [16]

X. Hu, R. Wang, Y. Fang, B. Fu, P . Cheng, and G. Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review arXiv 2024

[17] [17]

D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

work page 2019

[18] [18]

Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, B. Chen, C. Lei, A. Liu, C. Song, X. Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023

work page arXiv 2023

[19] [19]

Laurençon, D

H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, and et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/id efics

work page 2023

[20] [20]

Laurençon, A

H. Laurençon, A. Marafioti, V . Sanh, and L. Tronchon. Building and better understanding vision-language models: insights and future directions., 2024

work page 2024

[21] [21]

B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

work page internal anchor Pith review arXiv 2023

[22] [22]

D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

work page internal anchor Pith review arXiv 2024

[23] [23]

Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

work page internal anchor Pith review arXiv 2023

[24] [24]

Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P . Wang. Dual diffusion for unified image generation and understanding. arXiv preprint arXiv:2501.00289, 2024

work page arXiv 2024

[25] [25]

Z. Li, J. Zhang, Q. Lin, J. Xiong, Y. Long, X. Deng, Y. Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024. 10

work page internal anchor Pith review arXiv 2024

[26] [26]

H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024

[27] [27]

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

work page 2024

[28] [28]

H. Liu, W. Yan, M. Zaharia, and P . Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

work page internal anchor Pith review arXiv 2024

[29] [29]

Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mm- bench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

work page internal anchor Pith review arXiv 2023

[30] [30]

Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, X. yu, L. Zhao, Y. Wang, J. Liu, and C. Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024

work page 2024

[31] [31]

Yfcc-huggingface

mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15 m, 2024

work page 2024

[32] [32]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rom- bach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Podell, Z

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rom- bach. SDXL: Improving latent diffusion models for high-resolution image synthesis. 2024

work page 2024

[34] [34]

L. Qu, H. Zhang, Y. Liu, X. Wang, Y. Jiang, Y. Gao, H. Ye, D. K. Du, Z. Yuan, and X. Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

work page arXiv 2024

[35] [35]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P . Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[36] [36]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. 2022

work page 2022

[37] [37]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[38] [38]

P . Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P . Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review arXiv 2024

[39] [39]

Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023

work page internal anchor Pith review arXiv 2023

[40] [40]

C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review arXiv 2024

[41] [41]

G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

S. Tong, D. Fan, J. Zhu, Y. Xiong, X. Chen, K. Sinha, M. Rabbat, Y. LeCun, S. Xie, and Z. Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024

work page internal anchor Pith review arXiv 2024

[43] [43]

Midjourney prompts dataset

Vivym. Midjourney prompts dataset. https://huggingface.co/datasets/vivym/ midjourney-prompts, 2023. Accessed: [Insert Date of Access, e.g., 2023-10-15]

work page 2023

[44] [44]

C. Wang, G. Lu, J. Yang, R. Huang, J. Han, L. Hou, W. Zhang, and H. Xu. Illume: Il- luminating your llms to see, draw, and self-enhance. arXiv preprint arXiv:2412.06673, 2024

work page arXiv 2024

[45] [45]

X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review arXiv 2024

[46] [46]

C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

work page internal anchor Pith review arXiv 2024

[47] [47]

S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023

work page arXiv 2023

[48] [48]

Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

work page internal anchor Pith review arXiv 2024

[49] [49]

Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal un- derstanding. arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review arXiv 2024

[50] [50]

J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review arXiv 2024

[51] [51]

W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review arXiv 2023

[52] [52]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

work page 2024

[53] [53]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

work page 2023

[54] [54]

C. Zhao, Y. Song, W. Wang, H. Feng, E. Ding, Y. Sun, X. Xiao, and J. Wang. Monoformer: One transformer for both diffusion and autoregression. arXiv preprint arXiv:2409.16280, 2024

work page arXiv 2024

[55] [55]

C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettle- moyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

work page internal anchor Pith review arXiv 2024

[56] [56]

Y. Zhu, M. Zhu, N. Liu, Z. Ou, X. Mou, and J. Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024. 12

work page arXiv 2024

[57] [57]

L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y. Wang, Z. Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. arXiv preprint arXiv:2406.18583, 2024. 13

work page arXiv 2024