Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Pith reviewed 2026-05-11 08:09 UTC · model grok-4.3
The pith
Janus-Pro improves multimodal understanding and text-to-image instruction following by optimizing training, expanding data, and scaling model size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Janus-Pro incorporates an optimized training strategy, expanded training data, and scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation.
What carries the argument
The unified Janus-Pro architecture that performs both multimodal understanding and text-to-image generation within one model, advanced through optimized training, data expansion, and increased scale.
If this is right
- Unified models can reach higher capability on both comprehension and generation tasks without separate specialized systems.
- Training data volume and model size continue to drive gains even in architectures that already combine vision and language.
- More stable text-to-image outputs reduce the need for post-processing or multiple sampling attempts.
- Public release of code and models allows direct testing and extension by others.
Where Pith is reading between the lines
- The pattern suggests scaling laws observed in language models may transfer to joint understanding-plus-generation systems.
- Similar gains could appear if the same three changes were applied to other base multimodal models.
- Longer-term, this points toward simpler AI pipelines where one model handles visual input and output without task-specific retraining.
Load-bearing premise
The reported performance gains come from the three specific changes of optimized training, expanded data, and larger model size rather than from differences in evaluation protocols, data details, or other unmentioned choices.
What would settle it
A controlled experiment that applies the three changes one at a time to the original Janus model and finds no meaningful gains on the same benchmarks would show the combined improvements are not responsible for the results.
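One way to picture the controlled experiment described above is as a grid of on/off settings for the three changes. The sketch below is a minimal illustration in Python. The factor names and the helpers `ablation_grid` and `one_at_a_time` are hypothetical labels chosen here, not identifiers from the Janus-Pro code release, and no training or evaluation is performed.

```python
from itertools import product

# Hypothetical factor names, taken from the three changes listed in the abstract.
FACTORS = ("optimized_training", "expanded_data", "larger_model")


def ablation_grid(factors=FACTORS):
    """Return every on/off combination of the factors as a list of dicts."""
    return [
        dict(zip(factors, flags))
        for flags in product((False, True), repeat=len(factors))
    ]


def one_at_a_time(factors=FACTORS):
    """Return only the configurations that enable exactly one factor."""
    return [cfg for cfg in ablation_grid(factors) if sum(cfg.values()) == 1]


if __name__ == "__main__":
    for cfg in ablation_grid():
        enabled = [name for name, on in cfg.items() if on]
        # The all-off configuration corresponds to the original Janus baseline.
        print(" + ".join(enabled) if enabled else "baseline (original Janus)")
```

Each configuration would still have to be trained and scored under a fixed evaluation protocol. The grid only enumerates the runs such a study would need: eight in total, three of which enable a single change.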
Original abstract
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Janus-Pro as an advancement over the prior Janus model by incorporating three changes: an optimized training strategy, expanded training data, and scaling to larger model size. It claims these yield significant improvements in multimodal understanding, text-to-image instruction following, and generation stability, with code and models released publicly.
Significance. If the gains are causally attributable to the three factors, the work provides empirical support for scaling benefits in unified multimodal models handling both understanding and generation. The public code release is a notable strength enabling reproducibility and community verification.
major comments (2)
- [Abstract] The central claim attributes performance advancements directly to the three listed changes (optimized training, expanded data, larger model), yet no controlled ablations are described that isolate each factor while holding the others and the evaluation protocol fixed. This undermines causal attribution, as differences in data curation, prompt formatting, or inference details could account for the deltas instead.
- [Experiments] (inferred from standard structure and abstract claims) Without within-paper ablation tables or results showing incremental gains from each change individually (e.g., base model with only expanded data), the magnitude of reported improvements cannot be confidently linked to the stated scaling factors rather than unmentioned implementation choices.
minor comments (1)
- Ensure all reported benchmark results include standard deviations or multiple-run statistics to support the 'significant advancements' claim.
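The multiple-run statistics requested above could be reported as a mean and sample standard deviation per benchmark. The following is a minimal sketch, assuming each run yields one scalar score. The function names are hypothetical, and the scores in the usage line are placeholder numbers, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for a list of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


def format_result(scores, digits=2):
    """Format per-run scores as 'mean ± std (n=runs)' for a results table."""
    m, s = summarize_runs(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f} (n={len(scores)})"


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    print(format_result([1.0, 2.0, 3.0]))
```

Reporting the number of runs next to the spread lets a reader judge whether a gap between two models exceeds run-to-run variation.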
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on clarifying the attribution of improvements in Janus-Pro. We address the major comments point by point below, with planned revisions to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim attributes performance advancements directly to the three listed changes (optimized training, expanded data, larger model), yet no controlled ablations are described that isolate each factor while holding the others and the evaluation protocol fixed. This undermines causal attribution, as differences in data curation, prompt formatting, or inference details could account for the deltas instead.
  Authors: We agree that the abstract phrasing could be interpreted as implying direct causal effects for each factor individually. The manuscript presents Janus-Pro as the result of applying all three changes together and reports performance relative to the original Janus and other baselines. No isolated ablations holding all other variables fixed are included. In revision we will rephrase the abstract to describe the improvements as resulting from the collective incorporation of the three changes, and we will add a brief discussion of this limitation in the Experiments section.
  Revision: yes
- Referee: [Experiments] (inferred from standard structure and abstract claims) Without within-paper ablation tables or results showing incremental gains from each change individually (e.g., base model with only expanded data), the magnitude of reported improvements cannot be confidently linked to the stated scaling factors rather than unmentioned implementation choices.
  Authors: The current Experiments section focuses on the final Janus-Pro model and its comparisons to prior work rather than incremental ablations of each scaling factor. We acknowledge that this leaves open the possibility that unmentioned implementation details contribute to the observed gains. We will expand the Experiments section with additional discussion of the cumulative nature of the changes and the practical constraints on running fully controlled large-scale ablations. We will also note that the public code and model release enables the community to perform further targeted experiments.
  Revision: partial
Circularity Check
No circularity; empirical gains rest on external benchmarks
The paper presents an empirical scaling study: it applies three engineering changes (optimized training, more data, larger model) to a prior architecture and measures performance on public multimodal understanding and generation benchmarks. No equations, first-principles derivations, or internal predictions are defined; the reported deltas are direct comparisons against external test sets and prior models. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear. The argument is therefore self-contained against reproducible external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities"
- IndisputableMonolith.Foundation.PhiForcing · phi_equation
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "decoupling visual encoding for multimodal understanding and generation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- MolSight: Molecular Property Prediction with Images
  Vision encoders on single 2D molecular images with a chemistry-informed curriculum achieve top or near-top results on 10 property prediction tasks at 80x lower FLOPs than multi-modal competitors.
- Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
  Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...
- Flow-GRPO: Training Flow Matching Models via Online RL
  Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
- VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
  VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.
- MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
  MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...
- AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture
  AgroTools is a new benchmark for tool-augmented multimodal agents in agriculture featuring 539 QA pairs, 1,097 images, five task families, and 14 tools, with evaluations showing major limitations in current models' to...
- Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
  Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
- RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
  RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
- Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
  HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.
- Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
  ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
- ImageAttributionBench: How Far Are We from Generalizable Attribution?
  ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
- G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
  G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
- Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
  INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
- Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
  Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
- UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
  UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
- Normalizing Trajectory Models
  NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
- Normalizing Trajectory Models
  NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
- Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
  XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
- Probing Visual Planning in Image Editing Models
  Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
- Exploring Spatial Intelligence from a Generative Perspective
  Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
- Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
  StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
- Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
  3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
- Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
  Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
  PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.
- A Unified and Controllable Framework for Layered Image Generation with Visual Effects
  LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
- Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
  GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
- dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
  dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
- AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
  AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
- Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
  Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
- Transfer between Modalities with MetaQueries
  MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
  Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
- S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack
  S⁴ST shows that dimensionally consistent scaling with low-redundancy complementary transforms achieves state-of-the-art data-free transferable targeted attacks by exploiting visual data's multi-scale nature.
- B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
  B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.
- RiT: Vanilla Diffusion Transformers Suffice in Representation Space
  A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
- UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
  UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
- Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
  Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...
- FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation
  FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.
- Semantic Generative Tuning for Unified Multimodal Models
  Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
- Lance: Unified Multimodal Modeling by Multi-Task Synergy
  Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keepin...
- LatentUMM: Dual Latent Alignment for Unified Multimodal Models
  LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.
- Medical Context Distorts Decisions in Clinical Vision Language Models
  Clinical VLMs over-rely on text modality, irrelevant clinical history, and prompt wording when making chest x-ray decisions on MIMIC-CXR data.
- HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
  HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.
- Latent Action Control for Reasoning-Guided Unified Image Generation
  Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
- Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
  Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.
- Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
  A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.
- From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
  A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
- DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
  DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
- UAM: A Dual-Stream Perspective on Forgetting in VLA Training
  UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation t...
- Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
  CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
- Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
  CLVR couples verified logical planning with pixel diffusion, uses proxy reinforcement learning on distilled histories, and merges weights to cut inference to 4 NFEs while outperforming open-source T2I models on comple...
- InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
  InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
- G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
  G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and e...
- Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
  V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
  Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
- HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
  A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
- Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
  Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
  SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
- STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
  STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
Reference graph
Works this paper leans on
-
[1]
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P . Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A fron- tier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [2]
-
[3]
X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P . Luo, H. Lu, et al. Pixart- 𝑎𝑙 𝑝ℎ𝑎: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023
work page internal anchor Pith review arXiv 2023
- [5]
-
[6]
X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023
work page internal anchor Pith review arXiv 2023
-
[7]
X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024
work page internal anchor Pith review arXiv 2024
-
[8]
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P . Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[9]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchi- cal image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
work page 2009
- [10]
-
[11]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
P . Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rom- bach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206
work page internal anchor Pith review arXiv 2024
-
[12]
C. Fu, P . Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024
work page internal anchor Pith review arXiv 2024
- [14]
-
[15]
Hai-llm: Efficient and lightweight training tool for large models, 2023
High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[16]
X. Hu, R. Wang, Y. Fang, B. Fu, P . Cheng, and G. Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024
work page internal anchor Pith review arXiv 2024
-
[17]
D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019
work page 2019
- [18]
-
[19]
H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, and et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/id efics
work page 2023
-
[20]
H. Laurençon, A. Marafioti, V . Sanh, and L. Tronchon. Building and better understanding vision-language models: insights and future directions., 2024
work page 2024
-
[21]
B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023
work page internal anchor Pith review arXiv 2023
-
[22]
D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024
work page internal anchor Pith review arXiv 2024
-
[23]
Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023
work page internal anchor Pith review arXiv 2023
- [24]
-
[25]
Z. Li, J. Zhang, Q. Lin, J. Xiong, Y. Long, X. Deng, Y. Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024. 10
work page internal anchor Pith review arXiv 2024
-
[26]
H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024
work page 2024
-
[27]
H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024
work page 2024
-
[28]
H. Liu, W. Yan, M. Zaharia, and P . Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024
work page internal anchor Pith review arXiv 2024
-
[29]
Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mm- bench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023
work page internal anchor Pith review arXiv 2023
-
[30]
Y. Ma, X. Liu, X. Chen, W. Liu, C. Wu, Z. Wu, Z. Pan, Z. Xie, H. Zhang, X. yu, L. Zhao, Y. Wang, J. Liu, and C. Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024
work page 2024
-
[31]
mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15m, 2024
-
[32]
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
- [33]
- [34]
-
[35]
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022
-
[36]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. 2022
-
[37]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
-
[38]
P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024
-
[39]
Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023
-
[40]
C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024
-
[41]
G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
-
[42]
S. Tong, D. Fan, J. Zhu, Y. Xiong, X. Chen, K. Sinha, M. Rabbat, Y. LeCun, S. Xie, and Z. Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024
-
[43]
Vivym. Midjourney prompts dataset. https://huggingface.co/datasets/vivym/midjourney-prompts, 2023
- [44]
-
[45]
X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
-
[46]
C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024
- [47]
-
[48]
Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024
-
[49]
Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024
-
[50]
J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024
-
[51]
W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023
-
[52]
X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
-
[53]
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023
- [54]
-
[55]
C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024
- [56]
- [57]