Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fei Ding; Fengyi Fu; Hao Li; Jianzhu Guo; Mengqi Huang; Qian He; Shaojin Wu; Yinghang Song; Yongdong Zhang; Yufei Huo

Review: 1 major objection · 2 minor · cited by 3

Lance unifies image and video understanding and generation via staged multi-task training on a dual-stream MoE without capacity scaling.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-21 07:53 UTC pith:WCWSXXVL

Load-bearing objection: Lance combines dual-stream MoE on interleaved sequences with modality-aware RoPE and staged adaptive training to target unified image-video understanding and generation without heavy scaling, but the abstract supplies no datasets, metrics, or ablations to back the outperformance claim.

arXiv 2605.18678 v2 · pith:WCWSXXVL · submitted 2026-05-18 · cs.CV cs.AI

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

classification: cs.CV, cs.AI

keywords: unified multimodal model, multi-task training, mixture of experts, image generation, video generation, multimodal understanding, rotary positional encoding, staged training

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lance as a lightweight unified model for multimodal understanding, generation, and editing of images and videos. It relies on collaborative multi-task training rather than larger models or text-dominant designs. A dual-stream mixture-of-experts processes shared interleaved sequences to learn context jointly while keeping separate pathways for understanding versus generation. Modality-aware rotary positional encoding reduces token interference and aids cross-task alignment. Staged training with capability-oriented objectives and adaptive data scheduling then delivers stronger generation performance than prior open-source unified models while preserving understanding strength.

Core claim

Lance is trained from scratch with a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences. This enables joint context learning while decoupling pathways for understanding and generation. Modality-aware rotary positional encoding mitigates interference among heterogeneous visual tokens and boosts alignment. A staged multi-task paradigm with adaptive scheduling strengthens both semantic comprehension and visual generation, yielding substantial outperformance over existing open-source unified models on image and video generation while retaining multimodal understanding.
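To make the core claim concrete, the following is a minimal sketch of how a dual-stream block could keep joint context while decoupling pathways. It assumes hard routing by a per-token role tag ("und" or "gen"), a projection-free single-head attention, and toy expert functions. None of these details come from the paper, whose routing rule is not described in the material reviewed here.

```python
import math

def shared_attention(tokens):
    """Single-head self-attention over the whole interleaved sequence.

    Simplification (assumption): no learned projections, so each token
    vector acts as its own query, key, and value.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out

def dual_stream_block(tokens, roles, experts):
    """Mix all tokens jointly, then send each token through its own stream."""
    mixed = shared_attention(tokens)
    return [experts[role](vec) for vec, role in zip(mixed, roles)]

# Hypothetical stand-ins for the two feed-forward pathways.
experts = {
    "und": lambda v: [2.0 * x for x in v],  # understanding stream
    "gen": lambda v: [x + 1.0 for x in v],  # generation stream
}

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
roles = ["und", "und", "gen"]
outputs = dual_stream_block(tokens, roles, experts)
```

In this toy setup every token attends to the whole interleaved sequence, but only tokens tagged "gen" pass through the generation pathway.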

What carries the argument

A dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, paired with modality-aware rotary positional encoding and staged multi-task training with adaptive data scheduling.

Load-bearing premise

That dual-stream MoE processing of interleaved sequences plus staged multi-task training with adaptive scheduling will produce superior generation performance without any increase in model capacity.

What would settle it

A controlled comparison in which a single-stream baseline of equal or greater parameter count, trained without the staged adaptive schedule, matches or exceeds Lance on standard image and video generation metrics while preserving understanding scores.

If this is right

Substantially outperforms existing open-source unified models on image and video generation benchmarks.
Preserves strong multimodal understanding capabilities alongside the generation gains.
Enables joint context modeling across modalities while decoupling understanding and generation pathways.
Reduces interference among different visual token types through modality-aware encoding.
Supports editing tasks in addition to understanding and generation within the same model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Task synergy during training may substitute for raw parameter scaling in other multimodal settings.
The same staged scheduling principle could be tested on non-MoE backbones to isolate its contribution.
Decoupled pathways may affect how capabilities evolve when the model is later fine-tuned on new domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Lance combines dual-stream MoE on interleaved sequences with modality-aware RoPE and staged adaptive training to target unified image-video understanding and generation without heavy scaling, but the abstract supplies no datasets, metrics, or ablations to back the outperformance claim.

The paper's core move is to train a lightweight unified model from scratch for images and videos using a dual-stream mixture-of-experts architecture on shared interleaved sequences. This setup aims to keep joint context modeling while separating the understanding and generation pathways. The authors add modality-aware rotary positional encoding to reduce interference between different visual tokens, and use a staged multi-task schedule with adaptive data balancing to strengthen both sides. The claim is that this beats existing open-source unified models on generation quality while holding up on understanding tasks.
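As an illustration of what a staged schedule with adaptive data balancing could look like, the sketch below gates tasks by stage and samples each active task in proportion to a softmax over its recent loss. The stage names, task names, loss values, and the softmax rule are all assumptions made for illustration; the paper's actual schedule is not specified in the material reviewed here.

```python
import math

# Hypothetical stages: which tasks are active in each stage.
STAGES = {
    "stage1_alignment": ["image_understanding", "image_generation"],
    "stage2_joint": ["image_understanding", "image_generation", "video_generation"],
}

def sampling_weights(recent_loss, active_tasks, temperature=1.0):
    """Softmax over recent per-task losses, so lagging tasks are sampled more."""
    logits = [recent_loss[task] / temperature for task in active_tasks]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {task: e / total for task, e in zip(active_tasks, exps)}

# Made-up running losses, used only to exercise the function.
recent_loss = {
    "image_understanding": 0.4,
    "image_generation": 1.2,
    "video_generation": 2.0,
}
weights = sampling_weights(recent_loss, STAGES["stage2_joint"])
```

A higher temperature flattens the distribution toward uniform sampling, while a lower one concentrates data on the task with the highest recent loss.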

Referee Report

1 major / 2 minor

Summary. The manuscript presents Lance, a lightweight unified multimodal model for understanding, generation, and editing of both images and videos. It is trained from scratch using a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, introduces modality-aware rotary positional encoding, and employs a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling. The central claim is that this approach allows Lance to substantially outperform existing open-source unified models in image and video generation while retaining strong multimodal understanding capabilities.

Significance. If the experimental claims hold with proper controls, this work could demonstrate a viable path for unified multimodal modeling that prioritizes multi-task synergy over model capacity scaling, offering an alternative design principle for efficient multimodal systems.

major comments (1)

[Experimental results section] The central claim of substantial outperformance in image and video generation (abstract) rests on the assertion that the dual-stream MoE, modality-aware RoPE, and staged adaptive scheduling are responsible. However, no ablation studies, single-stream baselines, fixed-RoPE controls, or non-adaptive scheduling comparisons are described, leaving open the possibility that gains arise from data volume, optimizer settings, or curation differences rather than the proposed components.

minor comments (2)

[Abstract] The abstract asserts experimental superiority but supplies no concrete datasets, metrics (e.g., FID, CLIP score), baselines, or quantitative deltas, which hinders immediate assessment of the strength of the results.
[Architecture description] The description of 'modality-aware rotary positional encoding' is introduced without a direct equation or pseudocode contrasting it to standard RoPE; adding this would clarify how it mitigates interference among heterogeneous visual tokens.
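The contrast requested in the second minor comment can be illustrated as follows. The rope function is the standard rotary encoding; modality_aware_rope is one hypothetical variant that shifts positions by a per-modality offset, chosen only for illustration. The offsets and function names are assumptions, and the paper's actual formulation may differ.

```python
import math

def rope(vec, position, base=10000.0):
    """Standard RoPE: rotate each (even, odd) pair by position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 0, 2, 4, ... is twice the pair index
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Hypothetical per-modality position offsets (not taken from the paper).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}

def modality_aware_rope(vec, position, modality, base=10000.0):
    """Illustrative variant: shift the position by a modality offset, then rotate."""
    return rope(vec, position + MODALITY_OFFSET[modality], base)

query = [1.0, 0.0, 0.0, 1.0]
as_text = modality_aware_rope(query, 3, "text")    # identical to rope(query, 3)
as_image = modality_aware_rope(query, 3, "image")  # rotated as if at position 1003
```

Under this assumed variant, attention scores within one modality still depend only on relative position, while tokens of different modalities are separated by a fixed positional gap.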

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the paper accordingly to strengthen the experimental validation.

Referee: [Experimental results section] The central claim of substantial outperformance in image and video generation (abstract) rests on the assertion that the dual-stream MoE, modality-aware RoPE, and staged adaptive scheduling are responsible. However, no ablation studies, single-stream baselines, fixed-RoPE controls, or non-adaptive scheduling comparisons are described, leaving open the possibility that gains arise from data volume, optimizer settings, or curation differences rather than the proposed components.

Authors: We agree that direct ablation studies would provide stronger evidence isolating the contributions of the dual-stream MoE, modality-aware RoPE, and staged adaptive scheduling from potential confounding factors such as data curation or optimization choices. The current manuscript focuses on end-to-end comparisons against prior open-source unified models to establish overall performance, but we recognize this leaves room for alternative explanations. In the revised version, we will add a dedicated ablation subsection in the experimental results. This will include: (1) single-stream vs. dual-stream MoE comparisons, (2) standard RoPE vs. modality-aware RoPE variants, and (3) fixed vs. adaptive data scheduling, with controls for data volume and training hyperparameters where feasible. These additions will directly address the concern without changing the core claims or results.

Revision: yes
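The three comparisons promised in the rebuttal form a 2 x 2 x 2 design. A minimal sketch, assuming illustrative factor and level names, enumerates the full grid and the one-factor-off ablations against the full model:

```python
from itertools import product

# Factors follow the rebuttal; the level names are illustrative assumptions.
FACTORS = {
    "stream": ["single", "dual"],
    "rope": ["standard", "modality_aware"],
    "scheduling": ["fixed", "adaptive"],
}

def ablation_grid(factors):
    """Enumerate every combination of factor levels as a list of config dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

grid = ablation_grid(FACTORS)
full_model = {"stream": "dual", "rope": "modality_aware", "scheduling": "adaptive"}

# Configurations that differ from the full model in exactly one factor.
one_off = [
    cfg for cfg in grid
    if sum(cfg[name] != full_model[name] for name in full_model) == 1
]
```

Running only the one-factor-off configurations isolates each component's contribution at lower cost than the full grid, at the price of missing interactions between components.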

Circularity Check

0 steps flagged

No significant circularity; empirical results from new architecture and training

The paper describes Lance as trained from scratch on a dual-stream MoE architecture over interleaved sequences, with modality-aware rotary positional encoding and staged multi-task training using adaptive scheduling. Outperformance claims rest on experimental results rather than any derivation that reduces by construction to fitted inputs, self-definitions, or self-citation chains. No equations or steps exhibit self-definitional loops, fitted parameters renamed as predictions, or uniqueness theorems imported from the authors' prior work. The central claims are self-contained empirical proposals evaluated against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the proposed architecture and training strategy, which are introduced without detailed external validation in the abstract.

axioms (1)

domain assumption: Unified context modeling combined with decoupled capability pathways enables effective collaborative multi-task training for multimodal sequences.
Presented as one of the two core principles grounding the entire model design.

invented entities (1)

modality-aware rotary positional encoding (no independent evidence)
purpose: mitigate interference among heterogeneous visual tokens and boost cross-task alignment
New component introduced to handle mixed image and video tokens in the shared sequence.

pith-pipeline@v0.9.0 · 5734 in / 1263 out tokens · 51806 ms · 2026-05-21T07:53:34.544718+00:00

Original abstract

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
cs.CV · 2026-06 · unverdicted · novelty 6.0
HarmVideoBench is a multi-layered benchmark for harmful video understanding in LVLMs with three hierarchical dimensions, and BCR is a method that raises average model performance from 61.7% to 84.4%.

S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing
cs.CV · 2026-06 · unverdicted · novelty 4.0
S1-Omni-Image unifies scientific image understanding, generation and editing via a think-before-generate paradigm on top of S1-VL-32B, trained on a 314K-sample SciGenEdit dataset, and reports SOTA results on multiple ...

Toward Native Multimodal Modeling: A Roadmap
cs.CV · 2026-05 · unverdicted · novelty 3.0
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-...

Reference graph

Works this paper leans on

151 extracted references · 151 canonical work pages · cited by 3 Pith papers · 60 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advancesin neural information processing systems, 35:23716–23736, 2022

work page 2022
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Improving image generation with better captions.Computer Science

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

work page 2023
[7]

Diffusion self-distillation for zero-shot customized image generation.arXiv preprint arXiv:2411.18616, 2024

Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, and Gordon Wetzstein. Diffusion self-distillation for zero-shot customized image generation.arXiv preprint arXiv:2411.18616, 2024

work page arXiv 2024
[8]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

work page 2022
[10]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

work page 2024
[11]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024

work page 2024
[13]

Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability

Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability.arXiv preprint arXiv:2411.18211, 2024

work page arXiv 2024
[14]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024. 26

work page 2024
[16]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

work page 2024
[17]

Umo: Scaling multi- identity consistency for image customization via matching reward.arXiv preprint arXiv:2509.06818, 2025

Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, and Qian He. Umo: Scaling multi-identity consistency for image customization via matching reward.arXiv preprint arXiv:2509.06818, 2025

work page arXiv 2025
[18]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advancesin neural information processing systems, 36:49250–49267, 2023

work page 2023
[21]

Chatumm: Robust context tracking for conversational interleaved generation

Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, et al. Chatumm: Robust context tracking for conversational interleaved generation. arXiv preprint arXiv:2602.06442, 2026

work page internal anchor Pith review arXiv 2026
[22]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International conference on machine learning, pages 7480–7512. PMLR, 2023

work page 2023
[23]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Cogview: Mastering text-to-image generation via transformers.NIPS, 34:19822–19835, 2021

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.NIPS, 34:19822–19835, 2021

work page 2021
[25]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021
[26]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

work page 2024
[27]

Unified autoregressive visual generation and understanding with continuous tokens

Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436, 2025

work page arXiv 2025
[28]

arXiv preprint arXiv:2408.14023 , year=

JiajunFei, DianLi, ZhidongDeng, ZekunWang, GangLiu, andHuiWang. Video-ccam: Enhancingvideo-language understanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

work page arXiv 2024
[29]

DreamLite: A lightweight on-device unified model for image generation and editing,

Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, and Yuan Gao. Dreamlite: A lightweight on-device unified model for image generation and editing.arXiv preprint arXiv:2603.28713, 2026

work page arXiv 2026
[30]

Feededit: Text-based image editing with dynamic feedback regulation

Fengyi Fu, Lei Zhang, Mengqi Huang, and Zhendong Mao. Feededit: Text-based image editing with dynamic feedback regulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2661– 2670, 2025

work page 2025
[31]

Layeredit: Disentangled multi-object editing via conflict-aware multi-layer learning

Fengyi Fu, Mengqi Huang, Lei Zhang, and Zhendong Mao. Layeredit: Disentangled multi-object editing via conflict-aware multi-layer learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4003–4011, 2026

work page 2026
[32]

Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2(1):1–17, 2024. 27

work page 2024
[33]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

work page 2023
[35]

Gemini 3 Pro Image Model Card

Google DeepMind. Gemini 3 Pro Image Model Card. https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf, November 2025. Model card published: November 2025

work page 2025
[36]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

TV2TV: A Unified Framework for Interleaved Language and Video Generation,

Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, et al. Tv2tv: A unified framework for interleaved language and video generation. arXiv preprint arXiv:2512.05103, 2025

work page arXiv 2025
[38]

arXiv 2512.04810 (2025)

Xin He, Longhui Wei, Jianbo Ouyang, Minghui Liao, Lingxi Xie, and Qi Tian. Emma: Efficient multimodal understanding, generation, and editing with a unified architecture.arXiv preprint arXiv:2512.04810, 2025

work page arXiv 2025
[39]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

Denoising diffusion probabilistic models.NIPS, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.NIPS, 33:6840–6851, 2020

work page 2020
[41]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Dse-gan: Dynamic semantic evolution generative adversarial network for text-to-image generation

Mengqi Huang, Zhendong Mao, Penghui Wang, Quan Wang, and Yongdong Zhang. Dse-gan: Dynamic semantic evolution generative adversarial network for text-to-image generation. InProceedings of the 30th ACM International Conference on Multimedia, pages 4345–4354, 2022

work page 2022
[44]

Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization

Mengqi Huang, Zhendong Mao, Zhuowei Chen, and Yongdong Zhang. Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22596–22605, 2023

work page 2023
[45]

Realcustom: Narrowing real text word for real-time open-domain text-to-image customization

Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, and Yongdong Zhang. Realcustom: Narrowing real text word for real-time open-domain text-to-image customization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7476–7485, 2024

work page 2024
[46]

Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

work page 2026
[47]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

work page 2024
[48]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF InternationalConference on Computer Vision, pages 17191–17202, 2025

work page 2025
[49]

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, et al. Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

arXiv preprint arXiv:2503.19907 (2025)

Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Qiang Xu. Fulldit: Multi-task video generative foundation model with full attention.arXiv preprint arXiv:2503.19907, 2025

work page arXiv 2025
[51]

Kling ai.https://klingai.kuaishou.com/, 2024

Kling AI. Kling ai.https://klingai.kuaishou.com/, 2024. Accessed: 2024-06-06. 28

work page 2024
[52]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Anyv2v: A tuning-free framework for any video-to-video editing tasks

Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024

[54]

Flux: Official inference repository for flux.1 models

Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. URL https://github.com/black-forest-labs/flux. Accessed: 2025-02-07

[55]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

[56]

Obelics: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36:71683–71702, 2023

[57]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

[58]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36:30146–30166, 2023

[59]

Onecat: Decoder-only auto-regressive model for unified understanding and generation

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025

[60]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

[61]

Videochat: Chat-centric video understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. Science China Information Sciences, 68(10):200102, 2025

[62]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024

[63]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024

[64]

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025

[65]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

[66]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

[67]

Realgeneral: Unifying visual generation via temporal in-context learning with video models

Yijing Lin, Mengqi Huang, Shuhan Zhuang, and Zhendong Mao. Realgeneral: Unifying visual generation via temporal in-context learning with video models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14994–15004, 2025

[68]

Flow Matching Guide and Code

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024

[69]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[70]

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268, 2024

[71]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

[72]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

[73]

Llavanext: Improved reasoning, ocr, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

[74]

Mardini: Masked autoregressive diffusion for video generation at scale

Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024

[75]

Flow-grpo: Training flow matching models via online rl

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. Advances in neural information processing systems, 38:40783–40818, 2026

[76]

St-llm: Large language models are effective temporal learners

Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In European Conference on Computer Vision, pages 1–18. Springer, 2024

[77]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

[78]

TUNA: Taming unified visual representations for native unified multimodal models

Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, et al. Tuna: Taming unified visual representations for native unified multimodal models. arXiv preprint arXiv:2512.02014, 2025

[79]

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, and Yuren Cong. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation. arXiv preprint arXiv:2604.24763, 2026

[80]

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025


Showing first 80 references.

[47] [47]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

work page 2024

[48] [48]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF InternationalConference on Computer Vision, pages 17191–17202, 2025

work page 2025

[49] [49]

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, et al. Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

arXiv preprint arXiv:2503.19907 (2025)

Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Qiang Xu. Fulldit: Multi-task video generative foundation model with full attention.arXiv preprint arXiv:2503.19907, 2025

work page arXiv 2025

[51] [51]

Kling ai.https://klingai.kuaishou.com/, 2024

Kling AI. Kling ai.https://klingai.kuaishou.com/, 2024. Accessed: 2024-06-06. 28

work page 2024

[52] [52]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Anyv2v: A plug-and-play framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

work page arXiv 2024

[54] [54]

Flux: Official inference repository for flux.1 models, 2024

Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. URLhttps://github.com/ black-forest-labs/flux. Accessed: 2025-02-07

work page 2024

[55] [55]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

Obelics: An open web-scale filtered dataset of interleaved image-text documents.Advancesin Neural Information Processing Systems, 36:71683–71702, 2023

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents.Advancesin Neural Information Processing Systems, 36:71683–71702, 2023

work page 2023

[57] [57]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36:30146–30166, 2023

work page 2023

[59] [59]

Onecat: Decoder-only auto-regressive model for unified understanding and generation

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025

work page arXiv 2025

[60] [60]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

work page 2024

[61] [61]

Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

work page 2025

[62] [62]

Autoregressive image generation without vector quantization.Advancesin Neural Information Processing Systems, 37:56424–56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advancesin Neural Information Processing Systems, 37:56424–56445, 2024

work page 2024

[63] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024

[64] Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025

[65] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

[66] Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

[67] Yijing Lin, Mengqi Huang, Shuhan Zhuang, and Zhendong Mao. Realgeneral: Unifying visual generation via temporal in-context learning with video models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14994–15004, 2025

[68] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024

[69] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[70] Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268, 2024

[71] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

[72] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

[73] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

[74] Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024

[75] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. Advances in neural information processing systems, 38:40783–40818, 2026

[76] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In European Conference on Computer Vision, pages 1–18. Springer, 2024

[77] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

[78] Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, et al. Tuna: Taming unified visual representations for native unified multimodal models. arXiv preprint arXiv:2512.02014, 2025

[79] Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, and Yuren Cong. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation. arXiv preprint arXiv:2604.24763, 2026

[80] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025