ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

Chunyu Wang; Fellix Song; Haoji Zhang; Jianwei Zhang; Linqing Wang; Runze He; Shiyi Zhang; Wayne Zhuang; Wenxun Dai; Yansong Tang

arXiv:2602.06442v2 · pith:2WL5V7VU · submitted 2026-02-06 · cs.CV

Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, Wayne Zhuang, Yong Liu, Haoji Zhang, Yansong Tang, Chunyu Wang

classification cs.CV

keywords chatumm, conversational, interleaved, dialogues, generation, models, multimodal, unified

Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
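To make the "distractor turn" idea in the abstract concrete, here is a minimal sketch of how single-turn samples might be chained into one serialized dialogue. It is an illustration under stated assumptions, not the authors' pipeline: the `Sample` and `Turn` types, the `build_dialogue` and `serialize` helpers, the `<image:...>` placeholder syntax, and the template-based query rewrite are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """A hypothetical single-turn example: one instruction, one response."""
    instruction: str
    response: str
    image: Optional[str] = None  # placeholder image id, not real pixel data


@dataclass
class Turn:
    role: str                    # "user" or "assistant"
    text: str
    image: Optional[str] = None


def to_turns(sample: Sample) -> List[Turn]:
    """Expand one single-turn sample into a user turn and an assistant turn."""
    return [
        Turn("user", sample.instruction),
        Turn("assistant", sample.response, sample.image),
    ]


def build_dialogue(anchor: Sample, distractors: List[Sample],
                   follow_up: Sample) -> List[Turn]:
    """Chain samples so the follow-up must refer back past the distractors."""
    turns = to_turns(anchor)
    for distractor in distractors:
        turns += to_turns(distractor)  # unrelated turns between anchor and follow-up
    # Assumed rewrite rule: point the query at the anchor turn by index
    # instead of restating its content.
    rewritten = f"{follow_up.instruction} (apply this to the image from turn 1)"
    turns.append(Turn("user", rewritten))
    turns.append(Turn("assistant", follow_up.response, follow_up.image))
    return turns


def serialize(turns: List[Turn]) -> str:
    """Flatten the dialogue into one text-image stream, one turn per line."""
    lines = []
    for turn in turns:
        line = f"<{turn.role}> {turn.text}"
        if turn.image:
            line += f" <image:{turn.image}>"
        lines.append(line)
    return "\n".join(lines)


dialogue = build_dialogue(
    anchor=Sample("Draw a red bicycle.", "Here is the bicycle.", "img_a"),
    distractors=[Sample("What is the capital of France?", "Paris.")],
    follow_up=Sample("Make it blue.", "Here is the blue version.", "img_b"),
)
stream = serialize(dialogue)
```

In this sketch the final user query only makes sense if the model resolves "the image from turn 1" across the intervening distractor turn, which is the kind of long-range dependency the abstract describes. A real pipeline would presumably generate the rewritten query with a model rather than a fixed template.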

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lance: Unified Multimodal Modeling by Multi-Task Synergy
cs.CV 2026-05 unverdicted novelty 6.0

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keepin...

Meta-CoT: Enhancing Granularity and Generalization in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

Lance: Unified Multimodal Modeling by Multi-Task Synergy
cs.CV 2026-05 unverdicted novelty 5.0

Lance introduces a dual-stream MoE model with modality-aware rotary positional encoding and staged multi-task training that outperforms open-source unified models on image and video generation while retaining understa...