FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Bo Zheng; Hao Sun; Jinsong Lan; Liujuan Cao; Mengting Chen; Quanjian Song; Xiaoyong Zhu; Yefeng Shen

arxiv: 2605.15824 · v2 · pith:ETMRAALJnew · submitted 2026-05-15 · 💻 cs.CV

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Quanjian Song , Yefeng Shen , Mengting Chen , Hao Sun , Jinsong Lan , Xiaoyong Zhu , Bo Zheng , Liujuan Cao This is my paper

Pith reviewed 2026-05-20 20:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords garment video customizationreal-time video generationinteractive editingmotion coherencein-context learningKV cache reschedulingvideo distillationautoregressive models

0 comments

The pith

A framework achieves real-time interactive garment switching in human videos while preserving motion coherence using only single-garment training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to enable users to change garments on the fly during video generation without losing the consistency of the person's movements. Current methods require lots of multi-garment data and run too slow for interactive use. By training a teacher model on mismatched single-garment pairs to learn coherence implicitly, then distilling it and adding a way to reschedule key-value caches on the fly, the approach delivers fast generation. This would matter for online shopping try-ons and quick video content creation where users want to experiment with outfits instantly.

Core claim

The authors show that a combination of in-context learning on mismatched single-garment video pairs for a teacher model, followed by streaming distillation with in-context teacher forcing and gradient-reweighted distribution matching, plus training-free KV cache rescheduling for garment KV refresh, historical KV withdraw, and reference KV disentangle, allows the model to support interactive multi-garment customization in autoregressive video generation while keeping motion coherence and running at real-time speeds of 23.8 FPS on a single GPU, which is 30-180 times faster than baselines.

What carries the argument

Training-Free KV Cache Rescheduling, which refreshes garment-specific keys and values, withdraws historical ones, and disentangles reference information to enable seamless garment switches without retraining.

Load-bearing premise

That deliberately mismatching the reference garment image during training on single-garment videos will teach the model to keep motion coherent when garments are switched at generation time.

What would settle it

Generate a video where the garment is switched halfway and check if the person's limb movements and body posture continue exactly as in the original motion sequence without jumps or inconsistencies.

Figures

Figures reproduced from arXiv: 2605.15824 by Bo Zheng, Hao Sun, Jinsong Lan, Liujuan Cao, Mengting Chen, Quanjian Song, Xiaoyong Zhu, Yefeng Shen.

**Figure 2.** Figure 2: Average performance (Cur., GME, Amp., Smoo., and VQ) and inference speed comparison across different approaches. Despite this progress, existing customization methods mainly focus on human-centric subject consistency, with comparatively less emphasis on fine-grained human attributes. Among these attributes, garment-level customization is particularly desirable in practical applications such as filmmak… view at source ↗

**Figure 3.** Figure 3: Overall pipeline of FashionChameleon: Teacher Model with In-Context Learning, Streaming Distillation with In-Context Learning, and Training-Free KV Cache Rescheduling. is generated autoregressively. Self-Forcing [28] further improves this paradigm with self-rolling, conditioning on self-generated rather than ground-truth history to better align training with inference. To avoid training from scratch, most … view at source ↗

**Figure 4.** Figure 4: (Left) Generated sequences during garment switching. Directly refreshing the garment KV [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of our FashionChameleon with other baselines. Due to space [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Additional applications of FashionChameleon. It supports both [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative ablation of Gradient-Reweighted Distribution Matching Distillation (DMD) and Reference KV Disentangle. Gradient-Reweighted DMD alleviates motion collapse during extrapolation, while Reference KV Disentangle further enhances consistency during garment switching [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: The high-quality data curation pipeline of FashionChameleon. It consists of four stages: [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Data analysis and representative samples of HGC-Bench. (a) A word cloud generated from [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Quantitative results of the human evaluation. We compare our FashionChameleon [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Additional qualitative comparison between our FashionChameleon and other baselines. [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Additional qualitative comparison between our FashionChameleon and other baselines. [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Additional results for short video customization using our FashionChameleon. [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Additional results for short video customization using our FashionChameleon. [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 16.** Figure 16: Unlike conventional methods that require a reference image to be specified in advance, [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗

**Figure 15.** Figure 15: Additional visualizations for interactive multi-garment video customization using our [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Additional visualizations for interactive multi-garment video customization using our [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Additional long video extrapolation visualizations of our FashionChameleon. [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗

**Figure 18.** Figure 18: Additional long video extrapolation visualizations of our FashionChameleon. [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗

read the original abstract

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garment during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gets interactive garment switching in video generation from single-garment data alone via mismatch training and KV cache rescheduling, but the abstract shows no numbers on whether coherence actually holds.

read the letter

The main thing to know is that this work trains on single-garment video pairs but still supports user-driven garment switches during autoregressive generation. They do it by mismatching the reference garment image in the teacher model so the network leans on motion context instead of appearance, then add streaming distillation and a training-free KV cache scheme with garment refresh, historical withdraw, and reference disentangle to keep things coherent at inference time. The reported 23.8 FPS on one GPU is the concrete performance number they lead with.

Referee Report

1 major / 2 minor

Summary. The manuscript presents FashionChameleon, a real-time framework for interactive multi-garment video customization in autoregressive human video generation. Using only single-garment video data, it introduces three techniques: (i) a teacher model trained with in-context learning on deliberately mismatched reference-garment pairs to implicitly encourage motion coherence preservation during switches, (ii) streaming distillation with in-context learning and gradient-reweighted distribution matching for consistency and efficiency, and (iii) training-free KV cache rescheduling (including garment KV refresh, historical KV withdraw, and reference KV disentangle) to enable mid-video garment switching. The work reports 23.8 FPS on a single GPU (30-180× faster than baselines) and claims support for consistent long-video extrapolation.

Significance. If the empirical validation holds, the result would represent a meaningful advance for low-latency interactive applications in e-commerce and content creation. The combination of single-garment-only training, training-free switching mechanism, and real-time performance addresses practical limitations in prior human-centric video generation approaches. The reported speed and the design of the KV cache rescheduling are concrete strengths that could enable new interactive workflows.

major comments (1)

[Abstract, first key technique] Description of first key technique (mismatch training on single-garment pairs): The central claim that deliberately mismatching the reference garment image while training on single-garment video pairs will implicitly teach reliable motion-garment disentanglement and coherence preservation for mid-video switches rests on an indirect inductive bias. No explicit temporal consistency loss, switch-specific supervision, or autoregressive error analysis is described, raising the risk that small inconsistencies accumulate over long horizons or complex motions. This assumption is load-bearing for the interactive customization and long-video extrapolation claims; an ablation (matched vs. mismatched training) with quantitative coherence metrics (e.g., temporal consistency scores or user studies) is required to substantiate it.

minor comments (2)

[Abstract] The speed claim of '30-180× faster' would be clearer if the specific baselines, their hardware configurations, and the exact metric (e.g., latency per frame) were stated explicitly.
[Description of third technique] Notation for 'KV Cache Rescheduling' components (garment KV refresh, historical KV withdraw, reference KV disentangle) could be formalized with a short diagram or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed analysis of our work on FashionChameleon. We address the major comment regarding the mismatch training technique below. We agree that additional evidence would strengthen the presentation and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract, first key technique] Description of first key technique (mismatch training on single-garment pairs): The central claim that deliberately mismatching the reference garment image while training on single-garment video pairs will implicitly teach reliable motion-garment disentanglement and coherence preservation for mid-video switches rests on an indirect inductive bias. No explicit temporal consistency loss, switch-specific supervision, or autoregressive error analysis is described, raising the risk that small inconsistencies accumulate over long horizons or complex motions. This assumption is load-bearing for the interactive customization and long-video extrapolation claims; an ablation (matched vs. mismatched training) with quantitative coherence metrics (e.g., temporal consistency scores or user studies) is required to substantiate it.

Authors: We appreciate the referee pointing out the indirect nature of this inductive bias. The core motivation is that standard matched reference-garment training allows the model to shortcut by directly copying appearance from the reference image, bypassing the need to preserve motion coherence from the video context. By deliberately mismatching the reference garment while keeping the single-garment video pairs, the in-context learning objective forces the model to attend to the temporal sequence for motion information rather than relying on appearance alignment. This is retained within the standard image-to-video paradigm without adding new losses. Our reported results on long-video extrapolation and interactive switching (including 23.8 FPS performance and coherence in multi-garment scenarios) provide supporting evidence that inconsistencies do not accumulate in practice, aided by the training-free KV cache mechanisms. That said, we agree a controlled ablation (matched vs. mismatched) with metrics such as temporal consistency scores and user studies would make the claim more robust. We will include this ablation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper proposes FashionChameleon as a framework using three explicit techniques to enable interactive multi-garment video customization from single-garment data only. Technique (i) trains a teacher model on mismatched reference-garment pairs to encourage implicit motion coherence; (ii) applies streaming distillation with in-context learning and gradient-reweighted matching; (iii) uses training-free KV cache rescheduling for switches. These are design choices and training procedures whose outputs are evaluated empirically against baselines (e.g., 23.8 FPS, 30-180× speedup). No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear that would reduce any central claim to its inputs by construction. The derivation therefore remains independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the paper introduces three new procedural components rather than new physical entities or free parameters; no explicit fitted constants or ad-hoc axioms are stated.

axioms (1)

domain assumption Autoregressive video generation models can maintain motion coherence when reference and target garment images are deliberately mismatched during training.
Invoked in the description of the Teacher Model with In-Context Learning.

pith-pipeline@v0.9.0 · 5843 in / 1365 out tokens · 52553 ms · 2026-05-20T20:05:31.792661+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

train a Teacher Model with In-Context Learning on a single reference-garment pair... enforcing a mismatch between the reference and garment image

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy
cs.CV 2026-06 unverdicted novelty 7.0

TryOnCrafter is the first DiT-based framework for camera-controllable video virtual try-on via a renderable 4D try-on proxy distilled from 2D priors into 3DGS avatar animated with SMPL-X.