TimeColor: Flexible Reference Colorization via Temporal Concatenation

Bryan Constantine Sadihin; Hang Su; Matteo Jiahao Chen; Michael Hua Wang; Yihao Meng

arxiv: 2601.00296 · v2 · submitted 2026-01-01 · 💻 cs.CV

TimeColor: Flexible Reference Colorization via Temporal Concatenation

Bryan Constantine Sadihin , Yihao Meng , Michael Hua Wang , Matteo Jiahao Chen , Hang Su This is my paper

Pith reviewed 2026-05-16 18:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords video colorizationmulti-reference colorizationsketch-based generationdiffusion modelstemporal concatenationmasked attentionrotary position embeddinganime video

0 comments

The pith

TimeColor treats any number of reference images as extra frames concatenated in time to colorize video sketches with fixed model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TimeColor as a diffusion-based method for sketch video colorization that accepts heterogeneous references of varying count, such as character sheets or background images, rather than restricting to a single first-frame reference. It achieves this by encoding each reference as an additional latent frame and concatenating them temporally with the target sequence so the model processes them together in every diffusion step. Spatiotemporal correspondence-masked attention and modality-disjoint RoPE indexing enforce correct binding and block palette leakage between identities. Experiments on the Sakuga-42M dataset under single- and multi-reference protocols report gains in color accuracy, character consistency, and frame-to-frame stability.

Core claim

TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. It uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage.

What carries the argument

Temporal concatenation of reference latents processed with correspondence-masked spatiotemporal attention and modality-disjoint RoPE indexing.

If this is right

The same fixed-parameter model handles one reference or many without retraining or architecture changes.
Color fidelity, identity consistency, and temporal stability all increase under both single- and multi-reference evaluation protocols.
References can be arbitrary colorized frames, character sheets, or background images rather than only the first frame.
Explicit per-reference region assignment becomes possible without altering the diffusion backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on live-action video to see whether the same mechanisms reduce reference leakage in non-animated footage.
Production pipelines might use it to feed multiple artist-provided references in a single pass instead of sequential single-reference steps.
Extending the temporal concatenation idea to other conditioning signals such as depth maps or motion sketches would be a direct next experiment.

Load-bearing premise

The masking and indexing will reliably tie each reference to its intended subject across real-world inputs without creating new artifacts or needing per-scene adjustments.

What would settle it

A multi-reference test sequence in which adding a second reference produces visible color mixing between characters or sudden palette shifts between frames.

read the original abstract

Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject -- reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on Sakuga-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines. Our project page is available at https://bconstantine.github.io/TimeColor/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TimeColor adds temporal concatenation and masked attention for multi-reference video colorization, but without numbers or ablations it's unclear if the binding tricks are necessary or if extra frames alone would suffice.

read the letter

TimeColor concatenates variable references as extra latent frames in the diffusion process and adds spatiotemporal correspondence-masked attention plus modality-disjoint RoPE to keep subjects bound and stop color leakage across identities. This lets the model take character sheets, background plates, or any mix of references without changing parameter count or retraining for each new count. The approach is a direct response to the single-reference limit in prior sketch-based video colorization work, and the architecture description is clear enough that someone could reimplement the core pieces from the text. On Sakuga-42M it reports better fidelity, identity consistency, and temporal stability in both single- and multi-reference settings, which matches what you would expect from giving the model more conditioning pixels when the binding works. The practical upside is real for animation pipelines that already produce multiple reference images. The main gap is the missing evidence. The abstract supplies no metrics, no listed baselines, no error bars, and no ablation that isolates the masked attention or the special RoPE from the simple effect of feeding more frames. If standard attention on the concatenated sequence already captures most of the gain, then the central claim that these mechanisms prevent shortcutting and leakage rests on an untested assumption. That is the exact point the stress-test note raises, and nothing in the provided description contradicts it. The paper is aimed at people who build or tune reference-conditioned video models. A reader working on diffusion video pipelines would pick up the concatenation trick and the masking pattern quickly. I would send it to peer review because the problem is concrete, the fix is lightweight, and the missing controls are straightforward to add. Referees can ask for the ablations and the actual scores without the work being fundamentally broken.

Referee Report

2 major / 2 minor

Summary. The paper proposes TimeColor, a sketch-based video colorization diffusion model that supports an arbitrary number of heterogeneous references (e.g., character sheets, background images, or additional colorized frames) by encoding them as extra latent frames that are temporally concatenated with the target sequence. This keeps the parameter count fixed while allowing concurrent processing. The model adds spatiotemporal correspondence-masked attention to enforce subject-reference binding and modality-disjoint RoPE indexing to reduce shortcutting and cross-identity palette leakage. Experiments on Sakuga-42M under single- and multi-reference protocols report gains in color fidelity, identity consistency, and temporal stability relative to prior baselines.

Significance. If the reported gains hold under rigorous controls and the binding mechanisms prove necessary rather than incidental, the work would meaningfully extend reference-based colorization to practical multi-reference scenarios common in animation pipelines. The fixed-parameter design via temporal concatenation is a clean architectural choice that could generalize beyond colorization.

major comments (2)

[Experiments / §4] The central claim that spatiotemporal correspondence-masked attention and modality-disjoint RoPE are required to mitigate shortcutting and cross-identity leakage rests on aggregate improvements over baselines. No ablation is described that compares the full model against a control using standard (unmasked) attention and joint RoPE while still supplying the same extra reference frames; without this isolation, it remains possible that gains derive primarily from additional reference pixels rather than the proposed binding mechanisms.
[Experiments / §4] Quantitative results are summarized at a high level (improvements in fidelity, consistency, and stability) without reported metrics, error bars, baseline details, or per-scene breakdowns. This makes it impossible to assess effect sizes or whether the method remains stable across diverse real-world reference sets without per-scene tuning.

minor comments (2)

[Abstract] The abstract states empirical improvements but supplies no numerical values, making it difficult for readers to gauge the magnitude of gains before reading the full results section.
[Method] Notation for the masked attention and RoPE variants should be introduced with explicit equations or pseudocode in the method section to clarify how masking is applied across the concatenated temporal dimension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and will revise the paper to strengthen the experimental validation as suggested.

read point-by-point responses

Referee: [Experiments / §4] The central claim that spatiotemporal correspondence-masked attention and modality-disjoint RoPE are required to mitigate shortcutting and cross-identity leakage rests on aggregate improvements over baselines. No ablation is described that compares the full model against a control using standard (unmasked) attention and joint RoPE while still supplying the same extra reference frames; without this isolation, it remains possible that gains derive primarily from additional reference pixels rather than the proposed binding mechanisms.

Authors: We agree that the current manuscript does not include a controlled ablation isolating the spatiotemporal correspondence-masked attention and modality-disjoint RoPE from the mere provision of additional reference frames. Our existing comparisons are against prior single-reference baselines, which leaves open the possibility that gains stem primarily from extra conditioning pixels. In the revised version we will add a dedicated ablation: the full TimeColor model versus an otherwise identical variant that uses standard (unmasked) spatiotemporal attention and joint RoPE while receiving the same set of extra reference frames. This will directly test whether the proposed binding mechanisms are necessary to reduce shortcutting and cross-identity palette leakage. revision: yes
Referee: [Experiments / §4] Quantitative results are summarized at a high level (improvements in fidelity, consistency, and stability) without reported metrics, error bars, baseline details, or per-scene breakdowns. This makes it impossible to assess effect sizes or whether the method remains stable across diverse real-world reference sets without per-scene tuning.

Authors: We acknowledge that the experimental section currently presents results at a summary level. The revised manuscript will expand §4 to report the concrete metrics employed (PSNR, SSIM, LPIPS for fidelity; identity-consistency scores and temporal-warping error for stability), include error bars computed over multiple random seeds, provide full baseline implementation details, and add per-scene performance tables. These additions will enable readers to evaluate effect sizes and assess robustness across heterogeneous reference sets without per-scene hyper-parameter tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural proposal with external empirical validation

full rationale

The paper describes an architectural extension to diffusion-based video colorization that concatenates reference frames temporally and augments attention with masked correspondence and modality-disjoint RoPE. No equations, derivations, or first-principles predictions are presented that reduce to fitted parameters or self-referential definitions. All performance claims rest on aggregate metrics from the external Sakuga-42M dataset under single- and multi-reference protocols; the mechanisms are motivated by design goals rather than any closed loop that equates outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The model presumably inherits standard diffusion-model assumptions and attention mechanisms from prior literature.

pith-pipeline@v0.9.0 · 5470 in / 1079 out tokens · 27432 ms · 2026-05-16T18:00:57.170757+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

TimeColor encodes references as additional latent frames which are concatenated temporally... uses spatiotemporal correspondence-masked attention to enforce subject–reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose TimeColor, a DiT-based framework for sketch video colorization supporting variable-count, heterogeneous multi-reference conditioning with explicit region-level control.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.