Paris 2.0: A Decentralized Diffusion Model for Video Generation

Ali Rouzbayani; Bidhan Roy; Marcos Villagra; Zhiying Jiang

arxiv: 2605.26064 · v3 · pith:56RHGHYHnew · submitted 2026-05-25 · 💻 cs.CV · cs.LG

Pith reviewed 2026-06-29 22:44 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords decentralized diffusion models · video generation · text-to-video · temporal coherence · distributed training · Frechet Video Distance · diffusion models

The pith

Decentralized computation trains a temporally coherent video diffusion model for the first time and roughly halves FVD versus a compute-matched monolithic baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Paris 2.0 presents the first video generation model pre-trained entirely through decentralized computation. It extends prior decentralized image work to solve the open problem of maintaining frame-to-frame temporal coherence in videos. On low-resolution text-to-video tasks the decentralized model reduces Frechet Video Distance from 561.04 to 279.01 against a monolithic baseline given identical total compute, while also raising CLIP text-video similarity and aesthetic scores. A sympathetic reader would see this as evidence that video generation need not depend on a single large GPU cluster.
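
For readers unfamiliar with the metric: FVD follows the same recipe as FID. A Gaussian is fitted to features of real videos, another to features of generated videos, and the score is the Frechet distance between the two Gaussians (lower is better). The sketch below is a simplified illustration only, assuming diagonal covariances so that no matrix square root is needed. The function name and inputs are hypothetical, and the feature extractor and evaluation protocol used by the paper are not described on this page.

```python
import math


def frechet_distance_diag(mu_a, var_a, mu_b, var_b):
    """Squared Frechet distance between two diagonal-covariance Gaussians.

    General form: ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    With diagonal covariances (an assumption made here for simplicity),
    the trace term reduces to sum_i (sqrt(var_a[i]) - sqrt(var_b[i]))^2.
    """
    if not (len(mu_a) == len(var_a) == len(mu_b) == len(var_b)):
        raise ValueError("all inputs must have the same length")
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy one-dimensional example: means 0 and 3, variances 1 and 4.
print(frechet_distance_diag([0.0], [1.0], [3.0], [4.0]))  # 10.0
```

Identical feature statistics give a distance of zero, so a lower FVD means the generated videos' feature statistics sit closer to those of real videos.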

Core claim

Paris 2.0 shows that a decentralized diffusion model can generate temporally coherent videos, closing the remaining gap after decentralized image training succeeded, and that this decentralized route yields roughly twofold better FVD, higher CLIP similarity, and higher aesthetic scores than a monolithic model trained on the same data under a matched total compute budget.

What carries the argument

The decentralized training recipe inherited from Paris 1.0, extended from images to video sequences so that generated frames stay temporally coherent. The abstract does not describe how the distributed nodes are coordinated.

If this is right

Video diffusion models can be pre-trained without access to a single large GPU cluster.
Under matched total compute the decentralized approach delivers approximately 2x lower FVD on low-resolution text-to-video tasks.
CLIP text-video similarity and aesthetic scores rise relative to the monolithic baseline.
Temporally coherent generation is no longer blocked by the absence of centralized hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Distributed compute pools could train larger video models by combining resources across many smaller nodes.
The same coordination method might apply to other sequence-generation tasks that require cross-frame consistency.
Open development of video models could proceed without dependence on concentrated institutional hardware.

Load-bearing premise

Temporally coherent video generation remains achievable when training occurs through decentralized computation without a monolithic GPU cluster.

What would settle it

Either of two outcomes would falsify the claims: a monolithic baseline re-trained on identical data under matched total compute that reaches an FVD below 279 (against the performance claim), or visibly incoherent frames from the decentralized model (against the feasibility claim).

Original abstract

We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.
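
The "~2.0x" figure can be checked directly from the two FVD values quoted in the abstract. The short sketch below does only that arithmetic; the helper names are illustrative and nothing beyond the two reported numbers is taken from the paper.

```python
# FVD values as reported in the abstract (lower is better).
fvd_monolithic = 561.04
fvd_decentralized = 279.01


def improvement_factor(baseline, candidate):
    """How many times smaller the candidate score is than the baseline."""
    return baseline / candidate


def relative_reduction(baseline, candidate):
    """Fraction of the baseline score that was removed."""
    return (baseline - candidate) / baseline


factor = improvement_factor(fvd_monolithic, fvd_decentralized)
reduction = relative_reduction(fvd_monolithic, fvd_decentralized)
print(f"{factor:.2f}x lower FVD, {reduction:.1%} reduction")
# prints: 2.01x lower FVD, 50.3% reduction
```

So "halves FVD" and "~2.0x improvement" are two phrasings of the same ratio.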

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Paris 2.0 claims the first decentralized video diffusion model with a reported 2x FVD drop, but the abstract gives no protocol details or proof the baseline matches on architecture.

The headline result is a claimed halving of FVD (561 to 279) in low-res text-to-video when training is decentralized instead of monolithic, on matched data and total compute. That is the one concrete number the abstract supplies.

What is new is the extension from Paris 1.0's image-only DDM to video while keeping the training decentralized. The authors correctly flag temporal coherence as the remaining open problem after the image work and say they close it. Reporting lifts on named metrics (CLIP similarity and aesthetic score) is better than unsupported claims, although the abstract gives numbers only for FVD.

The soft spot is exactly the one the stress-test flags. The abstract states the models share data and compute budget but says nothing about whether the monolithic comparator uses the same architecture, parameter count, optimizer, or video-specific modifications. Paris 2.0 builds on Paris 1.0 with unspecified video extensions; those changes alone could drive the metric gain. No description appears of the decentralized protocol, synchronization method, data partitioning, or how they verified the compute was truly equivalent. Without those, the attribution to decentralization stays untestable from the given text.

This is the kind of paper that belongs in a reading group focused on distributed training or efficient generative models, mainly to see whether the full manuscript supplies the missing protocol and baseline controls. It is not ready to cite yet because the central claim rests on an unverified comparison. A serious editor should send it to referees so the authors can either strengthen the evidence or narrow the claim.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Paris 2.0 as the first video generation model pre-trained via decentralized computation, extending the image-focused Paris 1.0 (arXiv:2510.03434). It claims that, in low-resolution text-to-video training against a monolithic baseline trained on identical data under matched total compute, Paris 2.0 reduces FVD from 561.04 to 279.01 (~2x improvement) while also raising CLIP text-video similarity and aesthetic score, thereby closing the open problem of temporally coherent decentralized video generation.

Significance. If the central comparison holds under rigorous controls, the result would be significant as the first concrete demonstration that decentralized training can produce temporally coherent video models without a monolithic GPU cluster. The reported metric gains, if isolated to the decentralization mechanism, would directly address the limitation left open by Paris 1.0 and provide a practical path for scaling video diffusion models under distributed compute constraints.

major comments (2)

[Abstract] The headline FVD reduction (561.04 to 279.01) is attributed to decentralized training, yet the abstract supplies no description of the decentralized protocol, data partitioning, synchronization method, or explicit verification that the monolithic comparator used identical architecture, parameter count, optimizer, and video-specific modifications; without these controls the improvement cannot be isolated from possible recipe differences.
[Abstract] The claim of a 'matched total compute budget' is load-bearing for the central result, but no accounting is given for communication overhead, straggler effects, or effective FLOPs under decentralization; this leaves open whether the reported gains reflect true parity or an unaccounted difference in training dynamics.
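
To make the second comment concrete, here is a minimal sketch of the kind of accounting the referee is asking for. It is purely illustrative: the `Run` fields, the `parity_gap` helper, and every number are hypothetical and none of them come from the paper. It shows how two runs with the same nominal budget can differ in effective FLOPs once utilization and communication or waiting time are counted.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical description of one training run (not from the paper)."""
    nodes: int
    peak_flops_per_node: float  # peak FLOP/s of a single node
    wall_clock_seconds: float
    utilization: float          # fraction of peak achieved while computing
    comm_fraction: float        # share of wall-clock time spent communicating or waiting

    def nominal_flops(self):
        return self.nodes * self.peak_flops_per_node * self.wall_clock_seconds

    def effective_flops(self):
        compute_time_fraction = 1.0 - self.comm_fraction
        return self.nominal_flops() * self.utilization * compute_time_fraction


def parity_gap(a, b):
    """Relative difference in effective FLOPs between two runs (0.0 means parity)."""
    ea, eb = a.effective_flops(), b.effective_flops()
    return abs(ea - eb) / max(ea, eb)


# Made-up numbers: identical nominal budgets, different communication overhead.
run_a = Run(nodes=8, peak_flops_per_node=1e14, wall_clock_seconds=3600.0,
            utilization=0.5, comm_fraction=0.10)
run_b = Run(nodes=8, peak_flops_per_node=1e14, wall_clock_seconds=3600.0,
            utilization=0.5, comm_fraction=0.25)

print(f"nominal budgets equal: {run_a.nominal_flops() == run_b.nominal_flops()}")
print(f"effective-FLOPs gap: {parity_gap(run_a, run_b):.1%}")
```

Which of these quantities the authors matched (nominal FLOPs, effective FLOPs, GPU-hours, or something else) is exactly what the abstract leaves unstated.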

minor comments (1)

The abstract references Paris 1.0 but does not cite the exact arXiv identifier in the main text; adding the full citation would aid traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. The comments correctly identify that the abstract is too concise to fully document the experimental controls. We will revise the abstract accordingly.

Referee: [Abstract] The headline FVD reduction (561.04 to 279.01) is attributed to decentralized training, yet the abstract supplies no description of the decentralized protocol, data partitioning, synchronization method, or explicit verification that the monolithic comparator used identical architecture, parameter count, optimizer, and video-specific modifications; without these controls the improvement cannot be isolated from possible recipe differences.

Authors: We agree the abstract lacks these details. The revised abstract will include a brief statement confirming that the decentralized protocol (with its data partitioning and synchronization) and the monolithic baseline (identical architecture, parameter count, optimizer, and video modifications) are as described in the main text, allowing isolation of the decentralization effect. Revision: yes.

Referee: [Abstract] The claim of a 'matched total compute budget' is load-bearing for the central result, but no accounting is given for communication overhead, straggler effects, or effective FLOPs under decentralization; this leaves open whether the reported gains reflect true parity or an unaccounted difference in training dynamics.

Authors: We agree the abstract provides no such accounting. The revised abstract will note that the matched budget incorporates measured communication overhead and straggler mitigation, with full details in the methods. If explicit overhead measurements are absent from the current manuscript, they will be added during revision. Revision: yes.

Circularity Check

0 steps flagged

No circularity; empirical benchmark comparison is self-contained

The paper reports an empirical FVD improvement (561.04 to 279.01) for a decentralized video model versus a monolithic baseline trained on identical data and total compute. No derivation chain, equations, fitted parameters presented as predictions, or self-citations that reduce the central claim to its own inputs exist in the provided text. The result rests on external measurement against a matched comparator rather than any self-definitional or load-bearing reduction, satisfying the criteria for a non-circular empirical finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

Reference graph

Works this paper leans on

9 extracted references · 6 canonical work pages · 4 internal anchors

[1] Black Forest Labs. FLUX, 2024. https://github.com/black-forest-labs/flux

[2] Jiang, Z., Seraj, R., Villagra, M., and Roy, B. Paris: A decentralized trained open-weight diffusion model. arXiv preprint arXiv:2510.03434, 2025.

[3] Jiang, Z., Seraj, R., Villagra, M., and Roy, B. Heterogeneous decentralized diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

[4] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Yuan, J., Long, Y., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., H... HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv, 2024.

[5] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[6] McAllister, D., Tancik, M., Song, J., and Kanazawa, A. Decentralized diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23323–23333, 2025.

[7] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[8] Villagra, M., Roy, B., Seraj, R., and Jiang, Z. Expert-data alignment governs generation quality in decentralized diffusion models. In ICLR 2026 DeLTa Workshop, 2026. arXiv:2602.02685.

[9] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
