arxiv: 2605.26064 · v3 · pith:56RHGHYHnew · submitted 2026-05-25 · 💻 cs.CV · cs.LG

Paris 2.0: A Decentralized Diffusion Model for Video Generation

Pith reviewed 2026-06-29 22:44 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords decentralized diffusion models · video generation · text-to-video · temporal coherence · distributed training · Frechet Video Distance · diffusion models

The pith

Decentralized computation trains the first temporally coherent video diffusion model and halves FVD versus matched monolithic training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Paris 2.0 presents the first video generation model pre-trained entirely through decentralized computation. It extends prior decentralized image work to solve the open problem of maintaining frame-to-frame temporal coherence in videos. On low-resolution text-to-video tasks the decentralized model reduces Frechet Video Distance from 561.04 to 279.01 against a monolithic baseline given identical total compute, while also raising CLIP text-video similarity and aesthetic scores. A sympathetic reader would see this as evidence that video generation need not depend on a single large GPU cluster.

Core claim

Paris 2.0 shows that a decentralized diffusion model can generate temporally coherent videos, closing the remaining gap after decentralized image training succeeded, and that this decentralized route yields roughly twofold better FVD, higher CLIP similarity, and higher aesthetic scores than a monolithic model trained on the same data under a matched total compute budget.

What carries the argument

The decentralized training recipe for diffusion models applied to video sequences, which coordinates parameter updates across distributed nodes to produce coherent frame sequences.

If this is right

  • Video diffusion models can be pre-trained without access to a single large GPU cluster.
  • Under matched total compute the decentralized approach delivers approximately 2x lower FVD on low-resolution text-to-video tasks.
  • CLIP text-video similarity and aesthetic scores rise relative to the monolithic baseline.
  • Temporally coherent generation is no longer blocked by the absence of centralized hardware.
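
The "approximately 2x" figure can be checked directly from the two reported FVD values. The snippet below is a minimal sketch of that arithmetic; the function name `fvd_improvement` is illustrative and does not come from the paper.

```python
def fvd_improvement(baseline: float, candidate: float) -> tuple[float, float]:
    """Return (ratio, relative_reduction) for a lower-is-better metric such as FVD."""
    if baseline <= 0 or candidate <= 0:
        raise ValueError("FVD values must be positive")
    ratio = baseline / candidate            # how many times lower the candidate is
    reduction = 1.0 - candidate / baseline  # fraction of the baseline FVD removed
    return ratio, reduction


# FVD values reported in the abstract: monolithic 561.04, Paris 2.0 279.01.
ratio, reduction = fvd_improvement(561.04, 279.01)
print(f"ratio: {ratio:.2f}x, relative reduction: {reduction:.1%}")
# prints: ratio: 2.01x, relative reduction: 50.3%
```

The ratio and the relative reduction describe the same gap in two ways: a ratio near 2.01 corresponds to removing just over half of the baseline FVD.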

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Distributed compute pools could train larger video models by combining resources across many smaller nodes.
  • The same coordination method might apply to other sequence-generation tasks that require cross-frame consistency.
  • Open development of video models could proceed without dependence on concentrated institutional hardware.

Load-bearing premise

Temporally coherent video generation remains achievable when training occurs through decentralized computation without a monolithic GPU cluster.

What would settle it

Either of two outcomes would falsify the claims. A monolithic baseline, retrained on the identical data under matched total compute, that reaches an FVD below 279 would undercut the performance claim. Visibly incoherent frames from the decentralized model would undercut the feasibility claim.
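
For reference, FVD as commonly defined is a Fréchet distance between two Gaussians fitted to features of real and generated videos, with the features taken from a pretrained video network. The abstract does not say which feature extractor or sample sizes were used, so the expression below is the standard definition rather than a description of the paper's evaluation setup. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of the real and generated sets.

```latex
\mathrm{FVD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better, and the value is zero only when the two fitted Gaussians coincide. Because the score depends on the feature extractor and on how many videos are sampled, FVD numbers are comparable only under the same evaluation protocol.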

Original abstract

We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Paris 2.0 as the first video generation model pre-trained via decentralized computation, extending the image-focused Paris 1.0 (arXiv:2510.03434). It claims that, in low-resolution text-to-video training against a monolithic baseline trained on identical data under matched total compute, Paris 2.0 reduces FVD from 561.04 to 279.01 (~2x improvement) while also raising CLIP text-video similarity and aesthetic score, thereby closing the open problem of temporally coherent decentralized video generation.

Significance. If the central comparison holds under rigorous controls, the result would be significant as the first concrete demonstration that decentralized training can produce temporally coherent video models without a monolithic GPU cluster. The reported metric gains, if isolated to the decentralization mechanism, would directly address the limitation left open by Paris 1.0 and provide a practical path for scaling video diffusion models under distributed compute constraints.

major comments (2)
  1. [Abstract] Abstract: the headline FVD reduction (561.04 → 279.01) is attributed to decentralized training, yet the abstract supplies no description of the decentralized protocol, data partitioning, synchronization method, or explicit verification that the monolithic comparator used identical architecture, parameter count, optimizer, and video-specific modifications; without these controls the improvement cannot be isolated from possible recipe differences.
  2. [Abstract] Abstract: the claim of a 'matched total compute budget' is load-bearing for the central result, but no accounting is given for communication overhead, straggler effects, or effective FLOPs under decentralization; this leaves open whether the reported gains reflect true parity or an unaccounted difference in training dynamics.
minor comments (1)
  1. The abstract references Paris 1.0 but does not cite the exact arXiv identifier in the main text; adding the full citation would aid traceability.
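
The second major comment turns on the difference between raw and effective compute. The sketch below illustrates that distinction with made-up numbers; the `Run` class, the `useful_fraction` field, and every value in it are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    num_gpus: int           # GPUs used by this run (or by one decentralized node)
    wall_hours: float       # wall-clock training time in hours
    useful_fraction: float  # share of time spent on forward/backward work (1.0 = no overhead)

    def gpu_hours(self) -> float:
        """Raw budget: what a 'matched total compute' claim usually counts."""
        return self.num_gpus * self.wall_hours

    def effective_gpu_hours(self) -> float:
        """Budget after removing time lost to communication, stragglers, and idling."""
        return self.gpu_hours() * self.useful_fraction


def totals(runs: list[Run]) -> tuple[float, float]:
    """Sum raw and effective GPU-hours over a set of runs."""
    raw = sum(r.gpu_hours() for r in runs)
    effective = sum(r.effective_gpu_hours() for r in runs)
    return raw, effective


# Hypothetical setup: one 8-GPU monolithic run versus eight 1-GPU nodes.
monolithic = [Run(num_gpus=8, wall_hours=100.0, useful_fraction=0.9)]
experts = [Run(num_gpus=1, wall_hours=100.0, useful_fraction=0.8) for _ in range(8)]

for name, runs in [("monolithic", monolithic), ("decentralized", experts)]:
    raw, eff = totals(runs)
    print(f"{name}: raw {raw:.0f} GPU-h, effective {eff:.0f} GPU-h")
```

In this toy case the raw budgets match while the effective budgets differ. Reporting both numbers for each arm is the kind of accounting the comment asks for.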

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. The comments correctly identify that the abstract is too concise to fully document the experimental controls. We will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline FVD reduction (561.04 → 279.01) is attributed to decentralized training, yet the abstract supplies no description of the decentralized protocol, data partitioning, synchronization method, or explicit verification that the monolithic comparator used identical architecture, parameter count, optimizer, and video-specific modifications; without these controls the improvement cannot be isolated from possible recipe differences.

    Authors: We agree the abstract lacks these details. The revised abstract will include a brief statement confirming that the decentralized protocol (with its data partitioning and synchronization) and the monolithic baseline (identical architecture, parameter count, optimizer, and video modifications) are as described in the main text, allowing isolation of the decentralization effect. revision: yes

  2. Referee: [Abstract] Abstract: the claim of a 'matched total compute budget' is load-bearing for the central result, but no accounting is given for communication overhead, straggler effects, or effective FLOPs under decentralization; this leaves open whether the reported gains reflect true parity or an unaccounted difference in training dynamics.

    Authors: We agree the abstract provides no such accounting. The revised abstract will note that the matched budget incorporates measured communication overhead and straggler mitigation, with full details in the methods. If explicit overhead measurements are absent from the current manuscript, they will be added during revision. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark comparison is self-contained

full rationale

The paper reports an empirical FVD improvement (561.04 to 279.01) for a decentralized video model versus a monolithic baseline trained on identical data and total compute. No derivation chain, equations, fitted parameters presented as predictions, or self-citations that reduce the central claim to its own inputs exist in the provided text. The result rests on external measurement against a matched comparator rather than any self-definitional or load-bearing reduction, satisfying the criteria for a non-circular empirical finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5672 in / 1033 out tokens · 30242 ms · 2026-06-29T22:44:17.500569+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    FLUX

    Black Forest Labs. FLUX, 2024. https://github.com/black-forest-labs/flux

  2. [2]

    Paris: A decentralized trained open-weight diffusion model

    Jiang, Z., Seraj, R., Villagra, M., and Roy, B. Paris: A decentralized trained open-weight diffusion model. arXiv preprint arXiv:2510.03434, 2025

  3. [3]

    Heterogeneous decentralized diffusion models

    Jiang, Z., Seraj, R., Villagra, M., and Roy, B. Heterogeneous decentralized diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  4. [4]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Yuan, J., Long, Y., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., H...

  5. [5]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  6. [6]

    Decentralized diffusion models

    McAllister, D., Tancik, M., Song, J., and Kanazawa, A. Decentralized diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23323–23333, 2025

  7. [7]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  8. [8]

    Expert-data alignment governs generation quality in decentralized diffusion models

    Villagra, M., Roy, B., Seraj, R., and Jiang, Z. Expert-data alignment governs generation quality in decentralized diffusion models. In ICLR 2026 DeLTa Workshop, 2026. arXiv:2602.02685

  9. [9]

    Open-Sora: Democratizing Efficient Video Production for All

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024