Paris 2.0: A Decentralized Diffusion Model for Video Generation
Pith reviewed 2026-06-29 22:44 UTC · model grok-4.3
The pith
Decentralized computation trains the first temporally coherent video diffusion model and halves FVD versus matched monolithic training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Paris 2.0 shows that a decentralized diffusion model can generate temporally coherent videos, closing the remaining gap after decentralized image training succeeded, and that this decentralized route yields roughly twofold better FVD, higher CLIP similarity, and higher aesthetic scores than a monolithic model trained on the same data under a matched total compute budget.
What carries the argument
The decentralized training recipe for diffusion models applied to video sequences, which coordinates parameter updates across distributed nodes to produce coherent frame sequences.
If this is right
- Video diffusion models can be pre-trained without access to a single large GPU cluster.
- Under matched total compute the decentralized approach delivers approximately 2x lower FVD on low-resolution text-to-video tasks.
- CLIP text-video similarity and aesthetic scores rise relative to the monolithic baseline.
- Temporally coherent generation is no longer blocked by the absence of centralized hardware.
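For readers unfamiliar with the metric behind these bullets, the standard definition of a Fréchet distance between two Gaussian feature distributions is sketched below. The symbols are assumptions introduced here for illustration: \mu_r, \Sigma_r are the mean and covariance of features extracted from real videos, and \mu_g, \Sigma_g are those of generated videos. FVD is commonly computed on features from a pretrained video network; the text above does not say which extractor the paper uses.

```latex
\mathrm{FVD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated feature distribution sits closer to the real one, which is why a drop in FVD counts as an improvement.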
Where Pith is reading between the lines
- Distributed compute pools could train larger video models by combining resources across many smaller nodes.
- The same coordination method might apply to other sequence-generation tasks that require cross-frame consistency.
- Open development of video models could proceed without dependence on concentrated institutional hardware.
Load-bearing premise
Temporally coherent video generation remains achievable when training occurs through decentralized computation without a monolithic GPU cluster.
What would settle it
Either of two outcomes would falsify the claims: a monolithic baseline, re-trained on the identical data under matched total compute, that reaches an FVD below 279 (falsifying the performance claim), or visibly incoherent frames from the decentralized model (falsifying the feasibility claim).
Original abstract
We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.
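The "~2.0x" figure can be checked directly from the two FVD values quoted in the abstract. The short sketch below does only that arithmetic; the function names are illustrative and not taken from the paper.

```python
# FVD values quoted in the abstract; lower is better.
FVD_MONOLITHIC = 561.04
FVD_DECENTRALIZED = 279.01


def improvement_factor(baseline: float, candidate: float) -> float:
    """Ratio of baseline FVD to candidate FVD (above 1 favors the candidate)."""
    return baseline / candidate


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline FVD removed by the candidate."""
    return (baseline - candidate) / baseline


factor = improvement_factor(FVD_MONOLITHIC, FVD_DECENTRALIZED)
reduction = relative_reduction(FVD_MONOLITHIC, FVD_DECENTRALIZED)

print(f"improvement factor: {factor:.2f}x")   # improvement factor: 2.01x
print(f"relative reduction: {reduction:.1%}")  # relative reduction: 50.3%
```

Both ways of stating the result agree: a factor of about 2.01 corresponds to cutting FVD by roughly half.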
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Paris 2.0 as the first video generation model pre-trained via decentralized computation, extending the image-focused Paris 1.0 (arXiv:2510.03434). It claims that, in low-resolution text-to-video training against a monolithic baseline trained on identical data under matched total compute, Paris 2.0 reduces FVD from 561.04 to 279.01 (~2x improvement) while also raising CLIP text-video similarity and aesthetic score, thereby closing the open problem of temporally coherent decentralized video generation.
Significance. If the central comparison holds under rigorous controls, the result would be significant as the first concrete demonstration that decentralized training can produce temporally coherent video models without a monolithic GPU cluster. The reported metric gains, if isolated to the decentralization mechanism, would directly address the limitation left open by Paris 1.0 and provide a practical path for scaling video diffusion models under distributed compute constraints.
major comments (2)
- [Abstract] The headline FVD reduction (561.04 → 279.01) is attributed to decentralized training, yet the abstract supplies no description of the decentralized protocol, data partitioning, synchronization method, or explicit verification that the monolithic comparator used identical architecture, parameter count, optimizer, and video-specific modifications; without these controls the improvement cannot be isolated from possible recipe differences.
- [Abstract] The claim of a 'matched total compute budget' is load-bearing for the central result, but no accounting is given for communication overhead, straggler effects, or effective FLOPs under decentralization; this leaves open whether the reported gains reflect true parity or an unaccounted difference in training dynamics.
minor comments (1)
- The abstract references Paris 1.0 but does not cite the exact arXiv identifier in the main text; adding the full citation would aid traceability.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract. The comments correctly identify that the abstract is too concise to fully document the experimental controls. We will revise the abstract accordingly.
- Referee: [Abstract] The headline FVD reduction (561.04 → 279.01) is attributed to decentralized training, yet the abstract supplies no description of the decentralized protocol, data partitioning, synchronization method, or explicit verification that the monolithic comparator used identical architecture, parameter count, optimizer, and video-specific modifications; without these controls the improvement cannot be isolated from possible recipe differences.
  Authors: We agree the abstract lacks these details. The revised abstract will include a brief statement confirming that the decentralized protocol (with its data partitioning and synchronization) and the monolithic baseline (identical architecture, parameter count, optimizer, and video modifications) are as described in the main text, allowing isolation of the decentralization effect. revision: yes
- Referee: [Abstract] The claim of a 'matched total compute budget' is load-bearing for the central result, but no accounting is given for communication overhead, straggler effects, or effective FLOPs under decentralization; this leaves open whether the reported gains reflect true parity or an unaccounted difference in training dynamics.
  Authors: We agree the abstract provides no such accounting. The revised abstract will note that the matched budget incorporates measured communication overhead and straggler mitigation, with full details in the methods. If explicit overhead measurements are absent from the current manuscript, they will be added during revision. revision: yes
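The distinction the referee draws between a nominally matched budget and matched effective compute can be illustrated with a small accounting sketch. Everything here is an assumption made for illustration: the `TrainingRun` fields, the 5% tolerance, and all numbers are placeholders, not figures or methods from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainingRun:
    """Hypothetical description of one training run; every field is illustrative."""
    name: str
    num_gpus: int
    wall_clock_hours: float
    peak_tflops_per_gpu: float
    utilization: float  # fraction of peak throughput actually achieved, in [0, 1]

    def nominal_gpu_hours(self) -> float:
        """GPU-hours as billed, ignoring idle time from communication or stragglers."""
        return self.num_gpus * self.wall_clock_hours

    def effective_tflop_hours(self) -> float:
        """Useful compute after discounting by achieved utilization."""
        return self.nominal_gpu_hours() * self.peak_tflops_per_gpu * self.utilization


def is_matched(a: float, b: float, rel_tol: float = 0.05) -> bool:
    """Treat two budgets as matched if they agree within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


# Placeholder numbers, NOT taken from the paper.
monolithic = TrainingRun("monolithic", 8, 100.0, 300.0, 0.45)
decentralized = TrainingRun("decentralized", 8, 100.0, 300.0, 0.40)

nominal_ok = is_matched(monolithic.nominal_gpu_hours(),
                        decentralized.nominal_gpu_hours())
effective_ok = is_matched(monolithic.effective_tflop_hours(),
                          decentralized.effective_tflop_hours())

print(f"nominal GPU-hours matched:   {nominal_ok}")
print(f"effective compute matched:   {effective_ok}")
```

With these placeholder values the two runs match on GPU-hours but not on effective compute, which is the kind of gap an explicit overhead accounting in the revised manuscript would rule in or out.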
Circularity Check
No circularity; empirical benchmark comparison is self-contained
The paper reports an empirical FVD improvement (561.04 to 279.01) for a decentralized video model versus a monolithic baseline trained on identical data and total compute. The provided text contains no derivation chain, no equations, no fitted parameters presented as predictions, and no self-citations that reduce the central claim to its own inputs. The result rests on external measurement against a matched comparator rather than on any self-definitional or load-bearing reduction, so it meets the criteria for a non-circular empirical finding.
Reference graph
Works this paper leans on
- [1] Black Forest Labs. FLUX, 2024. https://github.com/black-forest-labs/flux
- [2] Jiang, Z., Seraj, R., Villagra, M., and Roy, B. Paris: A decentralized trained open-weight diffusion model. arXiv preprint arXiv:2510.03434, 2025.
- [3] Jiang, Z., Seraj, R., Villagra, M., and Roy, B. Heterogeneous decentralized diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
- [4] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Yuan, J., Long, Y., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., H... HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv, 2024.
- [5] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [6] McAllister, D., Tancik, M., Song, J., and Kanazawa, A. Decentralized diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23323–23333, 2025.
- [7] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [8] Villagra, M., Roy, B., Seraj, R., and Jiang, Z. Expert-data alignment governs generation quality in decentralized diffusion models. In ICLR 2026 DeLTa Workshop, 2026. arXiv:2602.02685.
- [9] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.