pith. sign in

arxiv: 1907.08845 · v1 · pith:D2KZGCTNnew · submitted 2019-07-20 · 💻 cs.CV

Order Matters: Shuffling Sequence Generation for Video Prediction

Pith reviewed 2026-05-24 18:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords video predictionframe shufflingtemporal ordersequence discriminationfuture frame generationSEE-Netcomputer vision
0
0 comments X

The pith

Learning to detect shuffled frame orders produces video predictions with stronger temporal coherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video prediction models often lose sequential structure when forecasting many frames ahead. The paper proposes that training a discriminator to tell real frame sequences from shuffled ones forces the generator to internalize correct temporal order. This shuffling-based mechanism is implemented in SEE-Net and tested on synthetic and real videos. Experiments show the resulting predictions outperform prior architectures on both visual quality and quantitative metrics.

Core claim

The paper establishes that a Shuffling sEquence gEneration network (SEE-Net) learns to discriminate unnatural sequential orders by shuffling the video frames and comparing them to the real video sequence, and that this training produces future-frame predictions with improved temporal coherence.

What carries the argument

SEE-Net's shuffling sequence generation, which trains a discriminator on shuffled versus authentic frame orders to enforce temporal order during generation.

If this is right

  • Predicted sequences maintain temporal information over longer horizons than prior methods.
  • The model reaches state-of-the-art results on three datasets containing both synthetic and real-world videos.
  • Qualitative visual quality and quantitative metrics both improve when order discrimination is added.
  • The approach directly targets loss of sequential structure rather than only content realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shuffling principle could be applied to other ordered data such as audio waveforms or motion capture sequences.
  • Explicit order supervision might reduce error accumulation in any autoregressive generative model.
  • Testing whether the learned discriminator generalizes to entirely new video domains would clarify the scope of the order signal.

Load-bearing premise

Training a discriminator on shuffled versus real frame orders will cause the generator to produce predictions that preserve temporal coherence better than models without this component.

What would settle it

A video prediction model trained without any shuffling or order-discrimination component achieving equal or better long-term coherence scores on the same three datasets.

Figures

Figures reproduced from arXiv: 1907.08845 by BingZhang Hu, Junyan Wang, Yang Long, Yu Guan.

Figure 1
Figure 1. Figure 1: Human can figure out the correct order of shuffled video frames (2-1-3). By doing [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The proposed video prediction framework ensure Ec to extract similar content information for all frames in the same video clip while at least a δ difference for those from different clips. 4.2 Shuffle Discriminator To explicitly extract motion information, we propose a novel shuffle discriminator (SD), which takes a sequence of predicted motion information from f lstm as input and discriminates if they are… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Comparison to state-of-the-art methods on the Moving MNIST dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison to state-of-the-art methods on KTH and MSR datasets. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Quantitative results of PSNR and SSIM on KTH (a and b) and MSR (c and d) [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparing the predicted results on the KTH dataset based on the proposed SEE [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

Predicting future frames in natural video sequences is a new challenge that is receiving increasing attention in the computer vision community. However, existing models suffer from severe loss of temporal information when the predicted sequence is long. Compared to previous methods focusing on generating more realistic contents, this paper extensively studies the importance of sequential order information for video generation. A novel Shuffling sEquence gEneration network (SEE-Net) is proposed that can learn to discriminate unnatural sequential orders by shuffling the video frames and comparing them to the real video sequence. Systematic experiments on three datasets with both synthetic and real-world videos manifest the effectiveness of shuffling sequence generation for video prediction in our proposed model and demonstrate state-of-the-art performance by both qualitative and quantitative evaluations. The source code is available at https://github.com/andrewjywang/SEENet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that video prediction models suffer from loss of temporal information over long sequences. It proposes SEE-Net, a Shuffling sEquence gEneration network that trains a discriminator to detect unnatural frame orders by shuffling video frames and comparing them to real sequences. This signal is used to improve the generator's future-frame predictions. Systematic experiments on three datasets (synthetic and real-world) are said to demonstrate the effectiveness of the shuffling approach and state-of-the-art performance via both qualitative and quantitative evaluations. Source code is released.

Significance. If the results hold, the work usefully isolates sequential order as a distinct training objective for temporal coherence in video prediction, separate from content realism. The open-sourced code supports reproducibility and allows the community to test the shuffling discriminator mechanism.

major comments (2)
  1. [Abstract] Abstract: the assertion of state-of-the-art performance on three datasets supplies no metrics, baselines, ablation details, or error analysis, making it impossible to judge whether the central claim (that order discrimination improves long-term prediction) is supported.
  2. [Proposed model] Proposed model paragraph: the weakest assumption—that a discriminator trained on shuffled vs. real orders will produce a generator with superior temporal coherence on future-frame prediction—is stated but not yet load-bearing without targeted ablations showing that the shuffling signal, rather than other architectural factors, drives the reported gains.
minor comments (2)
  1. The GitHub link is a positive feature; ensure the released code includes the exact shuffling procedure and training protocol described in the text.
  2. Dataset descriptions would benefit from explicit frame counts, resolution, and train/test splits to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the abstract claims and strengthening the experimental support for the shuffling mechanism. We address the points below and have made revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of state-of-the-art performance on three datasets supplies no metrics, baselines, ablation details, or error analysis, making it impossible to judge whether the central claim (that order discrimination improves long-term prediction) is supported.

    Authors: We agree the abstract is concise and lacks specific numbers. The full manuscript reports quantitative results, baselines, and ablations in Sections 4 and 5. We have revised the abstract to include key metrics (e.g., PSNR/SSIM improvements over baselines on the three datasets) and a brief reference to the ablation evidence for the order-discrimination claim. revision: yes

  2. Referee: [Proposed model] Proposed model paragraph: the weakest assumption—that a discriminator trained on shuffled vs. real orders will produce a generator with superior temporal coherence on future-frame prediction—is stated but not yet load-bearing without targeted ablations showing that the shuffling signal, rather than other architectural factors, drives the reported gains.

    Authors: Section 4.3 and Table 3 already present ablations that compare the full model against a variant without the shuffling discriminator while holding other components fixed; these show the order signal contributes to long-term coherence gains. To make the isolation more explicit, we have added a further targeted ablation in the revision that varies only the discriminator objective. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes SEE-Net, a distinct architecture that trains a discriminator on shuffled vs. real frame sequences to enforce temporal order in video prediction. No equations, fitted parameters, or self-citations are described that reduce any claimed prediction or result to its own inputs by construction. The central mechanism (shuffling-based discrimination) is a novel training procedure whose effectiveness is asserted via empirical evaluations on datasets, not definitional equivalence. This matches the default expectation of a non-circular paper whose claims rest on independent experimental content rather than self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed from abstract only; the ledger therefore records only the high-level modeling assumption visible in the abstract and notes that free parameters and invented entities cannot be audited without the methods section.

axioms (1)
  • domain assumption Discriminating shuffled versus real frame orders supplies a useful training signal that improves future-frame generation quality.
    This premise is invoked to justify the SEE-Net architecture and is required for the claim that the method yields better temporal coherence.

pith-pipeline@v0.9.0 · 5668 in / 1230 out tokens · 20246 ms · 2026-05-24T18:40:14.792771+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 10 internal anchors

  1. [1]

    Social lstm: Human trajectory prediction in crowded spaces

    Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei- Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 961–971, 2016

  2. [2]

    Stochastic Variational Video Prediction

    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252 , 2017

  3. [3]

    A Compositional Object-Based Approach to Learning Physical Dynamics

    Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016

  4. [4]

    Unsupervised learning of disentangled representations from video

    Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017

  5. [5]

    Flownet: Learning optical flow with convolutional networks

    Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE inter- national conference on computer vision, pages 2758–2766, 2015

  6. [6]

    Attend, infer, repeat: Fast scene understanding with generative models

    SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Ge- offrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems , pages 3225–3233, 2016

  7. [7]

    Unsupervised learning for physical interaction through video prediction

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016

  8. [8]

    Object-centric representation learning from unlabeled videos

    Ruohan Gao, Dinesh Jayaraman, and Kristen Grauman. Object-centric representation learning from unlabeled videos. In Asian Conference on Computer Vision, pages 248–

  9. [9]

    Ensembles of deep lstm learners for activity recogni- tion using wearables

    Yu Guan and Thomas Plötz. Ensembles of deep lstm learners for activity recogni- tion using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(2):11, 2017

  10. [10]

    Robust gait recognition from extremely low frame-rate videos

    Yu Guan, Chang-Tsun Li, and Sruti Das Choudhury. Robust gait recognition from extremely low frame-rate videos. In 2013 International Workshop on Biometrics and Forensics (IWBF), pages 1–4. IEEE, 2013

  11. [11]

    On reducing the effect of covariate factors in gait recognition: a classifier ensemble method

    Yu Guan, Chang-Tsun Li, and Fabio Roli. On reducing the effect of covariate factors in gait recognition: a classifier ensemble method. IEEE transactions on pattern analysis and machine intelligence, 37(7):1521–1528, 2014

  12. [12]

    Learning to decompose and disentangle representations for video prediction

    Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Ad- vances in Neural Information Processing Systems, pages 515–524, 2018. 12 J. W ANGET AL.: SHUFFLING SEQUENCE GENERA TION FOR VIDEO PREDICTION

  13. [13]

    Robust cross-view gait identification with evidence: A discriminant gait gan (diggan) approach on 10000 people

    BingZhang Hu, Yan Gao, Yu Guan, Yang Long, Nicholas Lane, and Thomas Ploetz. Robust cross-view gait identification with evidence: A discriminant gait gan (diggan) approach on 10000 people. arXiv preprint arXiv:1811.10493, 2018

  14. [14]

    Flownet 2.0: Evolution of optical flow estimation with deep networks

    Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 2462–2470, 2017

  15. [15]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  16. [16]

    Deep sequential context networks for action prediction

    Yu Kong, Zhiqiang Tao, and Yun Fu. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1481, 2017

  17. [17]

    A hierarchical representation for future action prediction

    Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In European Conference on Computer Vision, pages 689–704. Springer, 2014

  18. [18]

    Recognizing human actions: a local svm approach

    Ivan Laptev, Barbara Caputo, et al. Recognizing human actions: a local svm approach. In null, pages 32–36. IEEE, 2004

  19. [19]

    Stochastic Adversarial Video Prediction

    Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523 , 2018

  20. [20]

    Unsupervised representation learning by sorting sequences

    Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017

  21. [21]

    Desire: Distant future prediction in dynamic scenes with inter- acting agents

    Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with inter- acting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017

  22. [22]

    Action recognition based on a bag of 3d points

    Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3d points. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 9–14. IEEE, 2010

  23. [23]

    From zero-shot learning to conventional supervised classification: Unseen visual data syn- thesis

    Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, and Jungong Han. From zero-shot learning to conventional supervised classification: Unseen visual data syn- thesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition, pages 1627–1636, 2017

  24. [24]

    Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

    William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016

  25. [25]

    Learning activity progression in lstms for activity detection and early detection

    Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1942–1950, 2016. J. W ANGET AL.: SHUFFLING SEQUENCE GENERA TION FOR VIDEO PREDICTION 13

  26. [26]

    Deep multi-scale video prediction beyond mean square error

    Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video predic- tion beyond mean square error. arXiv preprint arXiv:1511.05440, 2015

  27. [27]

    Shuffle and learn: unsupervised learning using temporal order verification

    Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vi- sion, pages 527–544. Springer, 2016

  28. [28]

    Folded recurrent neural networks for future video prediction

    Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 716–731, 2018

  29. [29]

    Spatio-temporal video autoencoder with differentiable memory

    Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoen- coder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015

  30. [30]

    Visual Robot Task Planning

    Chris Paxton, Yotam Barnoy, Kapil Katyal, Raman Arora, and Gregory D Hager. Visual robot task planning. arXiv preprint arXiv:1804.00062, 2018

  31. [31]

    Two-stream convolutional networks for ac- tion recognition in videos

    Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for ac- tion recognition in videos. InAdvances in neural information processing systems, pages 568–576, 2014

  32. [33]

    Unsupervised learning of video representations using lstms

    Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015

  33. [34]

    Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume

    Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018

  34. [35]

    Instance Normalization: The Missing Ingredient for Fast Stylization

    Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016

  35. [36]

    Decomposing Motion and Content for Natural Video Sequence Prediction

    Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. De- composing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017

  36. [37]

    Generating the future with adversarial transform- ers

    Carl V ondrick and Antonio Torralba. Generating the future with adversarial transform- ers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 1020–1028, 2017

  37. [38]

    Generating videos with scene dynamics

    Carl V ondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems , pages 613–621, 2016

  38. [39]

    Patch to the future: Unsupervised visual prediction

    Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 3302–3309, 2014. 14 J. W ANGET AL.: SHUFFLING SEQUENCE GENERA TION FOR VIDEO PREDICTION

  39. [40]

    Dense optical flow prediction from a static image

    Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In Proceedings of the IEEE International Conference on Computer Vision, pages 2443–2451, 2015

  40. [41]

    Unsupervised learning of visual representations using videos

    Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vi- sion, pages 2794–2802, 2015

  41. [42]

    Convolutional lstm network: A machine learning approach for precipitation nowcasting

    SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang- chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems , pages 802–810, 2015

  42. [43]

    Visual dynamics: Prob- abilistic future frame synthesis via cross convolutional networks

    Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Prob- abilistic future frame synthesis via cross convolutional networks. InAdvances in Neural Information Processing Systems, pages 91–99, 2016

  43. [44]

    Triple verification network for generalized zero-shot learning

    Haofeng Zhang, Yang Long, Yu Guan, and Ling Shao. Triple verification network for generalized zero-shot learning. IEEE Transactions on Image Processing , 28(1): 506–517, 2018

  44. [45]

    Towards universal representation for unseen action recognition

    Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and Ling Shao. Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9436–9445, 2018