arxiv: 1906.11415 · v1 · pith:SKT5VJASnew · submitted 2019-06-27 · 💻 cs.CV

Few-Shot Video Classification via Temporal Alignment

Pith reviewed 2026-05-25 15:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot · temporal · video · alignment · learning · model · novel · classification

The pith

A temporal alignment module improves few-shot video classification by respecting frame order in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Temporal Alignment Module (TAM) that computes the distance between a query video and each class's examples by averaging per-frame distances along an alignment path. Prior few-shot video methods largely ignored long-term temporal ordering; TAM uses it explicitly, and a continuous relaxation makes the alignment differentiable so the whole system trains end-to-end on the few-shot objective. The paper reports higher accuracy on Kinetics and Something-Something-V2 when only a few labeled videos per class are available. This matters because video carries natural sequence structure that could reduce the number of examples needed to recognize new actions or events.

Core claim

The central claim is that explicitly leveraging temporal ordering information through temporal alignment produces strong data-efficiency for few-shot video classification; TAM calculates the distance of a query video to novel-class proxies by averaging the per-frame distances along its alignment path, with continuous relaxation enabling end-to-end optimization that yields significant gains over baselines on real-world datasets.
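
One way to write the distance described above, in notation that is assumed here for illustration and does not come from the paper: let $q_1,\dots,q_T$ and $s_1,\dots,s_{T'}$ be the frames of the query and of a support video, $f$ the frame embedding, and $d$ a frame-level distance. Choosing the path by minimum cumulative cost, as in dynamic time warping, is also an assumption.

```latex
\[
D_{ij} = d\bigl(f(q_i),\, f(s_j)\bigr), \qquad
P^{*} = \arg\min_{P \in \mathcal{P}} \sum_{(i,j) \in P} D_{ij}, \qquad
\phi(q, s) = \frac{1}{|P^{*}|} \sum_{(i,j) \in P^{*}} D_{ij}
\]
```

Here $\mathcal{P}$ is the set of monotone paths through the $T \times T'$ distance matrix. A continuous relaxation would replace the hard $\min$ with a smooth surrogate so that $\phi$ has useful gradients with respect to $f$.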

What carries the argument

Temporal Alignment Module (TAM), which averages per-frame distances along a continuous relaxation of the alignment path to produce class distances while preserving temporal order.
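
A minimal sketch of a relaxed alignment score, assuming a standard soft-DTW recursion with a log-sum-exp soft minimum and cosine frame distances. The function names and the smoothing value are illustrative, and the sketch returns the cumulative relaxed cost without path-length normalization, so it is not claimed to match the paper's exact formulation.

```python
import math


def smooth_min(values, lam):
    """Soft minimum -lam * log(sum(exp(-v / lam))), shifted for numerical stability."""
    m = min(values)
    return m - lam * math.log(sum(math.exp(-(v - m) / lam) for v in values))


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def soft_alignment_score(query, support, lam=0.1):
    """Relaxed DTW cumulative cost between two sequences of frame features."""
    n, m = len(query), len(support)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], support[j - 1])
            # Predecessors: diagonal match, skip a query frame, skip a support frame.
            prev = [acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]]
            acc[i][j] = d + smooth_min(prev, lam)
    return acc[n][m]


# Toy frame features (hypothetical): the same clip in order and reversed.
clip = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same_order = soft_alignment_score(clip, clip)
reversed_order = soft_alignment_score(clip, clip[::-1])
```

On this toy input the in-order pair scores lower than the reversed pair, which is the ordering sensitivity the module is meant to provide.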

If this is right

  • Significant accuracy gains on Kinetics and Something-Something-V2 in few-shot regimes.
  • End-to-end training directly optimizes the few-shot classification objective.
  • Explicit use of long-term temporal ordering that prior methods neglected.
  • Improved data-efficiency when only a few labeled videos per novel class are supplied.
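
The end-to-end objective in the list above can be sketched as a softmax over negated alignment scores followed by cross-entropy. Negating the scores, the helper names, and the 5-way example values are assumptions for illustration; the source only states that a softmax is applied over the alignment score of each novel class.

```python
import math


def class_probabilities(alignment_scores):
    """Softmax over negated scores, so a smaller alignment distance gets a higher probability."""
    logits = [-s for s in alignment_scores]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def episode_loss(alignment_scores, true_class):
    """Cross-entropy of the true class for one query video in one episode."""
    return -math.log(class_probabilities(alignment_scores)[true_class])


# Hypothetical alignment scores of one query against 5 novel-class proxies.
scores = [0.8, 0.2, 1.1, 0.9, 1.0]
probs = class_probabilities(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because every step is differentiable when the alignment score itself is relaxed, the loss can in principle be backpropagated into the frame embedding network.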

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment idea could be tested on other ordered data such as audio clips or motion-capture sequences.
  • If alignment proves robust, it might reduce reliance on large-scale pretraining for video tasks.
  • A natural extension would measure whether the gains hold when frame sampling rates or video lengths vary widely.
  • Alignment artifacts might appear most clearly on datasets where actions are defined more by object appearance than by sequence.
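
To probe the sampling-rate question raised in the list above, videos of different lengths would first need to be reduced to a fixed number of frames. A minimal sketch of one common choice, taking the centre frame of each equal-length segment, follows. The function name and the choice of 8 frames are illustrative assumptions, not the paper's sampling protocol.

```python
def sample_frame_indices(num_frames, num_samples):
    """Return the centre-frame index of each of num_samples equal-length segments."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((k + 0.5) * num_frames / num_samples))
        for k in range(num_samples)
    ]


long_clip = sample_frame_indices(80, 8)   # sparse sampling of a long video
short_clip = sample_frame_indices(5, 8)   # short video: some frames repeat
```

Sweeping `num_samples` while holding everything else fixed would be one way to check whether alignment-based gains survive coarser or denser sampling.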

Load-bearing premise

The method assumes that averaging distances along an alignment path reliably improves classification accuracy and does not introduce artifacts from imperfect alignments or frame sampling choices.

What would settle it

An experiment in which a non-aligned baseline (identical architecture but without the alignment-path averaging) matches or exceeds TAM accuracy on Kinetics and Something-Something-V2 under identical few-shot protocols would falsify the claimed benefit of the temporal component.
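
The comparison proposed above can be illustrated on toy data. The sketch below contrasts an order-agnostic baseline (distance between temporally averaged features) with a plain, non-relaxed DTW distance. Both functions, the squared Euclidean frame distance, and the two-dimensional features are assumptions for illustration and are not the paper's baselines.

```python
def frame_distance(u, v):
    """Squared Euclidean distance between two frame feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pooled_distance(query, support):
    """Order-agnostic baseline: compare the temporal means of the two videos."""
    def mean(frames):
        return [sum(col) / len(frames) for col in zip(*frames)]
    return frame_distance(mean(query), mean(support))


def dtw_distance(query, support):
    """Discrete dynamic time warping: minimum cumulative cost over monotone paths."""
    n, m = len(query), len(support)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], support[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[n][m]


# A hypothetical action and the same frames played backwards.
forward = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
backward = forward[::-1]

pooled = pooled_distance(forward, backward)
aligned = dtw_distance(forward, backward)
```

The pooled baseline cannot tell the two clips apart, while the alignment distance can. On real data the question is whether that extra sensitivity translates into accuracy, which is what the proposed experiment would measure.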

Figures

Figures reproduced from arXiv: 1906.11415 by Chien-Yi Chang, Jingwei Ji, Juan Carlos Niebles, Kaidi Cao, Zhangjie Cao.

Figure 1: Our few-shot video classification setting. Pairs of semantically matched frames are connected with a blue dashed line. The arrows show the direction of the temporal alignment path.
… deal with scarce training data for previously unseen classes across different episodes. While the majority of recent few-shot learning works focus on image classification, adapting it to video data is not a trivial extension. V…
Figure 2: Overview of our method. We first extract per-frame deep features using the embedding network. We then compute the distance matrices between the query video and videos in the support set. Next, an alignment score is computed from the matrix representation. Finally, we apply a softmax operator over the alignment score of each novel class.
… cal flow sequences. By factorizing 3D convolutional filters into separa…
Figure 3: Methods for calculating alignment score. Each subplot shows a distance matrix. The darker the color of an entry, the smaller the distance between the corresponding pair of frames. Entries with a green border contribute to the final alignment score.
… a path aligning the two videos from start to end, we allow the algorithm to find a path with flexible starting and ending points, while …
Figure 4: Visualization of our learning results. Comparison of our matched results with CMN's matched results in an episode. Although the averaged score is quite high given the false matching and the query image, our algorithm is able to find the correct alignment path that minimizes the alignment score, which ultimately results in the correct prediction.
… training to simulate the few-shot setting at meta-train stage to direct…
Figure 5: Smoothing factor sensitivity. We compare the effect of using different smoothing factors.
… ing parameter λ. Previous works [5, 22] have shown that using λ empirically helps optimization in many tasks. Intuitively, a smaller λ functions more like the min operation and a larger λ means a heavier smoothing effect over the values in nearby positions. We experimented on λ within the value set of [0.01, 0.05, 0.…
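
The effect of λ mentioned in the Figure 5 text can be seen directly on a soft minimum. The sketch assumes the log-sum-exp form of the relaxation. The values 0.01 and 0.05 appear in the source, while 0.5 and the three candidate costs are assumptions added for contrast.

```python
import math


def smooth_min(values, lam):
    """Soft minimum -lam * log(sum(exp(-v / lam))), shifted for numerical stability."""
    m = min(values)
    return m - lam * math.log(sum(math.exp(-(v - m) / lam) for v in values))


# Hypothetical costs of the three predecessor cells at one step of the recursion.
costs = [0.30, 0.35, 0.90]

for lam in (0.01, 0.05, 0.5):
    print(f"lambda={lam}: smooth_min={smooth_min(costs, lam):.4f}, hard min={min(costs):.4f}")
```

Small λ tracks the hard minimum closely, and larger λ blends the neighbouring costs, pulling the value below the hard minimum.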
Original abstract

There is a growing interest in learning a model which could recognize novel classes with only a few labeled examples. In this paper, we propose Temporal Alignment Module (TAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous works neglect long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video data through temporal alignment. This leads to strong data-efficiency for few-shot learning. Concretely, TAM calculates the distance value of a query video with respect to novel class proxies by averaging the per-frame distances along its alignment path. We introduce continuous relaxation to TAM so the model can be learned in an end-to-end fashion to directly optimize the few-shot learning objective. We evaluate TAM on two challenging real-world datasets, Kinetics and Something-Something-V2, and show that our model leads to significant improvement of few-shot video classification over a wide range of competitive baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the Temporal Alignment Module (TAM) as a few-shot video classification framework. It claims that most prior methods neglect long-term temporal ordering, whereas TAM explicitly computes query-to-class distances by averaging per-frame distances along an alignment path; a continuous relaxation renders the module differentiable for end-to-end optimization of the few-shot objective. Empirical evaluation on Kinetics and Something-Something-V2 is said to yield significant gains over competitive baselines and improved data efficiency.

Significance. If the reported gains are reproducible and attributable to the ordering-preserving alignment rather than relaxation artifacts, the work would usefully extend few-shot video methods by supplying an explicit mechanism for temporal structure. The choice of two challenging real-world datasets is appropriate for the claim.

major comments (2)
  1. [Abstract] The central empirical claim ('significant improvement ... over a wide range of competitive baselines' and 'strong data-efficiency') is stated without any quantitative values for the baselines, metrics, effect sizes, number of shots, or statistical significance; this information is load-bearing for assessing whether the temporal-alignment mechanism actually delivers the asserted benefit.
  2. [Abstract, TAM description] The method rests on averaging distances along a continuously relaxed alignment path, yet no analytic bound, sensitivity analysis, or ablation is referenced that would demonstrate the relaxed path remains faithful to discrete ordering under realistic frame-sampling variation; without such verification the observed gains could be artifacts of the relaxation rather than evidence for the ordering signal the paper claims to exploit.
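
For context on the relaxation concern, a standard property of the log-sum-exp soft minimum bounds how far it can drift from the hard minimum. This is a textbook inequality, not an analysis from the paper, and it assumes the relaxation takes the log-sum-exp form with smoothing parameter $\lambda > 0$ over $n$ candidate values $a_1, \dots, a_n$:

```latex
\[
\min_k a_k \;-\; \lambda \log n
\;\le\;
-\lambda \log \sum_{k=1}^{n} \exp\!\left(-\frac{a_k}{\lambda}\right)
\;\le\;
\min_k a_k
\]
```

With $n = 3$ predecessors per cell in a DTW-style recursion, each step deviates from the hard minimum by at most $\lambda \log 3$. The inequality bounds the score only; it says nothing about whether the relaxed alignment stays faithful under different frame-sampling rates, which is the part of the comment that would still need an experiment.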

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the TAM formulation. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

  1. Referee: [Abstract] The central empirical claim ('significant improvement ... over a wide range of competitive baselines' and 'strong data-efficiency') is stated without any quantitative values for the baselines, metrics, effect sizes, number of shots, or statistical significance; this information is load-bearing for assessing whether the temporal-alignment mechanism actually delivers the asserted benefit.

    Authors: We agree that the abstract would benefit from concrete quantitative anchors. In the revised version we will insert the key reported numbers (e.g., 5-shot and 1-shot top-1 accuracies on Kinetics and Something-Something-V2 together with the absolute gains over the strongest baselines) while remaining within the word limit. revision: yes

  2. Referee: [Abstract, TAM description] The method rests on averaging distances along a continuously relaxed alignment path, yet no analytic bound, sensitivity analysis, or ablation is referenced that would demonstrate the relaxed path remains faithful to discrete ordering under realistic frame-sampling variation; without such verification the observed gains could be artifacts of the relaxation rather than evidence for the ordering signal the paper claims to exploit.

    Authors: The full manuscript already contains ablations that replace the continuous relaxation with discrete DTW and with random alignments, showing that the ordering signal is responsible for the gains. Nevertheless, we acknowledge that an explicit sensitivity study under varied frame sampling rates is not presented. We will add this analysis (both quantitative tables and qualitative alignment visualizations) to the supplementary material. revision: partial

Circularity Check

0 steps flagged

No circularity; TAM is an independent empirical addition

The paper's central derivation introduces TAM as a new module that averages per-frame distances along an alignment path and applies continuous relaxation for differentiability. No provided equations, self-citations, or claims reduce the claimed data-efficiency gains to a quantity defined by the paper's own fitted inputs or prior self-work by construction. The improvement is presented as arising from the added temporal alignment mechanism and is evaluated on external datasets (Kinetics, Something-Something-V2), rendering the chain self-contained against benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; TAM is introduced as a new module whose internal formulation is not detailed there.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

  [1] The 20bn-jester dataset v1. https://20bn.com/datasets/jester.
  [2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997.
  [3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  [5] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. arXiv preprint arXiv:1901.02598, 2019.
  [6] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
  [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
  [8] P. Dogan, B. Li, L. Sigal, and M. Gross. A neural multi-sequence alignment technique (neumatch). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8749–8758, 2018.
  [9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017.
  [10] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2017.
  [11] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
  [12] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 2, page 8, 2017.
  [13] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.
  [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  [15] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129,
  [16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  [17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  [18] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.
  [19] O. Kliper-Gross, T. Hassner, and L. Wolf. One shot similarity metric learning for action recognition. In International Workshop on Similarity-Based Pattern Recognition, pages 31–45. Springer, 2011.
  [20] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  [21] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. arXiv preprint arXiv:1811.08383,
  [22] A. Mensch and M. Blondel. Differentiable dynamic programming for structured prediction and attention. ICML,
  [23] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018.
  [24] M. Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
  [25] T. Munkhdalai and H. Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2554–2563. JMLR.org, 2017.
  [26] A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm (canonical title: On First-Order Meta-Learning Algorithms). arXiv preprint arXiv:1803.02999, 2018.
  [27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  [28] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5822–5830, 2018.
  [29] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
  [30] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
  [31] A. Richard, H. Kuehne, A. Iqbal, and J. Gall. Neuralnetwork-viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7386–7395, 2018.
  [32] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960,
  [33] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM international conference on Multimedia, pages 357–360. ACM, 2007.
  [34] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
  [35] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  [36] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  [37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  [38] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638,
  [40] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013.
  [41] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
  [42] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  [43] X. Wang and A. Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.
  [44] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.
  [45] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
  [46] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
  [47] L. Zhu and Y. Yang. Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–766,
  [48] M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 695–712, 2018.