Few-Shot Video Classification via Temporal Alignment

Chien-Yi Chang; Jingwei Ji; Juan Carlos Niebles; Kaidi Cao; Zhangjie Cao

arxiv: 1906.11415 · v1 · pith:SKT5VJASnew · submitted 2019-06-27 · 💻 cs.CV

Kaidi Cao, Jingwei Ji, Zhangjie Cao, Chien-Yi Chang, Juan Carlos Niebles

Pith reviewed 2026-05-25 15:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords: few-shot, temporal, video, alignment, learning, model, novel, classification

The pith

A temporal alignment module improves few-shot video classification by respecting frame order in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Temporal Alignment Module that computes distances between a query video and class examples by averaging per-frame distances along an alignment path. Prior few-shot video methods largely ignored long-term temporal ordering, but this module makes alignment differentiable via continuous relaxation so the whole system trains end-to-end on the few-shot objective. The result is higher accuracy on Kinetics and Something-Something-V2 when only a few labeled videos per class are available. A sympathetic reader would care because video data carries natural sequence structure that could reduce the number of examples needed to recognize new actions or events.

Core claim

The central claim is that explicitly leveraging temporal ordering information through temporal alignment produces strong data-efficiency for few-shot video classification; TAM calculates the distance of a query video to novel-class proxies by averaging the per-frame distances along its alignment path, with continuous relaxation enabling end-to-end optimization that yields significant gains over baselines on real-world datasets.

What carries the argument

Temporal Alignment Module (TAM), which averages per-frame distances along a continuous relaxation of the alignment path to produce class distances while preserving temporal order.
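
To make the path-averaging idea concrete, here is a minimal sketch in plain Python. It assumes cosine distance between per-frame feature vectors and uses classic, non-relaxed dynamic time warping to choose the path (minimum summed distance), then averages the distances on that path. The function names (`cosine_distance`, `dtw_path`, `path_averaged_distance`) and the toy two-dimensional features are illustrative assumptions, not the authors' implementation; the paper's variant may differ, for example in how path endpoints are handled.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(query, support):
    """Rows index query frames, columns index support frames."""
    return [[cosine_distance(q, s) for s in support] for q in query]

def dtw_path(dist):
    """Classic DTW: minimum-sum monotone path from (0, 0) to (n-1, m-1)."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist[i - 1][j - 1] + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

def path_averaged_distance(query, support):
    """Average the per-frame distances along the alignment path."""
    dist = distance_matrix(query, support)
    path = dtw_path(dist)
    return sum(dist[i][j] for i, j in path) / len(path)

# Toy clips (hypothetical features): three frames, two feature dimensions.
query = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
same_order = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
reversed_order = list(reversed(same_order))

d_same = path_averaged_distance(query, same_order)
d_reversed = path_averaged_distance(query, reversed_order)
```

On these toy clips the support video that follows the same order as the query receives a distance of 0.0, while the same frames in reversed order receive 1.0, which is the ordering sensitivity the module is meant to exploit.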

If this is right

Significant accuracy gains on Kinetics and Something-Something-V2 in few-shot regimes.
End-to-end training directly optimizes the few-shot classification objective.
Explicit use of long-term temporal ordering that prior methods neglected.
Improved data-efficiency when only a few labeled videos per novel class are supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment idea could be tested on other ordered data such as audio clips or motion-capture sequences.
If alignment proves robust, it might reduce reliance on large-scale pretraining for video tasks.
A natural extension would measure whether the gains hold when frame sampling rates or video lengths vary widely.
Alignment artifacts might appear most clearly on datasets where actions are defined more by object appearance than by sequence.

Load-bearing premise

The premise that averaging distances along an alignment path will reliably improve classification accuracy without introducing artifacts from imperfect alignments or frame sampling choices.

What would settle it

An experiment in which a non-aligned baseline (identical architecture but without the alignment-path averaging) matches or exceeds TAM accuracy on Kinetics and Something-Something-V2 under identical few-shot protocols would falsify the claimed benefit of the temporal component.
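
As a rough illustration of what such a non-aligned baseline could look like, the sketch below compares an order-agnostic distance (mean-pooling frame features before comparing) with a simple order-sensitive one (frame-by-frame comparison, used here as a stand-in for alignment). Both functions, their names, and the toy features are assumptions made for illustration; the paper's actual baselines are not described in the material above.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pool(frames):
    """Average frame features over time, discarding frame order."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]

def pooled_distance(query, support):
    """Order-agnostic baseline: compare time-averaged features."""
    return cosine_distance(mean_pool(query), mean_pool(support))

def framewise_distance(query, support):
    """Order-sensitive stand-in: compare frame t with frame t.

    Assumes both clips have the same number of frames.
    """
    pairs = list(zip(query, support))
    return sum(cosine_distance(q, s) for q, s in pairs) / len(pairs)

# Hypothetical clip and its time-reversed copy.
forward = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
backward = list(reversed(forward))

d_pooled = pooled_distance(forward, backward)
d_framewise = framewise_distance(forward, backward)
```

The pooled distance between a clip and its reversal is essentially zero, so a mean-pooling baseline cannot separate classes that differ only in order, whereas the order-sensitive distance is clearly positive. Whether that difference translates into accuracy on Kinetics or Something-Something-V2 is exactly what the proposed experiment would test.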

Figures

Figures reproduced from arXiv: 1906.11415 by Chien-Yi Chang, Jingwei Ji, Juan Carlos Niebles, Kaidi Cao, Zhangjie Cao.

**Figure 1.** Our few-shot video classification setting. Pairs of semantically matched frames are connected with a blue dashed line. The arrows show the direction of the temporal alignment path. Source excerpt: While the majority of recent few-shot learning works focus on image classification, adapting it to video data is not a trivial extension.

**Figure 2.** Overview of our method. We first extract per-frame deep features using the embedding network. We then compute the distance matrices between the query video and videos in the support set. Next, an alignment score is computed from the matrix representation. Finally, we apply the softmax operator over the alignment score of each novel class.
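
The last step of the pipeline in the Figure 2 caption, a softmax over per-class alignment scores, can be sketched as follows. The sketch assumes the alignment score is distance-like (lower means a better match), so scores are negated before the softmax; that sign convention, the function names, and the class labels are assumptions for illustration rather than details confirmed by the page.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(alignment_scores):
    """Map {class: distance-like alignment score} to {class: probability}.

    Assumption: a lower score means a better alignment, so scores are negated.
    """
    labels = list(alignment_scores)
    probs = softmax([-alignment_scores[label] for label in labels])
    return dict(zip(labels, probs))

def episode_loss(alignment_scores, true_label):
    """Cross-entropy of the true class for one query video."""
    return -math.log(class_probabilities(alignment_scores)[true_label])

# Hypothetical alignment scores of one query against three novel classes.
scores = {"pouring": 0.2, "stirring": 0.9, "folding": 1.4}
probs = class_probabilities(scores)
predicted = max(probs, key=probs.get)
```

Because every step from frame features to this loss is differentiable once the alignment itself is relaxed, gradients of `episode_loss` could in principle flow back to the embedding network, which is the end-to-end property the paper emphasizes.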

**Figure 3.** Methods for calculating the alignment score. Each subplot shows a distance matrix. The darker the color of an entry, the smaller the distance between the corresponding pair of frames. Entries with a green border contribute to the final alignment score. Source excerpt: … a path aligning the two videos from start to end, we allow the algorithm to find a path with flexible starting and ending points, while …

**Figure 4.** Visualization of our learning results. Comparison of our matched results with CMN's matched results in an episode. Although the averaged score is quite high given the false matching and the query image, our algorithm is able to find the correct alignment path that minimizes the alignment score, which ultimately results in the correct prediction.

**Figure 5.** Smoothing factor sensitivity. We compare the effect of using different smoothing factors. Source excerpt: Previous works [5, 22] have shown that using λ empirically helps optimization in many tasks. Intuitively, a smaller λ functions more like the min operation and a larger λ means a heavier smoothing effect over the values in nearby positions. We experimented on λ within the value set of [0.01, 0.05, 0.…
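
The role of λ described in the Figure 5 excerpt can be illustrated with a smoothed minimum. The sketch below assumes the common log-sum-exp form, smooth_min(x; λ) = −λ · log Σ exp(−x_i / λ); the page does not state the exact operator the paper uses, so treat this form, the function name, and the sample values as assumptions.

```python
import math

def smooth_min(values, lam):
    """Smoothed minimum: -lam * log(sum(exp(-v / lam)))."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    lo = min(values)
    # Shift by the true minimum so every exponent is <= 0 (no overflow).
    total = sum(math.exp(-(v - lo) / lam) for v in values)
    return lo - lam * math.log(total)

candidates = [0.2, 0.5, 0.9]  # hypothetical neighbouring costs

sharp = smooth_min(candidates, lam=0.01)   # close to the hard minimum
medium = smooth_min(candidates, lam=0.1)
blurred = smooth_min(candidates, lam=1.0)  # heavily smoothed
```

With λ = 0.01 the result is numerically indistinguishable from the hard minimum of 0.2. As λ grows, the other candidates contribute more and the value drops further below the hard minimum, which matches the "heavier smoothing effect" described in the excerpt.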

Original abstract

There is a growing interest in learning a model which could recognize novel classes with only a few labeled examples. In this paper, we propose Temporal Alignment Module (TAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous works neglect long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video data through temporal alignment. This leads to strong data-efficiency for few-shot learning. Concretely, TAM calculates the distance value of a query video with respect to novel class proxies by averaging the per-frame distances along its alignment path. We introduce continuous relaxation to TAM so the model can be learned in an end-to-end fashion to directly optimize the few-shot learning objective. We evaluate TAM on two challenging real-world datasets, Kinetics and Something-Something-V2, and show that our model leads to significant improvement of few-shot video classification over a wide range of competitive baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TAM adds a differentiable alignment path average to few-shot video classification and reports gains on Kinetics and Something-Something-V2, but the abstract supplies no numbers or ablations and the relaxation step is unexamined.

The main takeaway is that this paper defines TAM as an alignment-based distance for few-shot video, made end-to-end trainable via continuous relaxation, and states that it improves over baselines by using temporal order that prior methods ignore. The formulation itself is a straightforward extension of path averaging to the few-shot setting, and testing on two standard video datasets is the right choice for the claim. That part is concrete and addresses a practical gap in handling sequential data with limited labels.

The soft spots sit in the evidence and the mechanism. The abstract mentions significant improvement but gives no metrics, baseline list, variance numbers, or statistical checks, so it is impossible to judge effect size or whether choices were post-hoc. The stress-test point on the relaxation is worth taking seriously here: if the continuous version alters frame matching or introduces interpolation effects, the averaged distance no longer cleanly reflects discrete ordering, and any gain could come from something else. The paper would need to show that the relaxed path still preserves the ordering benefit under realistic sampling variation, and that is not visible from the summary.

This work is for people already working on few-shot video or temporal meta-learning who want a concrete module to try. A reader who needs a starting point for alignment in low-data video tasks could extract the distance definition and the relaxation trick. It deserves a serious referee because the core idea is technically grounded and the datasets are appropriate, even though the current write-up is light on verification. I would send it to review rather than desk reject so the experiments and any ablations can be checked directly.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the Temporal Alignment Module (TAM) as a few-shot video classification framework. It claims that most prior methods neglect long-term temporal ordering, whereas TAM explicitly computes query-to-class distances by averaging per-frame distances along an alignment path; a continuous relaxation renders the module differentiable for end-to-end optimization of the few-shot objective. Empirical evaluation on Kinetics and Something-Something-V2 is said to yield significant gains over competitive baselines and improved data efficiency.

Significance. If the reported gains are reproducible and attributable to the ordering-preserving alignment rather than relaxation artifacts, the work would usefully extend few-shot video methods by supplying an explicit mechanism for temporal structure. The choice of two challenging real-world datasets is appropriate for the claim.

major comments (2)

[Abstract] Abstract: the central empirical claim ('significant improvement ... over a wide range of competitive baselines' and 'strong data-efficiency') is stated without any quantitative values for the baselines, metrics, effect sizes, number of shots, or statistical significance; this information is load-bearing for assessing whether the temporal-alignment mechanism actually delivers the asserted benefit.
[Abstract] Abstract (TAM description): the method rests on averaging distances along a continuously relaxed alignment path, yet no analytic bound, sensitivity analysis, or ablation is referenced that would demonstrate the relaxed path remains faithful to discrete ordering under realistic frame-sampling variation; without such verification the observed gains could be artifacts of the relaxation rather than evidence for the ordering signal the paper claims to exploit.
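
One simple way to probe the second comment is to compare a relaxed alignment cost with its discrete counterpart on the same distance matrix. The sketch below does this for a generic soft-DTW-style recursion in which the hard `min` over the three predecessors is replaced by a log-sum-exp smoothed minimum with factor `lam`. This is an assumed, generic relaxation, not necessarily the one TAM uses, and the distance matrix is hypothetical.

```python
import math

def smooth_min(values, lam):
    """Smoothed minimum: -lam * log(sum(exp(-v / lam)))."""
    lo = min(values)
    # Infinite entries contribute exp(-inf) = 0.0 and drop out.
    total = sum(math.exp(-(v - lo) / lam) for v in values)
    return lo - lam * math.log(total)

def alignment_cost(dist, lam=None):
    """Accumulated alignment cost; hard DTW if lam is None, relaxed otherwise."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            previous = [cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]]
            best = min(previous) if lam is None else smooth_min(previous, lam)
            cost[i][j] = dist[i - 1][j - 1] + best
    return cost[n][m]

# Hypothetical frame-to-frame distances (rows: query, columns: support).
dist = [
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
]

hard = alignment_cost(dist)
soft_sharp = alignment_cost(dist, lam=0.01)
soft_blurred = alignment_cost(dist, lam=1.0)
gap_sharp = abs(soft_sharp - hard)
gap_blurred = abs(soft_blurred - hard)
```

On this toy matrix the gap to the discrete cost is small for `lam=0.01` and much larger for `lam=1.0`. A sensitivity study of the kind the referee asks for would report such gaps, and the resulting accuracy, across smoothing factors and frame-sampling schemes on real data.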

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the TAM formulation. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract] Abstract: the central empirical claim ('significant improvement ... over a wide range of competitive baselines' and 'strong data-efficiency') is stated without any quantitative values for the baselines, metrics, effect sizes, number of shots, or statistical significance; this information is load-bearing for assessing whether the temporal-alignment mechanism actually delivers the asserted benefit.

Authors: We agree that the abstract would benefit from concrete quantitative anchors. In the revised version we will insert the key reported numbers (e.g., 5-shot and 1-shot top-1 accuracies on Kinetics and Something-Something-V2 together with the absolute gains over the strongest baselines) while remaining within the word limit. revision: yes
Referee: [Abstract] Abstract (TAM description): the method rests on averaging distances along a continuously relaxed alignment path, yet no analytic bound, sensitivity analysis, or ablation is referenced that would demonstrate the relaxed path remains faithful to discrete ordering under realistic frame-sampling variation; without such verification the observed gains could be artifacts of the relaxation rather than evidence for the ordering signal the paper claims to exploit.

Authors: The full manuscript already contains ablations that replace the continuous relaxation with discrete DTW and with random alignments, showing that the ordering signal is responsible for the gains. Nevertheless, we acknowledge that an explicit sensitivity study under varied frame sampling rates is not presented. We will add this analysis (both quantitative tables and qualitative alignment visualizations) to the supplementary material. revision: partial

Circularity Check

0 steps flagged

No circularity; TAM is an independent empirical addition

The paper's central derivation introduces TAM as a new module that averages per-frame distances along an alignment path and applies continuous relaxation for differentiability. No provided equations, self-citations, or claims reduce the claimed data-efficiency gains to a quantity defined by the paper's own fitted inputs or prior self-work by construction. The improvement is presented as arising from the added temporal alignment mechanism and is evaluated on external datasets (Kinetics, Something-Something-V2), rendering the chain self-contained against benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the TAM is introduced as a new module whose internal formulation details are not provided.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

[1] The 20bn-jester dataset v1. https://20bn.com/datasets/jester.
[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[5] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. arXiv preprint arXiv:1901.02598, 2019.
[6] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
[8] P. Dogan, B. Li, L. Sigal, and M. Gross. A neural multi-sequence alignment technique (neumatch). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8749–8758, 2018.
[9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[10] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2017.
[11] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
[12] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 2, page 8, 2017.
[13] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[18] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008 - 19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.
[19] O. Kliper-Gross, T. Hassner, and L. Wolf. One shot similarity metric learning for action recognition. In International Workshop on Similarity-Based Pattern Recognition, pages 31–45. Springer, 2011.
[20] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
[21] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. arXiv preprint arXiv:1811.08383.
[22] A. Mensch and M. Blondel. Differentiable dynamic programming for structured prediction and attention. ICML.
[23] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018.
[24] M. Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
[25] T. Munkhdalai and H. Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2554–2563. JMLR.org, 2017.
[26] A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[28] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5822–5830, 2018.
[29] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[30] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
[31] A. Richard, H. Kuehne, A. Iqbal, and J. Gall. Neuralnetwork-viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7386–7395, 2018.
[32] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960.
[33] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, pages 357–360. ACM, 2007.
[34] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
[35] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[36] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[38] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.
[40] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[41] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[42] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[43] X. Wang and A. Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.
[44] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.
[45] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
[46] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
[47] L. Zhu and Y. Yang. Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–766.
[48] M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 695–712, 2018.
