A Survey on Deep Learning Techniques for Action Anticipation

Juergen Gall; J\"urgen Beyerer; Manuel Martin; Michael Voit; Zeyun Zhong

arxiv: 2309.17257 · v2 · submitted 2023-09-29 · 💻 cs.CV

A Survey on Deep Learning Techniques for Action Anticipation

Zeyun Zhong , Manuel Martin , Michael Voit , Juergen Gall , J\"urgen Beyerer This is my paper

Pith reviewed 2026-05-24 06:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords action anticipationdeep learningsurveydaily living scenariosaction predictionevaluation metricsdatasetshuman-robot interaction

0 comments

The pith

This survey classifies deep learning methods for anticipating human actions in daily-living scenarios by their primary contributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to organize the growing body of work on action anticipation so that readers can quickly see the main technical directions and compare approaches. It groups methods according to what each contributes most, condenses the details into tables, and reviews the datasets and metrics that are commonly used. The survey then lays out systematic suggestions for future work. A reader would care because anticipating actions matters for systems that must act ahead of time, such as robots or vehicles, and a clear map of existing techniques reduces duplicated effort.

Core claim

By classifying action anticipation algorithms according to their primary contributions and summarizing them in tables, the survey establishes a structured overview of recent deep learning advances focused on daily-living scenarios, along with common evaluation metrics, datasets, and directions for future research.

What carries the argument

Classification of methods by primary contributions, accompanied by tabular summaries of the reviewed approaches.

If this is right

Researchers gain an at-a-glance comparison of how different techniques relate to one another.
Standard practices for evaluation become clearer through the consolidated discussion of metrics and datasets.
Future algorithm development can be guided by the identified gaps and suggested directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same classification lens could be applied to neighboring tasks such as long-term action forecasting.
Real-world deployment in human-robot interaction might prioritize methods from categories that handle partial observations well.
New benchmarks could be constructed to test whether methods from one category generalize better than others across scenarios.

Load-bearing premise

The survey's selection of methods, datasets, and metrics accurately covers the current literature without major omissions or selection bias.

What would settle it

Identification of several prominent recent papers on action anticipation in daily-living settings that are absent from the classification tables.

Figures

Figures reproduced from arXiv: 2309.17257 by Juergen Gall, J\"urgen Beyerer, Manuel Martin, Michael Voit, Zeyun Zhong.

**Figure 2.** Figure 2: Predictability of actions. TABLE 1: Definition of action anticipation and its related tasks. Task Definition Action Anticipation Anticipates one or multiple future actions before they happen. Action Recognition Categorizes actions in a video or image sequence given the full observation. Early Action Recognition Categorizes actions in a video or image sequence given the partial observation. Action Segmen… view at source ↗

**Figure 3.** Figure 3: Category of the action anticipation task. Short-term [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Classification of action anticipation methods. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Three typical fusion strategies for multi-modal action anticipation. Feature extractors are omitted for brevity. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Abstraction level of activities. issue, they proposed a spatio-temporal relational network (STRN) by broadening the Convolutional Gated Recurrent Units (ConvGRU) [173], which can capture the spatial along with the temporal context. The underlying intention. To get something done, humans perform a sequence of actions dictated by an intention [174]. It is therefore advantageous to understand human intention … view at source ↗

read the original abstract

The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review the recent advances of action anticipation algorithms with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation and provide future directions with systematical discussions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a standard survey that organizes action anticipation work with tables and a focus on daily scenarios but adds no new methods or results.

read the letter

This is a survey that reviews recent deep learning approaches to action anticipation, focusing on daily-living scenarios. It classifies the methods according to their primary contributions, summarizes them in tables, reviews common metrics and datasets, and outlines future directions. The tabular summaries are the strongest feature here. They allow a reader to compare different lines of work side by side, which is helpful for getting oriented in a growing area. The discussion of evaluation metrics and datasets also provides a practical reference that many surveys include but not always with enough detail. Nothing in the structure suggests problems with the central claims. The classification is a standard way to handle this material, and the abstract presents it as a review rather than a new technical result. The main uncertainty is whether the coverage is thorough. Any survey risks omitting relevant papers or having an implicit bias in selection, and that can only be judged from the full list of included works. The abstract gives no reason to suspect major issues, but that's the part that would need checking. This paper is for researchers who are new to action anticipation or who want a consolidated view of the field for their own literature review. It is not the kind of work that advances the state of the art on its own. I'd recommend sending it to peer review. With a solid check on the included papers and any gaps, it could become a useful reference point even if the contribution remains modest.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey reviewing recent deep learning-based methods for action anticipation, with a focus on daily-living scenarios. It classifies methods according to primary contributions, summarizes them in tables, discusses common evaluation metrics and datasets, and provides systematic discussions of future directions.

Significance. A well-organized survey with tabular summaries can help the computer vision community by consolidating the literature on action anticipation, clarifying trends in daily-living applications, and highlighting open problems; the tabular format and metric/dataset review are practical strengths for readers.

minor comments (3)

[Abstract] Abstract: the claim of reviewing 'recent advances' would be strengthened by stating the publication years or number of papers covered to allow readers to gauge scope and potential selection bias.
The classification scheme by 'primary contributions' should include a brief justification or decision criteria in the main text to make the taxonomy reproducible and less subjective.
Ensure all tables include clear captions, column definitions, and references back to the corresponding sections discussing each method.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and significance assessment of our survey, as well as the recommendation for minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No circularity: survey paper with no derivations or predictions

full rationale

This is a literature survey that reviews, classifies, and tabulates existing action anticipation methods without presenting any original derivations, equations, fitted parameters, or predictions. The abstract and description explicitly frame the work as classification and summarization of prior literature, with no self-citation chains or ansatzes invoked as load-bearing premises. No steps match any of the enumerated circularity patterns because the paper makes no claims that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new scientific claims, derivations, or postulated entities.

pith-pipeline@v0.9.0 · 5633 in / 943 out tokens · 18589 ms · 2026-05-24T06:39:52.895358+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

207 extracted references · 207 canonical work pages · 6 internal anchors

[1]

Slowfast networks for video recognition,

C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in ICCV, 2019, pp. 6202–6211

work page 2019
[2]

Video swin transformer,

Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in CVPR, 2022, pp. 3202–3211

work page 2022
[3]

Philipp, ¨Uber die Formalisierung und Analyse medizinischer Prozesse im Kontext von Expertenwissen und k¨ unstlicher Intelligenz

P . Philipp, ¨Uber die Formalisierung und Analyse medizinischer Prozesse im Kontext von Expertenwissen und k¨ unstlicher Intelligenz. KIT Scientific Publishing, 2023

work page 2023
[4]

Anticipating visual representations from unlabeled video,

C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in CVPR, 2016, pp. 98– 106

work page 2016
[5]

Temporal Aggregate Repre- sentations for Long-Range Video Understanding,

F. Sener, D. Singhania, and A. Yao, “Temporal Aggregate Repre- sentations for Long-Range Video Understanding,” inECCV, 2020

work page 2020
[6]

MeMViT: Memory-Augmented Multiscale Vi- sion Transformer for Efficient Long-Term Video Recognition,

C.-Y. Wu, Y. Li, K. Mangalam, H. Fan, B. Xiong, J. Malik, and C. Feichtenhofer, “MeMViT: Memory-Augmented Multiscale Vi- sion Transformer for Efficient Long-Term Video Recognition,” in CVPR, 2022

work page 2022
[7]

Scaling egocentric vision: The epic-kitchens dataset,

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al. , “Scaling egocentric vision: The epic-kitchens dataset,” in ECCV, 2018, pp. 720–736

work page 2018
[8]

Temporal action segmentation: An analysis of modern technique,

G. Ding, F. Sener, and A. Yao, “Temporal action segmentation: An analysis of modern technique,” arXiv preprint arXiv:2210.10352 , 2022

work page arXiv 2022
[9]

Deep learning-based action detection in untrimmed videos: A survey,

E. Vahdani and Y. Tian, “Deep learning-based action detection in untrimmed videos: A survey,” TP AMI, 2022

work page 2022
[10]

A Review on Deep Learning Techniques for Video Prediction,

S. Oprea, P . Martinez-Gonzalez, A. Garcia-Garcia, J. A. Castro- Vargas, S. Orts-Escolano, J. Garcia-Rodriguez, and A. Argyros, “A Review on Deep Learning Techniques for Video Prediction,” TP AMI, no. 6, 2020

work page 2020
[11]

Human motion trajectory prediction: A sur- vey,

A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, “Human motion trajectory prediction: A sur- vey,” The International Journal of Robotics Research , vol. 39, no. 8, pp. 895–935, 2020

work page 2020
[12]

3d human motion prediction: A survey,

K. Lyu, H. Chen, Z. Liu, B. Zhang, and R. Wang, “3d human motion prediction: A survey,” Neurocomputing, vol. 489, pp. 345– 365, 2022. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 17

work page 2022
[13]

Convolutional lstm network: A machine learning ap- proach for precipitation nowcasting,

X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.- c. Woo, “Convolutional lstm network: A machine learning ap- proach for precipitation nowcasting,” in NeurIPS, vol. 28, 2015

work page 2015
[14]

Price prediction strategies for market-based scheduling,

J. K. MacKie-Mason, A. V . Osepayshvili, D. M. Reeves, and M. P . Wellman, “Price prediction strategies for market-based scheduling,” in International Conference on Automated Planning and Scheduling, 2004

work page 2004
[15]

Human In- tention Estimation based on Hidden Markov Model Motion Validation for Safe Flexible Robotized Warehouses,

T. Petkovi ´c, D. Puljiz, I. Markovi ´c, and B. Hein, “Human In- tention Estimation based on Hidden Markov Model Motion Validation for Safe Flexible Robotized Warehouses,” Robotics and Computer-Integrated Manufacturing, 2019

work page 2019
[16]

Anticipating Human Activi- ties Using Object Affordances for Reactive Robotic Response,

H. S. Koppula and A. Saxena, “Anticipating Human Activi- ties Using Object Affordances for Reactive Robotic Response,” TP AMI, no. 1, 2016

work page 2016
[17]

Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models,

A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models,” in ICCV, 2015

work page 2015
[18]

Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs,

A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs,” in BMVC, 2019

work page 2019
[19]

Help by predicting what to do,

E. Alati, L. Mauro, V . Ntouskos, and F. Pirri, “Help by predicting what to do,” in ICIP, 2019

work page 2019
[20]

Anticipating the Start of User Interaction for Service Robot in the Wild,

K. Ito, Q. Kong, S. Horiguchi, T. Sumiyoshi, and K. Nagamatsu, “Anticipating the Start of User Interaction for Service Robot in the Wild,” in ICRA, 2020

work page 2020
[21]

Antici- pation in human-robot cooperation: A recurrent neural network approach for multiple action sequences prediction,

P . Schydlo, M. Rakovic, L. Jamone, and J. Santos-Victor, “Antici- pation in human-robot cooperation: A recurrent neural network approach for multiple action sequences prediction,” in ICRA, 2018, pp. 5909–5914

work page 2018
[22]

Using gaze patterns to predict task intent in collaboration,

C.-M. Huang, S. Andrist, A. Saupp ´e, and B. Mutlu, “Using gaze patterns to predict task intent in collaboration,” Frontiers in psychology, vol. 6, p. 1049, 2015

work page 2015
[23]

Generating notifications for missing actions: Don’t forget to turn the lights off!

B. Soran, A. Farhadi, and L. Shapiro, “Generating notifications for missing actions: Don’t forget to turn the lights off!” in ICCV, 2015, pp. 4669–4677

work page 2015
[24]

Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization,

K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. C. Niebles, and M. Sun, “Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization,” in CVPR, 2017

work page 2017
[25]

Anticipating traffic accidents with adaptive loss and large-scale incident db,

T. Suzuki, H. Kataoka, Y. Aoki, and Y. Satoh, “Anticipating traffic accidents with adaptive loss and large-scale incident db,” in CVPR, 2018, pp. 3521–3529

work page 2018
[26]

A comprehensive survey on human activity prediction,

N. P . Trong, H. Nguyen, K. Kazunori, and B. Le Hoai, “A comprehensive survey on human activity prediction,” in ICCSA, 2017, pp. 411–425

work page 2017
[27]

Deep learning for vision-based prediction: A sur- vey,

A. Rasouli, “Deep learning for vision-based prediction: A sur- vey,” arXiv preprint arXiv:2007.00095, 2020

work page arXiv 2007
[28]

Human action recognition and prediction: A survey,

Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” IJCV, vol. 130, no. 5, pp. 1366–1401, 2022

work page 2022
[29]

Pre- dicting the future from first person (egocentric) vision: A survey,

I. Rodin, A. Furnari, D. Mavroeidis, and G. M. Farinella, “Pre- dicting the future from first person (egocentric) vision: A survey,” Computer Vision and Image Understanding, vol. 211, p. 103252, 2021

work page 2021
[30]

Online human action detection and anticipation in videos: A survey,

X. Hu, J. Dai, M. Li, C. Peng, Y. Li, and S. Du, “Online human action detection and anticipation in videos: A survey,” Neurocom- puting, vol. 491, pp. 395–413, 2022

work page 2022
[31]

An outlook into the future of egocentric vision,

C. Plizzari, G. Goletto, A. Furnari, S. Bansal, F. Ragusa, G. M. Farinella, D. Damen, and T. Tommasi, “An outlook into the future of egocentric vision,” arXiv preprint arXiv:2308.07123, 2023

work page arXiv 2023
[32]

Modeling Complex Temporal Compo- sition of Actionlets for Activity Prediction,

K. Li, J. Hu, and Y. Fu, “Modeling Complex Temporal Compo- sition of Actionlets for Activity Prediction,” in ECCV, vol. 7572, 2012, pp. 286–299

work page 2012
[33]

Prediction of Human Activity by Discovering Temporal Sequence Patterns,

K. Li and Y. Fu, “Prediction of Human Activity by Discovering Temporal Sequence Patterns,” TP AMI, vol. 36, no. 8, pp. 1644– 1657, 2014

work page 2014
[34]

First-person activity forecasting with online inverse reinforcement learning,

N. Rhinehart and K. M. Kitani, “First-person activity forecasting with online inverse reinforcement learning,” in ICCV, 2017, pp. 3696–3705

work page 2017
[35]

A poisson process model for activity forecasting,

T. Mahmud, M. Hasan, A. Chakraborty, and A. K. Roy- Chowdhury, “A poisson process model for activity forecasting,” in ICIP, 2016, pp. 3339–3343

work page 2016
[36]

Predicting Human Activities Using Stochastic Grammar,

S. Qi, S. Huang, P . Wei, and S.-C. Zhu, “Predicting Human Activities Using Stochastic Grammar,” in ICCV, 2017

work page 2017
[37]

Very deep convolutional net- works for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional net- works for large-scale image recognition,” in ICLR, 2015

work page 2015
[38]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778

work page 2016
[39]

Temporal Segment Networks: Towards Good Prac- tices for Deep Action Recognition,

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal Segment Networks: Towards Good Prac- tices for Deep Action Recognition,” in ECCV, 2016

work page 2016
[40]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in ICLR, 2021

work page 2021
[41]

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,

J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,” in CVPR, 2017

work page 2017
[42]

Two-stream convolutional net- works for action recognition in videos,

K. Simonyan and A. Zisserman, “Two-stream convolutional net- works for action recognition in videos,” in NeurIPS, vol. 27, 2014

work page 2014
[43]

A closer look at spatiotemporal convolutions for action recogni- tion,

D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recogni- tion,” in CVPR, 2018, pp. 6450–6459

work page 2018
[44]

Mvitv2: Improved multiscale vision trans- formers for classification and detection,

Y. Li, C.-Y. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, and C. Feichtenhofer, “Mvitv2: Improved multiscale vision trans- formers for classification and detection,” in CVPR, 2022, pp. 4804–4814

work page 2022
[45]

RED: Reinforced Encoder- Decoder Networks for Action Anticipation,

J. Gao, Z. Yang, and R. Nevatia, “RED: Reinforced Encoder- Decoder Networks for Action Anticipation,” in BMVC, 2017

work page 2017
[46]

When will you do what? - Anticipating Temporal Occurrences of Activities,

Y. A. Farha, A. Richard, and J. Gall, “When will you do what? - Anticipating Temporal Occurrences of Activities,” inCVPR, 2018

work page 2018
[47]

Ego4D: Around the World in 3,000 Hours of Egocentric Video,

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar et al. , “Ego4D: Around the World in 3,000 Hours of Egocentric Video,” in CVPR, 2022

work page 2022
[48]

Uncertainty-Aware Anticipation of Activities,

Y. Abu Farha and J. Gall, “Uncertainty-Aware Anticipation of Activities,” in ICCV Workshop, 2019

work page 2019
[49]

On diverse asynchronous activity anticipation,

H. Zhao and R. P . Wildes, “On diverse asynchronous activity anticipation,” in ECCV, 2020, pp. 781–799

work page 2020
[50]

Weakly supervised action learning with rnn based fine-to-coarse modeling,

A. Richard, H. Kuehne, and J. Gall, “Weakly supervised action learning with rnn based fine-to-coarse modeling,” in CVPR, 2017, pp. 754–763

work page 2017
[51]

Ms-tcn: Multi-stage temporal convo- lutional network for action segmentation,

Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convo- lutional network for action segmentation,” in CVPR, 2019, pp. 3575–3584

work page 2019
[52]

Weakly-supervised dense action anticipation,

H. Zhang, F. Chen, and A. Yao, “Weakly-supervised dense action anticipation,” in BMVC, 2021

work page 2021
[53]

A Variational Auto-Encoder Model for Stochastic Point Pro- cesses,

N. Mehrasa, A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori, “A Variational Auto-Encoder Model for Stochastic Point Pro- cesses,” in CVPR, 2019

work page 2019
[54]

Faster r-cnn: Towards real- time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real- time object detection with region proposal networks,” inNeurIPS, vol. 28, 2015

work page 2015
[55]

Mask r-cnn,

K. He, G. Gkioxari, P . Doll ´ar, and R. Girshick, “Mask r-cnn,” in ICCV, 2017, pp. 2961–2969

work page 2017
[56]

Imagenet classi- fication with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- fication with deep convolutional neural networks,” in NeurIPS, 2012

work page 2012
[57]

High five: Recognising human interactions in tv shows

A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid, “High five: Recognising human interactions in tv shows.” in BMVC, vol. 1, no. 2, 2010, p. 33

work page 2010
[58]

Detecting activities of daily living in first-person camera views,

H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in CVPR, 2012, pp. 2847– 2854

work page 2012
[59]

Thumos challenge: Action recognition with a large number of classes,

Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, “Thumos challenge: Action recognition with a large number of classes,” 2014

work page 2014
[60]

Visual forecasting by imitating dynamics in natural sequences,

K.-H. Zeng, W. B. Shen, D.-A. Huang, M. Sun, and J. Car- los Niebles, “Visual forecasting by imitating dynamics in natural sequences,” in ICCV, 2017, pp. 2999–3008

work page 2017
[61]

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

Y. Xiong, L. Wang, Z. Wang, B. Zhang, H. Song, W. Li, D. Lin, Y. Qiao, L. Van Gool, and X. Tang, “Cuhk & ethz & siat submission to activitynet challenge 2016,” arXiv preprint arXiv:1608.00797, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[62]

Online action detection,

R. D. Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuyte- laars, “Online action detection,” in ECCV, 2016, pp. 269–284

work page 2016
[63]

Unsupervised learning for forecast- ing action representations,

Y. Zhong and W.-S. Zheng, “Unsupervised learning for forecast- ing action representations,” in ICIP, 2018, pp. 1073–1077

work page 2018
[64]

Knowledge distillation for human action anticipation,

V . Tran, Y. Wang, Z. Zhang, and M. Hoai, “Knowledge distillation for human action anticipation,” in ICIP, 2021, pp. 2518–2522

work page 2021
[65]

Anticipating human actions by cor- relating past with the future with Jaccard similarity measures,

B. Fernando and S. Herath, “Anticipating human actions by cor- relating past with the future with Jaccard similarity measures,” in CVPR, 2021

work page 2021
[66]

The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities,

H. Kuehne, A. Arslan, and T. Serre, “The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities,” in CVPR, 2014. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 18

work page 2014
[67]

Anticipative Video Transformer,

R. Girdhar and K. Grauman, “Anticipative Video Transformer,” in ICCV, 2021

work page 2021
[68]

Rescaling egocentric vision: Collection, pipeline and challenges for epic- kitchens-100,

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Priceet al., “Rescaling egocentric vision: Collection, pipeline and challenges for epic- kitchens-100,” IJCV, pp. 1–23, 2022

work page 2022
[69]

In the eye of beholder: Joint learning of gaze and actions in first person video,

Y. Li, M. Liu, and J. M. Rehg, “In the eye of beholder: Joint learning of gaze and actions in first person video,” in ECCV, 2018, pp. 619–635

work page 2018
[70]

Combining embedded accelerom- eters with computer vision for recognizing food preparation activities,

S. Stein and S. J. McKenna, “Combining embedded accelerom- eters with computer vision for recognizing food preparation activities,” in UbiComp, 2013, pp. 729–738

work page 2013
[71]

Learning To Anticipate Future With Dynamic Context Removal,

X. Xu, Y.-L. Li, and C. Lu, “Learning To Anticipate Future With Dynamic Context Removal,” in CVPR, 2022

work page 2022
[72]

Tsm: Temporal shift module for efficient video understanding,

J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in ICCV, 2019, pp. 7083–7093

work page 2019
[73]

Latency mat- ters: Real-time action forecasting transformer,

H. Girase, N. Agarwal, C. Choi, and K. Mangalam, “Latency mat- ters: Real-time action forecasting transformer,” in CVPR, 2023, pp. 18 759–18 769

work page 2023
[74]

Fore- casting future action sequences with neural memory networks,

H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Fore- casting future action sequences with neural memory networks,” in BMVC, 2019

work page 2019
[75]

Temporal recurrent networks for online action detection,

M. Xu, M. Gao, Y.-T. Chen, L. S. Davis, and D. J. Crandall, “Temporal recurrent networks for online action detection,” in ICCV, 2019, pp. 5532–5541

work page 2019
[76]

TTPP: Tempo- ral Transformer with Progressive Prediction for Efficient Action Anticipation,

W. Wang, X. Peng, Y. Su, Y. Qiao, and J. Cheng, “TTPP: Tempo- ral Transformer with Progressive Prediction for Efficient Action Anticipation,” Neurocomputing, 2020

work page 2020
[77]

Lap-net: Adaptive features sampling via learning action progression for online action detection,

S. Qu, G. Chen, D. Xu, J. Dong, F. Lu, and A. Knoll, “Lap-net: Adaptive features sampling via learning action progression for online action detection,” arXiv preprint arXiv:2011.07915, 2020

work page arXiv 2011
[78]

Learning to Anticipate Egocentric Actions by Imagination,

Y. Wu, L. Zhu, X. Wang, Y. Yang, and F. Wu, “Learning to Anticipate Egocentric Actions by Imagination,” TIP, 2021

work page 2021
[79]

Long short-term transformer for online action detection,

M. Xu, Y. Xiong, H. Chen, X. Li, W. Xia, Z. Tu, and S. Soatto, “Long short-term transformer for online action detection,” in NeurIPS, vol. 34, 2021, pp. 1086–1099

work page 2021
[80]

Oadtr: Online action detection with transformers,

X. Wang, S. Zhang, Z. Qing, Y. Shao, Z. Zuo, C. Gao, and N. Sang, “Oadtr: Online action detection with transformers,” in ICCV, 2021

work page 2021

Showing first 80 references.

[1] [1]

Slowfast networks for video recognition,

C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in ICCV, 2019, pp. 6202–6211

work page 2019

[2] [2]

Video swin transformer,

Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in CVPR, 2022, pp. 3202–3211

work page 2022

[3] [3]

Philipp, ¨Uber die Formalisierung und Analyse medizinischer Prozesse im Kontext von Expertenwissen und k¨ unstlicher Intelligenz

P . Philipp, ¨Uber die Formalisierung und Analyse medizinischer Prozesse im Kontext von Expertenwissen und k¨ unstlicher Intelligenz. KIT Scientific Publishing, 2023

work page 2023

[4] [4]

Anticipating visual representations from unlabeled video,

C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in CVPR, 2016, pp. 98– 106

work page 2016

[5] [5]

Temporal Aggregate Repre- sentations for Long-Range Video Understanding,

F. Sener, D. Singhania, and A. Yao, “Temporal Aggregate Repre- sentations for Long-Range Video Understanding,” inECCV, 2020

work page 2020

[6] [6]

MeMViT: Memory-Augmented Multiscale Vi- sion Transformer for Efficient Long-Term Video Recognition,

C.-Y. Wu, Y. Li, K. Mangalam, H. Fan, B. Xiong, J. Malik, and C. Feichtenhofer, “MeMViT: Memory-Augmented Multiscale Vi- sion Transformer for Efficient Long-Term Video Recognition,” in CVPR, 2022

work page 2022

[7] [7]

Scaling egocentric vision: The epic-kitchens dataset,

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al. , “Scaling egocentric vision: The epic-kitchens dataset,” in ECCV, 2018, pp. 720–736

work page 2018

[8] [8]

Temporal action segmentation: An analysis of modern technique,

G. Ding, F. Sener, and A. Yao, “Temporal action segmentation: An analysis of modern technique,” arXiv preprint arXiv:2210.10352 , 2022

work page arXiv 2022

[9] [9]

Deep learning-based action detection in untrimmed videos: A survey,

E. Vahdani and Y. Tian, “Deep learning-based action detection in untrimmed videos: A survey,” TP AMI, 2022

work page 2022

[10] [10]

A Review on Deep Learning Techniques for Video Prediction,

S. Oprea, P . Martinez-Gonzalez, A. Garcia-Garcia, J. A. Castro- Vargas, S. Orts-Escolano, J. Garcia-Rodriguez, and A. Argyros, “A Review on Deep Learning Techniques for Video Prediction,” TP AMI, no. 6, 2020

work page 2020

[11] [11]

Human motion trajectory prediction: A sur- vey,

A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, “Human motion trajectory prediction: A sur- vey,” The International Journal of Robotics Research , vol. 39, no. 8, pp. 895–935, 2020

work page 2020

[12] [12]

3d human motion prediction: A survey,

K. Lyu, H. Chen, Z. Liu, B. Zhang, and R. Wang, “3d human motion prediction: A survey,” Neurocomputing, vol. 489, pp. 345– 365, 2022. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 17

work page 2022

[13] [13]

Convolutional lstm network: A machine learning ap- proach for precipitation nowcasting,

X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.- c. Woo, “Convolutional lstm network: A machine learning ap- proach for precipitation nowcasting,” in NeurIPS, vol. 28, 2015

work page 2015

[14] [14]

Price prediction strategies for market-based scheduling,

J. K. MacKie-Mason, A. V . Osepayshvili, D. M. Reeves, and M. P . Wellman, “Price prediction strategies for market-based scheduling,” in International Conference on Automated Planning and Scheduling, 2004

work page 2004

[15] [15]

Human In- tention Estimation based on Hidden Markov Model Motion Validation for Safe Flexible Robotized Warehouses,

T. Petkovi ´c, D. Puljiz, I. Markovi ´c, and B. Hein, “Human In- tention Estimation based on Hidden Markov Model Motion Validation for Safe Flexible Robotized Warehouses,” Robotics and Computer-Integrated Manufacturing, 2019

work page 2019

[16] [16]

Anticipating Human Activi- ties Using Object Affordances for Reactive Robotic Response,

H. S. Koppula and A. Saxena, “Anticipating Human Activi- ties Using Object Affordances for Reactive Robotic Response,” TP AMI, no. 1, 2016

work page 2016

[17] [17]

Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models,

A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models,” in ICCV, 2015

work page 2015

[18] [18]

Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs,

A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Pedestrian Action Anticipation using Contextual Feature Fusion in Stacked RNNs,” in BMVC, 2019

work page 2019

[19] [19]

Help by predicting what to do,

E. Alati, L. Mauro, V . Ntouskos, and F. Pirri, “Help by predicting what to do,” in ICIP, 2019

work page 2019

[20] [20]

Anticipating the Start of User Interaction for Service Robot in the Wild,

K. Ito, Q. Kong, S. Horiguchi, T. Sumiyoshi, and K. Nagamatsu, “Anticipating the Start of User Interaction for Service Robot in the Wild,” in ICRA, 2020

work page 2020

[21] [21]

Antici- pation in human-robot cooperation: A recurrent neural network approach for multiple action sequences prediction,

P . Schydlo, M. Rakovic, L. Jamone, and J. Santos-Victor, “Antici- pation in human-robot cooperation: A recurrent neural network approach for multiple action sequences prediction,” in ICRA, 2018, pp. 5909–5914

work page 2018

[22] [22]

Using gaze patterns to predict task intent in collaboration,

C.-M. Huang, S. Andrist, A. Saupp ´e, and B. Mutlu, “Using gaze patterns to predict task intent in collaboration,” Frontiers in psychology, vol. 6, p. 1049, 2015

work page 2015

[23] [23]

Generating notifications for missing actions: Don’t forget to turn the lights off!

B. Soran, A. Farhadi, and L. Shapiro, “Generating notifications for missing actions: Don’t forget to turn the lights off!” in ICCV, 2015, pp. 4669–4677

work page 2015

[24] [24]

Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization,

K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. C. Niebles, and M. Sun, “Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization,” in CVPR, 2017

work page 2017

[25] [25]

Anticipating traffic accidents with adaptive loss and large-scale incident db,

T. Suzuki, H. Kataoka, Y. Aoki, and Y. Satoh, “Anticipating traffic accidents with adaptive loss and large-scale incident db,” in CVPR, 2018, pp. 3521–3529

work page 2018

[26] [26]

A comprehensive survey on human activity prediction,

N. P . Trong, H. Nguyen, K. Kazunori, and B. Le Hoai, “A comprehensive survey on human activity prediction,” in ICCSA, 2017, pp. 411–425

work page 2017

[27] [27]

Deep learning for vision-based prediction: A sur- vey,

A. Rasouli, “Deep learning for vision-based prediction: A sur- vey,” arXiv preprint arXiv:2007.00095, 2020

work page arXiv 2007

[28] [28]

Human action recognition and prediction: A survey,

Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” IJCV, vol. 130, no. 5, pp. 1366–1401, 2022

work page 2022

[29] [29]

Pre- dicting the future from first person (egocentric) vision: A survey,

I. Rodin, A. Furnari, D. Mavroeidis, and G. M. Farinella, “Pre- dicting the future from first person (egocentric) vision: A survey,” Computer Vision and Image Understanding, vol. 211, p. 103252, 2021

work page 2021

[30] [30]

Online human action detection and anticipation in videos: A survey,

X. Hu, J. Dai, M. Li, C. Peng, Y. Li, and S. Du, “Online human action detection and anticipation in videos: A survey,” Neurocom- puting, vol. 491, pp. 395–413, 2022

work page 2022

[31] [31]

An outlook into the future of egocentric vision,

C. Plizzari, G. Goletto, A. Furnari, S. Bansal, F. Ragusa, G. M. Farinella, D. Damen, and T. Tommasi, “An outlook into the future of egocentric vision,” arXiv preprint arXiv:2308.07123, 2023

work page arXiv 2023

[32] [32]

Modeling Complex Temporal Compo- sition of Actionlets for Activity Prediction,

K. Li, J. Hu, and Y. Fu, “Modeling Complex Temporal Compo- sition of Actionlets for Activity Prediction,” in ECCV, vol. 7572, 2012, pp. 286–299

work page 2012

[33] [33]

Prediction of Human Activity by Discovering Temporal Sequence Patterns,

K. Li and Y. Fu, “Prediction of Human Activity by Discovering Temporal Sequence Patterns,” TP AMI, vol. 36, no. 8, pp. 1644– 1657, 2014

work page 2014

[34] [34]

First-person activity forecasting with online inverse reinforcement learning,

N. Rhinehart and K. M. Kitani, “First-person activity forecasting with online inverse reinforcement learning,” in ICCV, 2017, pp. 3696–3705

work page 2017

[35] [35]

A poisson process model for activity forecasting,

T. Mahmud, M. Hasan, A. Chakraborty, and A. K. Roy- Chowdhury, “A poisson process model for activity forecasting,” in ICIP, 2016, pp. 3339–3343

work page 2016

[36] [36]

Predicting Human Activities Using Stochastic Grammar,

S. Qi, S. Huang, P . Wei, and S.-C. Zhu, “Predicting Human Activities Using Stochastic Grammar,” in ICCV, 2017

work page 2017

[37] [37]

Very deep convolutional net- works for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional net- works for large-scale image recognition,” in ICLR, 2015

work page 2015

[38] [38]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778

work page 2016

[39] [39]

Temporal Segment Networks: Towards Good Prac- tices for Deep Action Recognition,

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal Segment Networks: Towards Good Prac- tices for Deep Action Recognition,” in ECCV, 2016

work page 2016

[40] [40]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in ICLR, 2021

work page 2021

[41] [41]

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,

J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,” in CVPR, 2017

work page 2017

[42] [42]

Two-stream convolutional net- works for action recognition in videos,

K. Simonyan and A. Zisserman, “Two-stream convolutional net- works for action recognition in videos,” in NeurIPS, vol. 27, 2014

work page 2014

[43] [43]

A closer look at spatiotemporal convolutions for action recogni- tion,

D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recogni- tion,” in CVPR, 2018, pp. 6450–6459

work page 2018

[44] [44]

Mvitv2: Improved multiscale vision trans- formers for classification and detection,

Y. Li, C.-Y. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, and C. Feichtenhofer, “Mvitv2: Improved multiscale vision trans- formers for classification and detection,” in CVPR, 2022, pp. 4804–4814

work page 2022

[45] [45]

RED: Reinforced Encoder- Decoder Networks for Action Anticipation,

J. Gao, Z. Yang, and R. Nevatia, “RED: Reinforced Encoder- Decoder Networks for Action Anticipation,” in BMVC, 2017

work page 2017

[46] [46]

When will you do what? - Anticipating Temporal Occurrences of Activities,

Y. A. Farha, A. Richard, and J. Gall, “When will you do what? - Anticipating Temporal Occurrences of Activities,” inCVPR, 2018

work page 2018

[47] [47]

Ego4D: Around the World in 3,000 Hours of Egocentric Video,

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar et al. , “Ego4D: Around the World in 3,000 Hours of Egocentric Video,” in CVPR, 2022

work page 2022

[48] [48]

Uncertainty-Aware Anticipation of Activities,

Y. Abu Farha and J. Gall, “Uncertainty-Aware Anticipation of Activities,” in ICCV Workshop, 2019

work page 2019

[49] [49]

On diverse asynchronous activity anticipation,

H. Zhao and R. P . Wildes, “On diverse asynchronous activity anticipation,” in ECCV, 2020, pp. 781–799

work page 2020

[50] [50]

Weakly supervised action learning with rnn based fine-to-coarse modeling,

A. Richard, H. Kuehne, and J. Gall, “Weakly supervised action learning with rnn based fine-to-coarse modeling,” in CVPR, 2017, pp. 754–763

work page 2017

[51] [51]

Ms-tcn: Multi-stage temporal convo- lutional network for action segmentation,

Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convo- lutional network for action segmentation,” in CVPR, 2019, pp. 3575–3584

work page 2019

[52] [52]

Weakly-supervised dense action anticipation,

H. Zhang, F. Chen, and A. Yao, “Weakly-supervised dense action anticipation,” in BMVC, 2021

work page 2021

[53] [53]

A Variational Auto-Encoder Model for Stochastic Point Pro- cesses,

N. Mehrasa, A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori, “A Variational Auto-Encoder Model for Stochastic Point Pro- cesses,” in CVPR, 2019

work page 2019

[54] [54]

Faster r-cnn: Towards real- time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real- time object detection with region proposal networks,” inNeurIPS, vol. 28, 2015

work page 2015

[55] [55]

Mask r-cnn,

K. He, G. Gkioxari, P . Doll ´ar, and R. Girshick, “Mask r-cnn,” in ICCV, 2017, pp. 2961–2969

work page 2017

[56] [56]

Imagenet classi- fication with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- fication with deep convolutional neural networks,” in NeurIPS, 2012

work page 2012

[57] [57]

High five: Recognising human interactions in tv shows

A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid, “High five: Recognising human interactions in tv shows.” in BMVC, vol. 1, no. 2, 2010, p. 33

work page 2010

[58] [58]

Detecting activities of daily living in first-person camera views,

H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in CVPR, 2012, pp. 2847– 2854

work page 2012

[59] [59]

Thumos challenge: Action recognition with a large number of classes,

Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, “Thumos challenge: Action recognition with a large number of classes,” 2014

work page 2014

[60] [60]

Visual forecasting by imitating dynamics in natural sequences,

K.-H. Zeng, W. B. Shen, D.-A. Huang, M. Sun, and J. Car- los Niebles, “Visual forecasting by imitating dynamics in natural sequences,” in ICCV, 2017, pp. 2999–3008

work page 2017

[61] [61]

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

Y. Xiong, L. Wang, Z. Wang, B. Zhang, H. Song, W. Li, D. Lin, Y. Qiao, L. Van Gool, and X. Tang, “Cuhk & ethz & siat submission to activitynet challenge 2016,” arXiv preprint arXiv:1608.00797, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[62] [62]

Online action detection,

R. D. Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuyte- laars, “Online action detection,” in ECCV, 2016, pp. 269–284

work page 2016

[63] [63]

Unsupervised learning for forecast- ing action representations,

Y. Zhong and W.-S. Zheng, “Unsupervised learning for forecast- ing action representations,” in ICIP, 2018, pp. 1073–1077

work page 2018

[64] [64]

Knowledge distillation for human action anticipation,

V . Tran, Y. Wang, Z. Zhang, and M. Hoai, “Knowledge distillation for human action anticipation,” in ICIP, 2021, pp. 2518–2522

work page 2021

[65] [65]

Anticipating human actions by cor- relating past with the future with Jaccard similarity measures,

B. Fernando and S. Herath, “Anticipating human actions by cor- relating past with the future with Jaccard similarity measures,” in CVPR, 2021

work page 2021

[66] [66]

The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities,

H. Kuehne, A. Arslan, and T. Serre, “The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities,” in CVPR, 2014. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 18

work page 2014

[67] [67]

Anticipative Video Transformer,

R. Girdhar and K. Grauman, “Anticipative Video Transformer,” in ICCV, 2021

work page 2021

[68] [68]

Rescaling egocentric vision: Collection, pipeline and challenges for epic- kitchens-100,

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Priceet al., “Rescaling egocentric vision: Collection, pipeline and challenges for epic- kitchens-100,” IJCV, pp. 1–23, 2022

work page 2022

[69] [69]

In the eye of beholder: Joint learning of gaze and actions in first person video,

Y. Li, M. Liu, and J. M. Rehg, “In the eye of beholder: Joint learning of gaze and actions in first person video,” in ECCV, 2018, pp. 619–635

work page 2018

[70] [70]

Combining embedded accelerom- eters with computer vision for recognizing food preparation activities,

S. Stein and S. J. McKenna, “Combining embedded accelerom- eters with computer vision for recognizing food preparation activities,” in UbiComp, 2013, pp. 729–738

work page 2013

[71] [71]

Learning To Anticipate Future With Dynamic Context Removal,

X. Xu, Y.-L. Li, and C. Lu, “Learning To Anticipate Future With Dynamic Context Removal,” in CVPR, 2022

work page 2022

[72] [72]

Tsm: Temporal shift module for efficient video understanding,

J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in ICCV, 2019, pp. 7083–7093

work page 2019

[73] [73]

Latency mat- ters: Real-time action forecasting transformer,

H. Girase, N. Agarwal, C. Choi, and K. Mangalam, “Latency mat- ters: Real-time action forecasting transformer,” in CVPR, 2023, pp. 18 759–18 769

work page 2023

[74] [74]

Fore- casting future action sequences with neural memory networks,

H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Fore- casting future action sequences with neural memory networks,” in BMVC, 2019

work page 2019

[75] [75]

Temporal recurrent networks for online action detection,

M. Xu, M. Gao, Y.-T. Chen, L. S. Davis, and D. J. Crandall, “Temporal recurrent networks for online action detection,” in ICCV, 2019, pp. 5532–5541

work page 2019

[76] [76]

TTPP: Tempo- ral Transformer with Progressive Prediction for Efficient Action Anticipation,

W. Wang, X. Peng, Y. Su, Y. Qiao, and J. Cheng, “TTPP: Tempo- ral Transformer with Progressive Prediction for Efficient Action Anticipation,” Neurocomputing, 2020

work page 2020

[77] [77]

Lap-net: Adaptive features sampling via learning action progression for online action detection,

S. Qu, G. Chen, D. Xu, J. Dong, F. Lu, and A. Knoll, “Lap-net: Adaptive features sampling via learning action progression for online action detection,” arXiv preprint arXiv:2011.07915, 2020

work page arXiv 2011

[78] [78]

Learning to Anticipate Egocentric Actions by Imagination,

Y. Wu, L. Zhu, X. Wang, Y. Yang, and F. Wu, “Learning to Anticipate Egocentric Actions by Imagination,” TIP, 2021

work page 2021

[79] [79]

Long short-term transformer for online action detection,

M. Xu, Y. Xiong, H. Chen, X. Li, W. Xia, Z. Tu, and S. Soatto, “Long short-term transformer for online action detection,” in NeurIPS, vol. 34, 2021, pp. 1086–1099

work page 2021

[80] [80]

Oadtr: Online action detection with transformers,

X. Wang, S. Zhang, Z. Qing, Y. Shao, Z. Zuo, C. Gao, and N. Sang, “Oadtr: Online action detection with transformers,” in ICCV, 2021

work page 2021