Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira, Andrew Zisserman · 2017

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Phenaki: Variable Length Video Generation From Open Domain Textual Description

cs.CV · 2022-10-05 · unverdicted · novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

cs.CV · 2024-07-11 · unverdicted · novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · novelty 6.0

InternVid supplies 7M videos with LLM-generated captions to train ViCLIP, which achieves leading zero-shot action recognition results and competitive retrieval performance.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

cs.CV · 2022-12-06 · unverdicted · novelty 5.0

InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.

citing papers explorer

Showing 4 of 4 citing papers.

Phenaki: Variable Length Video Generation From Open Domain Textual Description

cs.CV · 2022-10-05 · unverdicted · none · ref 6

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

cs.CV · 2024-07-11 · unverdicted · none · ref 14

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · none · ref 53

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

cs.CV · 2022-12-06 · unverdicted · none · ref 27
