pith. sign in

arxiv: 1807.00230 · v2 · pith:HWC2FEQDnew · submitted 2018-06-30 · 💻 cs.CV

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

classification 💻 cs.CV
keywords audiomodelslearningself-supervisedsynchronizationvideoaccuracyaction
0
0 comments X
read the original abstract

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. USV: Towards Understanding the User-generated Short-form Videos

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.