arxiv: 1804.09235 · v2 · submitted 2018-04-24 · 💻 cs.CV

Recognition: unknown

On the effectiveness of task granularity for transfer learning

classification 💻 cs.CV
keywords: granularity, learning, task, transfer, captioning, classification, features, fine-grained

We describe a DNN for video classification and captioning, trained end-to-end with shared features, to solve tasks at different levels of granularity, and we explore the link between granularity in a source task and the quality of learned features for transfer learning. To solve a new task domain in transfer learning, we freeze the trained encoder and fine-tune a neural net on the target domain. We train on the Something-Something dataset, which has over 220,000 videos, with multiple levels of target granularity: 50 action groups, 174 fine-grained action categories, and captions. Classification and captioning with Something-Something are challenging because of the subtle differences between actions, applied to thousands of different object classes, and the diversity of captions penned by crowd actors. Our model performs better than existing classification baselines for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task. Experiments reveal that training with more fine-grained tasks tends to produce better features for transfer learning.
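The freeze-then-fine-tune step mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's architecture or code: the "encoder" here is a fixed random projection standing in for a pretrained video encoder, the target task is a synthetic binary label, and all names (`make_frozen_encoder`, `train_head`, `mean_log_loss`) are hypothetical. The only point it shows is that the encoder weights stay untouched while a small head is trained on its features.

```python
import math
import random


def make_frozen_encoder(in_dim, feat_dim, seed=0):
    """Hypothetical stand-in for a pretrained encoder: fixed projection + tanh.
    The weights are created once and never updated afterwards."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)] for _ in range(feat_dim)]

    def encode(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return encode, weights


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_log_loss(w, b, features, labels):
    """Average binary cross-entropy of the head on precomputed features."""
    total = 0.0
    for f, y in zip(features, labels):
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(features)


def train_head(features, labels, epochs=200, lr=0.5):
    """Train only a logistic-regression head with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic target task (an assumption for illustration): label is the sign of x[0].
data_rng = random.Random(1)
inputs = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(200)]
labels = [1 if x[0] > 0 else 0 for x in inputs]

encode, encoder_weights = make_frozen_encoder(in_dim=4, feat_dim=8)
weights_before = [row[:] for row in encoder_weights]

features = [encode(x) for x in inputs]          # encoder used for inference only
head_w, head_b = train_head(features, labels)   # only the head is trained
final_loss = mean_log_loss(head_w, head_b, features, labels)
```

Because the features are computed once and the encoder is never passed to the training routine, the quality of the head depends entirely on what the frozen features already encode, which is the property the abstract's transfer experiments probe.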

This paper has not been read by Pith yet.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Exploring High-Order Self-Similarity for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

  2. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    cs.CV 2023-07 unverdicted novelty 6.0

    InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.