HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic · 2019

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

VDCook:DIY video data cook your MLLMs

cs.LG · 2026-03-04 · unverdicted · novelty 5.0

VDCook is an automated, self-evolving platform for generating in-domain video datasets for MLLMs via natural language queries, retrieval-synthesis, and multi-dimensional metadata.

citing papers explorer

Showing 2 of 2 citing papers.

Bernini: Latent Semantic Planning for Video Diffusion
cs.CV · 2026-05-21 · unverdicted · none · ref 52

VDCook:DIY video data cook your MLLMs
cs.LG · 2026-03-04 · unverdicted · none · ref 15
