pith. the verified trust layer for science.

arxiv: 1503.01558 · v3 · pith:5WVMJSZ6 · submitted 2015-03-05 · 💻 cs.CL · cs.CV · cs.IR

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

classification 💻 cs.CL cs.CV cs.IR
keywords automatically, cooking, instructions, recipe, speech, technique, video, align
Add this Pith Number to your LaTeX paper:
\usepackage{pith}
\pithnumber{5WVMJSZ6}

Prints a linked pith:5WVMJSZ6 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files.

Abstract

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
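
To make the HMM alignment idea concrete, here is a minimal sketch of monotonic Viterbi alignment between recipe steps (hidden states) and transcript segments (observations). It is an illustration under stated assumptions, not the paper's model: the word-overlap emission score, the stay-or-advance transition structure, the `p_stay` parameter, and the names `tokenize`, `emission_logprob`, and `align` are all hypothetical choices made for this example.

```python
import math


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    words.discard("")
    return words


def emission_logprob(step_words, segment_words, smoothing=0.1):
    """Toy emission score (an assumption): smoothed fraction of the
    segment's words that also appear in the recipe step."""
    overlap = len(step_words & segment_words)
    return math.log((overlap + smoothing) / (len(segment_words) + smoothing))


def align(steps, segments, p_stay=0.5):
    """Return one recipe-step index per transcript segment.

    Assumed HMM structure: the path starts in step 0 and, at each
    segment, either stays in the current step or advances to the next.
    """
    if not steps or not segments:
        return []
    step_words = [tokenize(s) for s in steps]
    seg_words = [tokenize(s) for s in segments]
    n_steps, n_segs = len(steps), len(segments)
    neg_inf = float("-inf")

    # score[t][k]: best log-probability of a path ending in step k at segment t
    score = [[neg_inf] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]
    score[0][0] = emission_logprob(step_words[0], seg_words[0])

    for t in range(1, n_segs):
        for k in range(n_steps):
            stay = score[t - 1][k] + math.log(p_stay)
            advance = neg_inf
            if k > 0:
                advance = score[t - 1][k - 1] + math.log(1.0 - p_stay)
            if advance > stay:
                best, back[t][k] = advance, k - 1
            else:
                best, back[t][k] = stay, k
            if best > neg_inf:
                score[t][k] = best + emission_logprob(step_words[k], seg_words[t])

    # Backtrack from the best-scoring final state.
    k = max(range(n_steps), key=lambda j: score[n_segs - 1][j])
    path = [k]
    for t in range(n_segs - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


if __name__ == "__main__":
    steps = ["Chop the onions", "Fry the onions in oil", "Add tomatoes and simmer"]
    segments = [
        "first we chop the onions",
        "keep going, chop the onions finely",
        "now fry them in hot oil",
        "add the tomatoes",
        "let it simmer",
    ]
    print(align(steps, segments))
```

The abstract also describes a second stage that refines this alignment with a visual food detector; that stage is not sketched here.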

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REMAP: Regularized Matching and Partial Alignment of Video Embeddings

    cs.CV 2025-09 unverdicted novelty 7.0

    REMAP applies regularized fused partial Gromov-Wasserstein optimal transport to align video embeddings for unsupervised procedure learning on noisy instructional videos.