Spatio-temporal video autoencoder with differentiable memory

Ankur Handa; Roberto Cipolla; Viorica Patraucean

arxiv: 1511.06309 · v5 · pith:UEU2YREFnew · submitted 2015-11-19 · 💻 cs.LG · cs.CV

keywords: frame, autoencoder, memory, differentiable, flow, next, optical, temporal

We describe a new spatio-temporal video autoencoder, based on a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is represented by a differentiable visual memory composed of convolutional long short-term memory (LSTM) cells that integrate changes over time. Here we target motion changes and use as temporal decoder a robust optical flow prediction module together with an image sampler serving as built-in feedback loop. The architecture is end-to-end differentiable. At each time step, the system receives as input a video frame, predicts the optical flow based on the current observation and the LSTM memory state as a dense transformation map, and applies it to the current frame to generate the next frame. By minimising the reconstruction error between the predicted next frame and the corresponding ground truth next frame, we train the whole system to extract features useful for motion estimation without any supervision effort. We present one direct application of the proposed framework in weakly-supervised semantic segmentation of videos through label propagation using optical flow.
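The prediction step described in the abstract (apply a dense flow map to the current frame, then compare the result with the true next frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `warp_frame` and `reconstruction_error` are hypothetical, the backward-sampling convention with bilinear interpolation and border clamping is an assumption, and plain mean squared error stands in for whatever reconstruction loss the paper actually uses. Frames are grayscale nested lists to keep the example dependency-free.

```python
import math


def warp_frame(frame, flow):
    """Backward-warp a grayscale frame with a dense flow map.

    frame: H x W nested list of numbers.
    flow:  H x W nested list of (dx, dy) pairs. Output pixel (y, x) is
           sampled from frame at (y + dy, x + dx) with bilinear
           interpolation. This convention is an assumption of the sketch.
    Sample coordinates are clamped to the image border.
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling location to the valid image area.
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation over the four neighbouring pixels.
            top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
            bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
            out[y][x] = (1 - wy) * top + wy * bottom
    return out


def reconstruction_error(predicted, target):
    """Mean squared error between two frames (assumed loss, see lead-in)."""
    h, w = len(target), len(target[0])
    total = sum(
        (predicted[y][x] - target[y][x]) ** 2
        for y in range(h)
        for x in range(w)
    )
    return total / (h * w)


# Toy usage: a flow of (1, 0) everywhere samples each pixel from its
# right-hand neighbour, so the image content shifts one pixel to the left.
frame = [[0, 1, 2, 3], [4, 5, 6, 7]]
flow = [[(1.0, 0.0)] * 4 for _ in range(2)]
predicted = warp_frame(frame, flow)
```

Because bilinear sampling is piecewise differentiable in the flow values, a reconstruction error computed this way can pass gradients back to the flow predictor, which is what makes unsupervised end-to-end training of the kind the abstract describes possible.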

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Order Matters: Shuffling Sequence Generation for Video Prediction
cs.CV 2019-07 unverdicted novelty 6.0

SEE-Net improves video prediction by using frame shuffling to enforce learning of natural temporal order, reporting state-of-the-art results on three synthetic and real-world datasets.