AV1 Motion Vector Fidelity and Application for Efficient Optical Flow

Anil Kokaram; Julien Zouein; Vibhoothi Vibhoothi

arxiv: 2510.17427 · v1 · pith:7U6ZKY7Lnew · submitted 2025-10-20 · 📡 eess.IV · cs.MM

Pith reviewed 2026-05-18 06:18 UTC · model grok-4.3

classification 📡 eess.IV cs.MM

keywords: AV1, motion vectors, optical flow, RAFT, warm-start, video compression, computer vision

The pith

Motion vectors extracted from AV1 video can accelerate optical flow estimation fourfold when used to warm-start deep learning methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the quality of motion vectors obtained from AV1 compressed videos by comparing them to ground-truth optical flow. It shows that these vectors are close enough in fidelity to be useful, particularly when the encoder is configured appropriately. The main advance is demonstrating that feeding these vectors into the RAFT network as an initial guess cuts the number of iterations needed for convergence. This leads to faster processing in motion estimation tasks without much sacrifice in final accuracy.

Core claim

Motion vectors from AV1 video codec can serve as a high-quality and computationally efficient substitute for traditional optical flow. Using these extracted AV1 motion vectors as a warm-start for RAFT significantly reduces the time to convergence while achieving comparable accuracy. Specifically, we observe a four-fold speedup in computation time with only a minor trade-off in end-point error.
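To make the warm-start idea concrete, here is a minimal sketch of how block motion vectors might be turned into an initial flow field. It is not the authors' pipeline. The `(x, y, w, h, dx, dy)` block tuples and both function names are hypothetical, and the sketch assumes the estimator refines flow on a grid at 1/8 of the input resolution, so the field is pooled and the displacements are rescaled by the same factor. The sign convention of codec motion vectors relative to optical flow is also left as an assumption that a real extractor would have to check.

```python
import numpy as np

def block_mvs_to_dense_flow(blocks, height, width):
    """Rasterise per-block motion vectors into a dense (H, W, 2) flow field.

    `blocks` is a hypothetical list of (x, y, w, h, dx, dy) tuples in pixels;
    a real AV1 decoder exposes motion vectors in its own format.
    """
    flow = np.zeros((height, width, 2), dtype=np.float32)
    for x, y, w, h, dx, dy in blocks:
        flow[y:y + h, x:x + w, 0] = dx  # horizontal displacement
        flow[y:y + h, x:x + w, 1] = dy  # vertical displacement
    return flow

def downsample_for_warm_start(flow, factor=8):
    """Average-pool the flow to 1/factor resolution and rescale magnitudes."""
    h, w, _ = flow.shape
    h_c, w_c = h // factor, w // factor
    cropped = flow[:h_c * factor, :w_c * factor]
    pooled = cropped.reshape(h_c, factor, w_c, factor, 2).mean(axis=(1, 3))
    # Displacements are measured in coarse-grid pixels after pooling.
    return pooled / factor

# Two 16x16 blocks: one moving right by 8 px, one moving up by 4 px.
blocks = [(0, 0, 16, 16, 8.0, 0.0), (16, 0, 16, 16, 0.0, -4.0)]
dense = block_mvs_to_dense_flow(blocks, height=16, width=32)
init = downsample_for_warm_start(dense)
print(init.shape)  # (2, 4, 2)
```

The resulting `init` array would then be handed to the flow network as its starting estimate in place of an all-zero field.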

What carries the argument

AV1 motion vectors as warm-start for the RAFT deep learning optical flow method

If this is right

AV1 motion vectors offer high fidelity to ground-truth optical flow with optimal encoder settings.
Recommendations for encoder settings improve motion vector quality for vision applications.
The warm-start approach achieves four-fold speedup in RAFT convergence.
This method supports a wide range of motion-aware computer vision applications by reusing compressed video data.
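The paper passage quoted further down this page names the `cpu-used` parameter of the libaom-av1 encoder as the speed/quality control that influences motion vector quality. A sweep over that setting could be scripted roughly as below. This is a sketch only: the helper name, output file names, and the chosen preset values are hypothetical, and the commands are built but not executed. It assumes an ffmpeg build with libaom-av1 support, which exposes the setting as `-cpu-used`.

```python
def libaom_encode_commands(src, cpu_used_values=(0, 2, 4, 6)):
    """Build one ffmpeg command per cpu-used preset (commands are not run)."""
    commands = []
    for preset in cpu_used_values:
        out = f"out_cpu{preset}.mkv"  # hypothetical output name
        commands.append([
            "ffmpeg", "-y", "-i", src,
            "-c:v", "libaom-av1",
            "-cpu-used", str(preset),  # lower is slower, higher is faster
            out,
        ])
    return commands

commands = libaom_encode_commands("input.y4m")
for cmd in commands:
    print(" ".join(cmd))
```

Each encoded file would then go through motion vector extraction and be scored against ground-truth flow, so that presets can be compared on fidelity rather than on bitrate alone.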

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This method could be applied to other deep learning optical flow estimators for similar efficiency gains.
In streaming video scenarios, decoder-exposed motion vectors could enable on-the-fly acceleration without extra computation.
Extending the analysis to HEVC and other codecs might reveal comparative advantages in different compression standards.

Load-bearing premise

The ground-truth optical flow used for fidelity comparisons is accurate and representative, and the chosen video datasets and encoder configurations generalize to typical computer vision workloads.

What would settle it

A test on a new set of videos where ground-truth optical flow is computed independently, checking if the four-fold speedup and minor error trade-off still hold.

Figures

Figures reproduced from arXiv: 2510.17427 by Anil Kokaram, Julien Zouein, Vibhoothi Vibhoothi.

**Figure 1.** Visualisation of the processing of the extracted motion. Left to Right: RGB frame, Motion Field extracted from AV1, …

**Figure 2.** Visualisation of the elements from Spring Dataset. Left to Right: Frame extracted from Sequence 38, …

**Figure 4.** Impact of using AV1 Motion Field as warm-start for …

Original abstract

This paper presents a comprehensive analysis of motion vectors extracted from AV1-encoded video streams and their application in accelerating optical flow estimation. We demonstrate that motion vectors from AV1 video codec can serve as a high-quality and computationally efficient substitute for traditional optical flow, a critical but often resource-intensive component in many computer vision pipelines. Our primary contributions are twofold. First, we provide a detailed comparison of motion vectors from both AV1 and HEVC against ground-truth optical flow, establishing their fidelity. In particular, we show the impact of encoder settings on motion estimation fidelity and make recommendations about the optimal settings. Second, we show that using these extracted AV1 motion vectors as a "warm-start" for a state-of-the-art deep learning-based optical flow method, RAFT, significantly reduces the time to convergence while achieving comparable accuracy. Specifically, we observe a four-fold speedup in computation time with only a minor trade-off in end-point error. These findings underscore the potential of reusing motion vectors from compressed video as a practical and efficient method for a wide range of motion-aware computer vision applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that motion vectors extracted from AV1-encoded video streams exhibit high fidelity to ground-truth optical flow (with analysis of encoder settings' impact), and that initializing the RAFT deep optical flow method with these AV1 motion vectors as a warm-start yields a four-fold speedup in computation time while incurring only a minor increase in end-point error.

Significance. If the empirical results hold under rigorous controls, this could provide a practical contribution to efficient computer vision by enabling reuse of readily available motion data from compressed video codecs, reducing the computational burden of optical flow in downstream applications.

major comments (2)

§4 (RAFT warm-start results): The four-fold speedup claim relies on aggregate computation times, but the manuscript does not report iteration counts, convergence thresholds, or error-vs-iteration plots comparing AV1-initialized RAFT to standard initialization. This makes it difficult to attribute the speedup specifically to motion vector quality rather than other factors.
§3 (fidelity comparisons): The analysis of AV1/HEVC motion vector fidelity to ground-truth optical flow lacks explicit details on the video datasets used, the precise endpoint-error metric definition, statistical significance tests, and controls for content bias, which are load-bearing for the recommendations on optimal encoder settings and the overall fidelity claims.

minor comments (1)

Abstract: The phrase 'minor trade-off in end-point error' is not quantified with specific EPE values or direct comparisons, reducing clarity for readers.
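For readers unfamiliar with the metric, end-point error is conventionally the Euclidean distance between the estimated and ground-truth flow vectors at each pixel, averaged over the valid pixels. The sketch below shows that standard definition. The function name and the optional `valid` mask are illustrative choices, and this may differ in detail from how the paper computes it.

```python
import numpy as np

def endpoint_error(flow_est, flow_gt, valid=None):
    """Mean Euclidean distance between estimated and ground-truth flow.

    Both inputs have shape (H, W, 2); `valid` is an optional boolean
    (H, W) mask selecting pixels that have ground truth.
    """
    err = np.linalg.norm(flow_est - flow_gt, axis=-1)
    if valid is not None:
        err = err[valid]
    return float(err.mean())

# Ground truth moves 3 px right; the estimate moves 4 px down instead.
gt = np.zeros((2, 2, 2))
gt[..., 0] = 3.0
est = np.zeros((2, 2, 2))
est[..., 1] = 4.0
epe = endpoint_error(est, gt)
print(epe)  # 5.0
```

Reporting this number for both the standard and the warm-started runs is what would turn "minor trade-off" into a quantified statement.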

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.

Referee: §4 (RAFT warm-start results): The four-fold speedup claim relies on aggregate computation times, but the manuscript does not report iteration counts, convergence thresholds, or error-vs-iteration plots comparing AV1-initialized RAFT to standard initialization. This makes it difficult to attribute the speedup specifically to motion vector quality rather than other factors.

Authors: We appreciate this observation. The current manuscript reports measured wall-clock computation times demonstrating the four-fold speedup to convergence, but we agree that additional diagnostics would more directly link the improvement to the quality of the AV1 motion-vector initialization. In the revised version we will add: (i) the exact iteration counts required to reach convergence for both initializations, (ii) the convergence threshold employed (change in flow field below 0.01 pixels), and (iii) error-versus-iteration curves on representative sequences. These additions will clarify that the AV1 warm-start produces a steeper error reduction per iteration.

Revision: yes
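The convergence criterion mentioned in this response (change in flow field below 0.01 pixels) can be illustrated with a toy iteration counter. The update rule here is a stand-in that moves a fixed fraction of the way towards a known target. It is not RAFT's learned recurrent update, and the function name, `step`, and the test fields are all hypothetical. The only point is that a start closer to the answer crosses the same threshold in fewer updates.

```python
import numpy as np

def iterations_to_converge(flow_init, flow_target, step=0.5, tol=0.01, max_iters=100):
    """Count updates until the mean per-pixel flow change drops below `tol`.

    Toy refinement: each update moves `step` of the remaining distance
    towards `flow_target`. This is NOT RAFT's update operator.
    """
    flow = flow_init.astype(np.float64)
    for i in range(1, max_iters + 1):
        update = step * (flow_target - flow)
        flow = flow + update
        if np.linalg.norm(update, axis=-1).mean() < tol:
            return i
    return max_iters

target = np.full((4, 4, 2), 8.0)  # true motion: 8 px in x and y
cold = np.zeros_like(target)      # standard zero initialisation
warm = target + 0.5               # start that is already near the answer

cold_iters = iterations_to_converge(cold, target)
warm_iters = iterations_to_converge(warm, target)
print(cold_iters, warm_iters)  # 11 7
```

Logging the same count for a real estimator, together with the error after each update, would give the iteration counts and error-versus-iteration curves the referee asks for.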
Referee: §3 (fidelity comparisons): The analysis of AV1/HEVC motion vector fidelity to ground-truth optical flow lacks explicit details on the video datasets used, the precise endpoint-error metric definition, statistical significance tests, and controls for content bias, which are load-bearing for the recommendations on optimal encoder settings and the overall fidelity claims.

Authors: We thank the referee for highlighting these points. The manuscript already employs standard benchmarks (Sintel and KITTI) and reports average endpoint error (EPE), but we acknowledge that the presentation can be made more explicit. In the revision we will: state the precise datasets and sequence counts, give the mathematical definition of EPE, include paired statistical significance tests across sequences, and add content-bias controls by reporting EPE stratified by motion magnitude and scene type. These changes will better support the encoder-setting recommendations.

Revision: yes
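Stratifying EPE by motion magnitude, as proposed in this response, could look roughly like the sketch below. The bin edges (0 to 10, 10 to 40, and above 40 pixels) and the function name are illustrative assumptions, not values taken from the paper. Pixels are grouped by the magnitude of the ground-truth flow, and the mean error is reported per group.

```python
import numpy as np

def epe_by_motion_magnitude(flow_est, flow_gt, edges=(0.0, 10.0, 40.0, np.inf)):
    """Mean end-point error per ground-truth speed bin.

    Returns a dict mapping (low, high) bin edges in pixels to mean EPE;
    empty bins map to NaN.
    """
    err = np.linalg.norm(flow_est - flow_gt, axis=-1).ravel()
    mag = np.linalg.norm(flow_gt, axis=-1).ravel()
    result = {}
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (mag >= low) & (mag < high)
        result[(low, high)] = float(err[mask].mean()) if mask.any() else float("nan")
    return result

# Three pixels with slow, medium, and fast ground-truth motion.
gt = np.array([[[5.0, 0.0], [20.0, 0.0], [50.0, 0.0]]])
est = gt + np.array([[[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]])
stats = epe_by_motion_magnitude(est, gt)
for (low, high), value in stats.items():
    print(f"{low}-{high} px: EPE {value:.2f}")
```

A breakdown like this shows whether block-based motion vectors degrade mainly on fast motion, which an overall average would hide.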

Circularity Check

0 steps flagged

No circularity: purely empirical comparison to external benchmarks

The manuscript performs direct fidelity measurements of AV1/HEVC motion vectors against independent ground-truth optical flow on chosen datasets, followed by timing experiments that initialize RAFT with those vectors and record wall-clock convergence. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the speedup claim rests on observed runtimes rather than any self-referential construction. Self-citations, if present, are not load-bearing for the central empirical result, which remains falsifiable against external optical-flow ground truth and standard RAFT implementations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical engineering study that relies on existing AV1/HEVC codecs and the RAFT network; no new free parameters, mathematical axioms, or invented physical entities are introduced.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "using these extracted AV1 motion vectors as a warm-start for a state-of-the-art deep learning-based optical flow method, RAFT, significantly reduces the time to convergence while achieving comparable accuracy. Specifically, we observe a four-fold speedup"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "The quality of the initial motion vectors is heavily influenced by the encoder configuration. The cpu-used parameter in the libaom-av1 encoder controls a trade-off between encoding speed and quality"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] V. Kantorov and I. Laptev, “Efficient feature extraction, encoding, and classification for action recognition,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2593–2600.

[2] M. Ehrlich, J. Barker, N. Padmanabhan, L. Davis, A. Tao, B. Catanzaro, and A. Shrivastava, “Leveraging bitstream metadata for fast, accurate, generalized compressed video quality enhancement,” in 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 1506–1516.

[3] D. J. Ringis, D. Singh, F. Pitie, and A. Kokaram, “Using modern motion estimation algorithms in existing video codecs,” in Applications of Digital Image Processing XLI, vol. 10752. SPIE, 2018, pp. 288–295.

[4] D. Rüfenacht and D. Taubman, “HEVC-EPIC: Fast optical flow estimation from coded video via edge-preserving interpolation,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3100–3113, 2018.

[5] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “EpicFlow: Edge-preserving interpolation of correspondences for optical flow,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1164–1172.

[6] S. I. Young, B. Girod, and D. Taubman, “Fast optical flow extraction from compressed video,” IEEE Transactions on Image Processing, vol. 29, pp. 6409–6421, 2020.

[7] Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” 2020. [Online]. Available: https://arxiv.org/abs/2003.12039

[8] S. Zhou, X. Jiang, W. Tan, R. He, and B. Yan, “MVFlow: Deep optical flow estimation of compressed videos with motion vector prior,” 2023. [Online]. Available: https://arxiv.org/abs/2308.01568

[9] Z. Liu, D. Mukherjee, W.-T. Lin, P. Wilkins, J. Han, and Y. Xu, “Adaptive multi-reference prediction using a symmetric framework,” Electronic Imaging, vol. 2017, no. 2, 2017.

[10] X. Zhao, S. Liu, A. Grange, and A. Norkin, “Tool description for AV1 and libaom,” Alliance for Open Media, Codec Working Group, Document: CWG-B078o, 2021.

[11] D. Rüfenacht and D. Taubman, “Leveraging decoded HEVC motion for fast, high quality optical flow estimation,” in 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), 2017, pp. 1–6.

[12] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015. [Online]. Available: http://lmb.informatik.uni-freiburg.de/Publications/2015/DFIB15

[13] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conf. on Computer Vision (ECCV), ser. Part IV, LNCS 7577, A. Fitzgibbon et al. (Eds.). Springer-Verlag, Oct. 2012, pp. 611–625.

[14] D. Kondermann, R. Nair, K. Honauer, K. Krispin, J. Andrulis, A. Brock, B. Gussefeld, M. Rahimimoghaddam, S. Hofmann, C. Brenner et al., “The HCI benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 19–28.

[15] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[16] L. Mehl, J. Schmalfuss, A. Jahedi, Y. Nalivayko, and A. Bruhn, “Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[17] J. Y. Lin, T.-J. Liu, E. C.-H. Wu, and C.-C. J. Kuo, “A fusion-based video quality assessment (FVQA) index,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014.
