arxiv: 2312.15868 · v3 · submitted 2023-12-26 · 💻 cs.CV

Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM

Pith reviewed 2026-05-24 05:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords video frame interpolation · region-distinguishable priors · SAM · motion estimation · feature normalization · hierarchical fusion · segmentation priors

The pith

Region-Distinguishable Priors from SAM2 make matched regions have similar features in VFI encoders, improving intermediate frame synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing video frame interpolation methods struggle with ambiguity when estimating motion between neighboring frames because corresponding areas are hard to identify reliably. The paper extracts Region-Distinguishable Priors from SAM2 segmentations, expressed as spatial-varying Gaussian mixtures that mark an arbitrary number of regions in a single format. These priors are injected into the encoder of motion-based VFI networks at multiple levels through the Hierarchical Region-aware Feature Fusion Module, which applies RDP-guided Feature Normalization inside residual blocks. The result is that features of matched regions become more alike across frames, which directly aids the generation of the missing intermediate frame. Tests show the module can be added to existing methods and raises performance on diverse scenes.
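
The idea of turning a segmentation into a Gaussian-mixture prior can be illustrated with a minimal sketch. The construction below is an assumption made for illustration, not the paper's formula: each region of an integer label map gets one 2D Gaussian fitted to its pixel coordinates, and each pixel is scored by its own region's Gaussian. The function name `gaussian_mixture_prior` and the `eps` regularizer are hypothetical.

```python
import numpy as np


def gaussian_mixture_prior(labels: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map an integer label map (H, W) to a single-channel prior map (H, W).

    Hypothetical construction: one 2D Gaussian per region, fitted to the
    region's pixel coordinates. Any number of regions ends up in one map.
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys, xs], axis=-1).astype(np.float64)  # (H, W, 2)
    prior = np.zeros((h, w), dtype=np.float64)

    for region_id in np.unique(labels):
        mask = labels == region_id
        pts = coords[mask]                       # (N, 2) pixel coordinates
        centered = pts - pts.mean(axis=0)
        # Regularize so thin or single-pixel regions stay invertible.
        cov = centered.T @ centered / len(pts) + eps * np.eye(2)
        inv_cov = np.linalg.inv(cov)
        # Squared Mahalanobis distance of each pixel to its region centroid.
        d2 = np.einsum("ni,ij,nj->n", centered, inv_cov, centered)
        prior[mask] = np.exp(-0.5 * d2)          # values in (0, 1]
    return prior


# Two regions: left half labelled 0, right half labelled 1.
labels = np.zeros((4, 6), dtype=int)
labels[:, 3:] = 1
prior = gaussian_mixture_prior(labels)
print(prior.shape)
```

Under this sketch the output has the same shape no matter how many regions the segmenter returns, which is the property the summary attributes to the unified representation.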

Core claim

The paper establishes that integrating Region-Distinguishable Priors derived from SAM2 via the HRFFM module causes the features in the VFI encoder to exhibit similar representations for matched regions in neighboring frames, thereby improving the synthesis of intermediate frames.

What carries the argument

Hierarchical Region-aware Feature Fusion Module (HRFFM) that folds RDP-guided Feature Normalization (RDPFN) into residual connections at successive stages of the VFI encoder.
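
A prior-guided normalization inside a residual branch can be sketched as follows. This is not the paper's RDPFN definition, which the text here does not give; it swaps in a SPADE-style modulation as a stand-in. The per-channel weights `w_gamma` and `w_beta` are hypothetical placeholders for what would be learned layers in a real network, and the function name `rdp_guided_norm` is likewise an assumption.

```python
import numpy as np


def rdp_guided_norm(feat: np.ndarray, prior: np.ndarray,
                    w_gamma: np.ndarray, w_beta: np.ndarray,
                    eps: float = 1e-5) -> np.ndarray:
    """SPADE-style stand-in for prior-guided normalization with a residual.

    feat:    (C, H, W) encoder features at one stage
    prior:   (H, W) region prior resized to this stage's resolution
    w_gamma: (C,) hypothetical per-channel scale weights
    w_beta:  (C,) hypothetical per-channel shift weights
    """
    # Normalize each channel over its spatial extent.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mean) / (std + eps)

    # Spatially varying scale and shift derived from the prior.
    gamma = w_gamma[:, None, None] * prior[None]   # (C, H, W)
    beta = w_beta[:, None, None] * prior[None]     # (C, H, W)

    # Residual form: with zero weights the branch contributes nothing.
    return feat + (normed * gamma + beta)


rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 4, 5))
prior = rng.uniform(size=(4, 5))
out = rdp_guided_norm(feat, prior, np.ones(3), np.ones(3))
print(out.shape)
```

The residual form is a design choice of this sketch: when the modulation weights are zero the output equals the input features, so the base encoder's behavior is unchanged until the branch learns something.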

If this is right

  • The same RDP priors and fusion module can be inserted into any existing motion-based VFI pipeline as a plug-and-play component.
  • Matched regions across frames acquire closer feature representations inside the encoder.
  • Synthesis quality of the interpolated frame rises consistently on varied scene types.
  • An arbitrary number of regions can be handled under one modality because the priors use spatial-varying Gaussian mixtures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If segmentation models become more temporally stable, the same fusion design could yield further gains without architectural changes.
  • The region cues might transfer to related video tasks such as deblurring or super-resolution where motion ambiguity also appears.
  • Performance on fast-motion or heavy-occlusion sequences would test how well the priors remain reliable between frames.
  • Explicit region distinctions could lessen dependence on purely learned motion estimators in future VFI systems.

Load-bearing premise

Open-world segmentation models such as SAM2 produce region distinctions that are accurate and stable across neighboring frames so they can resolve motion estimation ambiguity.

What would settle it

An ablation in which adding the RDP priors and HRFFM produces no measurable increase in feature similarity between matched regions or in final interpolation metrics relative to the unmodified baseline.
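
The two quantities named in that ablation can be measured with a short sketch. It assumes region ids are already matched across the two frames (for example by a video segmenter's tracking), uses cosine similarity of region-mean features as the similarity measure, and uses PSNR as the interpolation metric. The helper names are hypothetical and the similarity measure is one possible choice, not necessarily the paper's.

```python
import numpy as np


def region_mean_features(feat: np.ndarray, labels: np.ndarray) -> dict:
    """feat (C, H, W), labels (H, W) -> {region id: mean feature of shape (C,)}."""
    return {int(r): feat[:, labels == r].mean(axis=1) for r in np.unique(labels)}


def matched_region_similarity(feat0, labels0, feat1, labels1) -> float:
    """Mean cosine similarity over region ids present in both frames.

    Assumes equal ids denote the same region in both frames.
    """
    m0 = region_mean_features(feat0, labels0)
    m1 = region_mean_features(feat1, labels1)
    shared = sorted(set(m0) & set(m1))
    if not shared:
        raise ValueError("no region id is shared between the two frames")
    sims = []
    for r in shared:
        a, b = m0[r], m1[r]
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
        sims.append(float(a @ b / denom))
    return float(np.mean(sims))


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Comparing `matched_region_similarity` on encoder features with and without the priors, and `psnr` on the resulting interpolated frames, gives the two numbers whose difference from the baseline the ablation would report.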

Figures

Figures reproduced from arXiv: 2312.15868 by Jiafei Wu, Ming-Hsuan Yang, Xiaogang Xu, Yan Han, Yingqi Lin, Zhe Liu.

Figure 1: The first two columns: overlay inputs and the ground truth frame. Middle two columns: motion field (from first to …
Figure 2: The standard framework of motion-based VFI. It consists of three stages: extracting the image features from the encoder, …
Figure 3: The Overview of HRFFM, which first exploits RDPs to enhance image features via RDPFN (Eq. …
Figure 4: The Overview of RDPFN. It utilizes both RDP features …
Figure 5: Visual comparison on SNU-FILM [9]. Three rows, from top to bottom, represent the comparison results for VFIformer, UPR-Net, and M2M-PWC. The highlighted boxes indicate positions where our model demonstrates superior performance.
Table IV: Quantitative (PSNR/SSIM) comparisons between VFI baselines and VFIformer's implementation with our strategy (ours) on Vimeo90K [71]. The best result is boldfaced. When …
Figure 6: Comparisons between baselines and ours in terms …
Figure 8: Visual comparisons of ablation studies on Vimeo90K [ …
Figure 9: The results of the user study, which summarize …
Figure 10: Visual comparison of UPR-Net [21] and UPR-Net (ours) on SNU-FILM [10] whose data is degraded by the low-light.

… compared to all the baselines. While some participants opted for the "same" option, this is primarily attributed to the resolution of the testing images. Higher resolution tends to amplify differences, as observed in the results from the SNU-FILM dataset. This underscores that our method can enhance …
Original abstract

In existing restoration-oriented Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM2 (Segment Anything Model2) for frames, to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed play-and-plug Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI's encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI's encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that deriving Region-Distinguishable Priors (RDPs) from SAM2 as spatial-varying Gaussian mixtures for neighboring frames, then integrating them via the Hierarchical Region-aware Feature Fusion Module (HRFFM) with residual RDP-guided Feature Normalization (RDPFN) at multiple encoder stages, causes VFI encoder features to exhibit similar representations for matched regions. This is asserted to resolve motion estimation ambiguity and improve intermediate frame synthesis in restoration-oriented VFI, with the module presented as plug-and-play for existing methods. The abstract states that extensive experiments demonstrate consistent enhancement across scenes.

Significance. If the central claim holds and the stability of SAM2-derived RDPs across frames is verified, the work would provide a concrete mechanism for injecting open-world segmentation priors into motion-based VFI pipelines, potentially improving handling of ambiguous regions without retraining the base VFI model. The residual, hierarchical design of HRFFM is a practical strength that could enable easy adoption. However, the absence of reported quantitative gains, ablations, or stability metrics in the abstract limits the assessed impact.

major comments (2)
  1. [Abstract] Abstract: the claim that 'extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes' is unsupported by any quantitative results, baselines, ablation tables, or error analysis, which is load-bearing for the assertion that the method improves synthesis of intermediate frames.
  2. [Abstract] Abstract (paragraph describing utilization of SAM for RDPs and their representation as spatial-varying Gaussian mixtures): the central premise that RDPs produce distinctions stable enough for RDPFN to yield similar representations for matched regions is not accompanied by any measurement of frame-to-frame RDP consistency (e.g., label agreement after flow warping), directly risking the validity of the 'similar representations' claim if SAM2 outputs shift between adjacent frames.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'play-and-plug' is likely intended as 'plug-and-play'; this should be corrected for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the abstract and add supporting analysis as needed.

  1. Referee: [Abstract] Abstract: the claim that 'extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes' is unsupported by any quantitative results, baselines, ablation tables, or error analysis, which is load-bearing for the assertion that the method improves synthesis of intermediate frames.

    Authors: We agree that the abstract would be strengthened by including quantitative support. The full manuscript contains tables and figures with PSNR/SSIM comparisons, baseline results, and ablations demonstrating consistent gains. In revision we will update the abstract to highlight key quantitative metrics (e.g., average improvements on Vimeo90K and DAVIS) while keeping it concise.
    revision: yes

  2. Referee: [Abstract] Abstract (paragraph describing utilization of SAM for RDPs and their representation as spatial-varying Gaussian mixtures): the central premise that RDPs produce distinctions stable enough for RDPFN to yield similar representations for matched regions is not accompanied by any measurement of frame-to-frame RDP consistency (e.g., label agreement after flow warping), directly risking the validity of the 'similar representations' claim if SAM2 outputs shift between adjacent frames.

    Authors: SAM2 is designed for temporally coherent segmentation, and the spatial-varying Gaussian mixture representation is intended to provide a stable unified modality. We acknowledge that the current manuscript does not report an explicit frame-to-frame consistency metric such as label agreement after warping. We will add this analysis (e.g., consistency scores on adjacent frames) to the revised version or supplementary material to directly support the stability premise.
    revision: yes
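
The consistency check discussed in the second point, label agreement after flow warping, can be sketched as below. It is an illustration under stated assumptions, not an analysis from the paper: region ids are assumed to be consistent between the two frames, the flow is stored as `(dy, dx)` in the last axis (a convention chosen for this sketch), and sampling is nearest-neighbor. Both function names are hypothetical.

```python
import numpy as np


def warp_labels_nearest(labels: np.ndarray, flow: np.ndarray):
    """Sample labels at (y + dy, x + dx) with nearest-neighbor rounding.

    labels: (H, W) integer label map of the target frame
    flow:   (H, W, 2) displacement per pixel, stored as (dy, dx)
    Returns the sampled labels and a mask of pixels whose target is in bounds.
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.rint(ys + flow[..., 0]).astype(int)
    src_x = np.rint(xs + flow[..., 1]).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    warped = np.full_like(labels, -1)            # -1 marks out-of-bounds pixels
    warped[valid] = labels[src_y[valid], src_x[valid]]
    return warped, valid


def label_agreement(labels0: np.ndarray, labels1: np.ndarray,
                    flow_0to1: np.ndarray) -> float:
    """Fraction of in-bounds frame-0 pixels whose label matches frame 1
    at the location the flow points to."""
    warped1, valid = warp_labels_nearest(labels1, flow_0to1)
    if not valid.any():
        return 0.0
    return float((warped1[valid] == labels0[valid]).mean())


# A vertical stripe that moves one pixel to the right between frames.
labels0 = np.zeros((4, 6), dtype=int)
labels0[:, 1:3] = 1
labels1 = np.roll(labels0, 1, axis=1)
flow = np.zeros((4, 6, 2))
flow[..., 1] = 1.0                               # dx = +1 everywhere
print(label_agreement(labels0, labels1, flow))
```

A score near 1 on adjacent frames would support the stability premise; a low score would indicate that the segmentation shifts between frames even after motion is accounted for.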

Circularity Check

0 steps flagged

No significant circularity; method proposal is self-contained

The paper describes a plug-and-play module (HRFFM with RDPFN) that integrates external SAM2-derived RDPs into existing VFI encoders. No equations, parameter-fitting steps, or derivations appear in the provided text. The central claim—that RDP integration yields similar encoder features for matched regions—is presented as an empirical outcome of the architecture rather than a quantity defined by the authors' own prior results or self-citations. No load-bearing self-citation chains, self-definitional loops, or fitted-input-as-prediction patterns are present. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the premise that SAM segmentation yields useful region distinctions for motion estimation and that the proposed fusion module can exploit them without side effects; no free parameters are mentioned.

axioms (2)
  • domain assumption: SAM2 produces stable and accurate region segmentations across adjacent video frames that can be represented as spatial-varying Gaussian mixtures
    Invoked when the abstract states that RDPs are derived from SAM for frames and represented as Gaussian mixtures to distinguish areas.
  • ad hoc to paper: Integrating RDPs via RDP-guided Feature Normalization in a residual manner will make encoder features similar for matched regions without introducing new artifacts
    Invoked in the description of HRFFM and its effect on feature representations.
invented entities (2)
  • Region-Distinguishable Priors (RDPs) · no independent evidence
    purpose: To provide spatial-varying Gaussian mixture representations that distinguish arbitrary numbers of regions across frames for motion estimation
    New concept introduced to address ambiguity in identifying corresponding areas
  • Hierarchical Region-aware Feature Fusion Module (HRFFM) · no independent evidence
    purpose: To incorporate RDPs into various hierarchical stages of the VFI encoder using RDP-guided Feature Normalization in residual learning
    New module designed for the integration step

pith-pipeline@v0.9.0 · 5773 in / 1563 out tokens · 38766 ms · 2026-05-24T05:45:25.124802+00:00 · methodology

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages

  1. [1] T.Y. et al.: Inpaint anything: Segment anything meets image inpainting. arXiv (2023)
  2. [2] Z.L. et al.: Can sam boost video super-resolution? arXiv (2023)
  3. [3] Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. IJCV (2011)
  4. [4] Bao, W., Lai, W.S., Zhang, X., Gao, Z., Yang, M.H.: Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE TPAMI (2019)
  5. [5] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI (2017)
  6. [6] Cheng, X., Chen, Z.: Video frame interpolation via deformable separable convolution. In: AAAI (2020)
  7. [7] Cheng, X., Chen, Z.: Multiple video frame interpolation via enhanced deformable separable convolution. IEEE TPAMI (2021)
  8. [8] Cheng, Y., Li, L., Xu, Y., Li, X., Yang, Z., Wang, W., Yang, Y.: Segment and track anything. arXiv (2023)
  9. [10] Choi, M., Kim, H., Han, B., Xu, N., Lee, K.M.: Channel attention is all you need for video frame interpolation. In: AAAI (2020)
  10. [11] Danier, D., Zhang, F., Bull, D.: Ldmvfi: Video frame interpolation with latent diffusion models. In: AAAI (2024)
  11. [12] Ding, T., Liang, L., Zhu, Z., Zharkov, I.: Cdfi: Compression-driven network design for frame interpolation. In: CVPR (2021)
  12. [13] Flynn, J., Neulander, I., Philbin, J., Snavely, N.: Deepstereo: Learning to predict new views from the world’s imagery. In: CVPR (2016)
  13. [14] Gui, S., Wang, C., Chen, Q., Tao, D.: Featureflow: Robust video interpolation via structure-to-texture generation. In: CVPR (2020)
  14. [15] Hu, M., Jiang, K., Zhong, Z., Wang, Z., Zheng, Y.: Iq-vfi: implicit quadratic motion estimation for video frame interpolation. In: CVPR (2024)
  15. [16] Hu, P., Niklaus, S., Sclaroff, S., Saenko, K.: Many-to-many splatting for efficient video frame interpolation. In: CVPR (2023)
  16. [17] Hu, P., Niklaus, S., Zhang, L., Sclaroff, S., Saenko, K.: Video frame interpolation with many-to-many splatting and spatial selective refinement. IEEE TPAMI (2023)
  17. [18] Hui, T.W., Tang, X., Loy, C.C.: Liteflownet: A lightweight convolutional neural network for optical flow estimation. In: CVPR (2018)
  18. [19] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
  19. [20] Jiang, H., Sun, D., Jampani, V., Yang, M.H., Learned-Miller, E., Kautz, J.: Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In: CVPR (2018)
  20. [21] Jin, X., Wu, L., Chen, J., Chen, Y., Koo, J., hee Hahm, C.: A unified pyramid recurrent network for video frame interpolation. In: CVPR (2023)
  21. [22] Jin, X., Wu, L., Shen, G., Chen, Y., Chen, J., Koo, J., hee Hahm, C.: Enhanced bi-directional motion estimation for video frame interpolation. In: WACV (2023)
  22. [23] Kalluri, T., Pathak, D., Chandraker, M., Tran, D.: Flavr: Flow-agnostic video representations for fast frame interpolation. In: WACV (2020)
  23. [24] Kim, Y., Kwon, S., Kang, D., Lee, H., Paik, J.: Enhancing video frame interpolation with region of motion loss and self-attention mechanisms: A dual approach to address large, nonlinear motions. Neurocomputing (2025)
  24. [25] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: CVPR (2019)
  25. [26] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV (2023)
  26. [27] Kong, L., Jiang, B., Luo, D., Chu, W., Huang, X., Tai, Y., Wang, C., Yang, J.: Ifrnet: Intermediate feature refine network for efficient frame interpolation. In: CVPR (2022)
  27. [28] Lee, H., Kim, T., young Chunga, T., Pak, D., Ban, Y., Lee, S.: Adacof: Adaptive collaboration of flows for video frame interpolation. In: CVPR (2020)
  28. [29] Lee, S., Choi, N., Choi, W.I.: Enhanced correlation matching based video frame interpolation. In: WACV (2022)
  29. [30] Lee, Y., Park, J.: Centermask: Real-time anchor-free instance segmentation. In: CVPR (2020)
  30. [31] Li, Z., Zhu, Z.L., Han, L.H., Hou, Q., Guo, C.L., Cheng, M.M.: Amt: All-pairs multi-field transforms for efficient frame interpolation. In: CVPR (2023)
  31. [32] Liu, C., Yang, H., Fu, J., Qian, X.: Ttvfi: Learning trajectory-aware transformer for video frame interpolation. IEEE TIP (2023)
  32. [33] Liu, C., Zhang, G., Zhao, R., Wang, L.: Sparse global matching for video frame interpolation with large motion. In: CVPR (2024)
  33. [34] Liu, M., Xu, C., Yao, C., Lin, C., Zhao, Y.: Jnmr: Joint non-linear motion regression for video frame interpolation. IEEE TIP (2023)
  34. [35] Liu, Y.L., Liao, Y.T., Lin, Y.Y., Chuang, Y.Y.: Deep video frame interpolation using cyclic frame generation. In: AAAI (2019)
  35. [36] Liu, Z., Yeh, R.A., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: ICCV (2017)
  36. [37] Long, G., Kneip, L., Alvarez, J.M., Li, H., Zhang, X., Yu, Q.: Learning image matching by simply watching video. In: ECCV (2016)
  37. [38] Lu, G., Zhang, X., Chen, L., Gao, Z.: Novel integration of frame rate up conversion and hevc coding based on rate-distortion optimization. IEEE TIP (2017)
  38. [39] Lu, L., Wu, R., Lin, H., Lu, J., Jia, J.: Video frame interpolation with transformer. In: CVPR (2022)
  39. [40] Lyu, Z., Chen, C.: Tlb-vfi: Temporal-aware latent brownian bridge diffusion for video frame interpolation. In: ICCV (2025)
  40. [41] Lyu, Z., Li, M., Jiao, J., Chen, C.: Frame interpolation with consecutive brownian bridge diffusion. In: ACM MM (2024)
  41. [42] Meyer, S., Djelouah, A., McWilliams, B., Sorkine-Hornung, A., Gross, M., Schroers, C.: Phasenet for video frame interpolation. In: CVPR (2018)
  42. [43] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Sign. Process. Letters (2012)
  43. [44] Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In: ICLR (2025)
  44. [45] Niklaus, S., Liu, F.: Context-aware synthesis for video frame interpolation. In: CVPR (2018)
  45. [46] Niklaus, S., Liu, F.: Softmax splatting for video frame interpolation. In: CVPR (2020)
  46. [47] Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive convolution. In: CVPR (2017)
  47. [48] Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive separable convolution. In: ICCV (2017)
  48. [49] Niklaus, S., Mai, L., Wang, O.: Revisiting adaptive convolutions for video frame interpolation. In: WACV (2021)
  49. [50] Park, J., Kim, J., Kim, C.S.: Biformer: Learning bilateral motion estimation via bilateral transformer for 4k video frame interpolation. In: CVPR (2023)
  50. [51] Park, J., Ko, K., Lee, C., Kim, C.S.: Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In: ECCV (2020)
  51. [52] Park, J., Lee, C., Kim, C.S.: Asymmetric bilateral motion estimation for video frame interpolation. In: ICCV (2021)
  52. [53] Plack, M., Briedis, K.M., Djelouah, A., Hullin, M.B., Gross, M., Schroers, C.: Frame interpolation transformer and uncertainty guidance. In: CVPR (2023)
  53. [54] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. In: ICLR (2025)
  54. [55] Reda, F., Kontkanen, J., Tabellion, E., Sun, D., Pantofaru, C., Curless, B.: Film: Frame interpolation for large motion. arXiv (2022)
  55. [56] Seo, W., Oh, J., Kim, M.: Bim-vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions. In: CVPR (2025)
  56. [57] Shi, Z., Xu, X., Liu, X., Chen, J., Yang, M.H.: Video frame interpolation transformer. In: CVPR (2022)
  57. [58] Shu, H., Li, W., Tang, Y., Zhang, Y., Chen, Y., Li, H., Wang, Y., Chen, X.: Tinysam: Pushing the envelope for efficient segment anything model. In: AAAI (2025)
  58. [59] Sim, H., Oh, J., Kim, M.: Xvfi: Extreme video frame interpolation. In: ICCV (2021)
  59. [60] Siyao, L., Zhao, S., Yu, W., Sun, W., Metaxas, D., Loy, C.C., Liu, Z.: Deep animation video interpolation in the wild. In: CVPR (2021)
  60. [61] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv (2012)
  61. [62] Stergiou, A.: Lavib: A large-scale video interpolation benchmark. In: NeurIPS (2024)
  62. [63] Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: CVPR (2018)
  63. [64] Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: ECCV (2020)
  64. [65] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation. In: ICLR (2019)
  65. [66] Wang, W., Wang, Q., Zheng, K., Ouyang, H., Chen, Z., Gong, B., Chen, H., Shen, Y., Shen, C.: Framer: Interactive frame interpolation. In: ICLR (2025)
  66. [67] Wang, X., Zhou, B., Curless, B., Kemelmacher-Shlizerman, I., Holynski, A., Seitz, S.M.: Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In: ECCV (2024)
  67. [68] Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Solo: A simple framework for instance segmentation. IEEE TPAMI (2021)
  68. [69] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP (2004)
  69. [70] Wu, G., Tao, X., Li, C., Wang, W., Liu, X., Zheng, Q.: Perception-oriented video frame interpolation via asymmetric blending. In: CVPR (2024)
  70. [71] Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. IJCV (2019)
  71. [72] Yoo, J.S., Lee, H., Jung, S.W.: Video object segmentation-aware video frame interpolation. In: ICCV (2023)
  72. [73] Zhang, G., Liu, C., Cui, Y., Zhao, X., Wang, K.M.L.: Vfimamba: Video frame interpolation with state space models. In: NeurIPS (2024)
  73. [74] Zhang, G., Zhu, Y., Wang, H., Chen, Y., Wu, G., Wang, L.: Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In: CVPR (2023)
  74. [75] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
  75. [76] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
  76. [77] Zhong, Z., Krishnan, G., Sun, X., Qiao, Y., Ma, S., Wang, J.: Clearer frames, anytime: Resolving velocity ambiguity in video frame interpolation. In: ECCV (2024)
  77. [78] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: CVPR (2017)
  78. [79] Zhou, C., Liu, J., Tang, J., Wu, G.: Video frame interpolation with densely queried bilateral correlation. In: IJCAI (2023)
  79. [80] Zhu, T., Ren, D., Wang, Q., Wu, X., Zuo, W.: Generative inbetweening through frame-wise conditions-driven video generation. In: CVPR (2025)