pith. machine review for the scientific record.

arxiv: 2604.06740 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords novel view synthesis · live streaming · unposed multi-view video · feed-forward model · dynamic scene reconstruction · temporal consistency · camera pose prediction · real-time rendering

The pith

A feed-forward model streams temporally consistent novel views in real time from unposed multi-view videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiveStre4m, a feed-forward approach to live novel view synthesis that reconstructs dynamic 3D scenes from sparse, unposed video streams. It combines a multi-view vision transformer for keyframe reconstruction, a module that predicts camera poses and intrinsics directly from RGB images, and a diffusion-transformer interpolation module that smooths the transitions between keyframes. The pipeline runs at an average of 0.07 seconds per frame at 1024 by 768 resolution and works with as few as two synchronized input streams. If the approach holds, it removes the ground-truth camera calibration and slow per-scene optimization that previously made real-time streaming impractical. The result is a system that could support practical live applications such as broadcasting or virtual environments without specialized hardware setups.
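Read structurally, the pipeline decomposes into three stages wired into a streaming loop. The sketch below is a hypothetical NumPy mock of that decomposition, not the authors' code: every class, function, and shape is a placeholder standing in for the learned modules the paper describes.

```python
# Hypothetical structural sketch of a LiveStre4m-style pipeline (placeholder stubs only).
import numpy as np

class CameraPosePredictor:
    """Stub: would regress per-view extrinsics and intrinsics from RGB alone."""
    def __call__(self, views: np.ndarray):
        n = views.shape[0]
        poses = np.tile(np.eye(4), (n, 1, 1))       # [n, 4, 4] camera-to-world
        intrinsics = np.tile(np.eye(3), (n, 1, 1))  # [n, 3, 3] focal length / principal point
        return poses, intrinsics

class KeyframeReconstructor:
    """Stub: multi-view ViT that turns posed keyframes into a renderable 3D scene."""
    def __call__(self, views, poses, intrinsics):
        return {"views": views, "poses": poses, "K": intrinsics}  # opaque scene state

class TemporalInterpolator:
    """Stub: diffusion-transformer interpolation between consecutive keyframe renders."""
    def __call__(self, prev_frame, next_frame, t: float):
        return (1.0 - t) * prev_frame + t * next_frame  # placeholder: linear blend

def render(scene, target_pose, hw=(768, 1024)):
    """Stub renderer: a real system would rasterize/splat the scene at target_pose."""
    return np.zeros((*hw, 3), dtype=np.float32)

def stream(keyframes, target_pose, n_interp=3):
    """Yield novel-view frames for a stream of unposed multi-view keyframes."""
    pose_net, recon, interp = CameraPosePredictor(), KeyframeReconstructor(), TemporalInterpolator()
    prev = None
    for views in keyframes:                          # views: [n_cams, H, W, 3], n_cams >= 2
        poses, K = pose_net(views)                   # no ground-truth calibration needed
        frame = render(recon(views, poses, K), target_pose)
        if prev is not None:
            for i in range(1, n_interp + 1):         # fill in-between frames for smoothness
                yield interp(prev, frame, i / (n_interp + 1))
        yield frame
        prev = frame

# Two synchronized 1024x768 input streams, the paper's minimal setting:
demo = [np.random.rand(2, 768, 1024, 3).astype(np.float32) for _ in range(4)]
frames = list(stream(demo, target_pose=np.eye(4)))
print(len(frames), frames[0].shape)
```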

Core claim

LiveStre4m is a feed-forward model consisting of a multi-view vision transformer for keyframe 3D scene reconstruction, a Camera Pose Predictor that estimates both poses and intrinsics directly from RGB images, and a diffusion-transformer interpolation module that maintains temporal consistency, allowing real-time novel-view video streaming from as few as two synchronized unposed inputs at 0.07 seconds per frame while outperforming optimization-based methods by orders of magnitude in speed.

What carries the argument

A multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module and a Camera Pose Predictor that estimates poses and intrinsics from RGB images.

If this is right

  • Novel view synthesis becomes feasible for live streaming without requiring known camera parameters or lengthy optimization per scene.
  • Real-time performance at 0.07 seconds per frame at 1024 by 768 resolution makes the method suitable for practical video applications (see the back-of-envelope throughput check after this list).
  • Temporal consistency across frames is preserved in dynamic content using the interpolation module.
  • The system operates with minimal input consisting of only two synchronized unposed streams.
  • The approach enables generalization to new scenes without retraining or per-scene fitting.
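A back-of-envelope check on those throughput figures, using only the two numbers quoted in the abstract (0.07 s per frame for LiveStre4m, roughly 2.67 s for the optimization-based comparison):

```python
# Back-of-envelope throughput check using the figures quoted in the abstract.
per_frame_s = 0.07                   # LiveStre4m average reconstruction time per frame
baseline_s = 2.67                    # per-frame optimization cost cited in the abstract

fps = 1.0 / per_frame_s              # ~14.3 frames per second at 1024x768
speedup = baseline_s / per_frame_s   # ~38x faster per frame than the cited baseline

print(f"throughput ~ {fps:.1f} fps, speedup ~ {speedup:.0f}x over the cited baseline")
```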

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pose predictor proves robust across varied lighting and motion, it could lower the barrier for setting up multi-camera live events.
  • The speed opens the possibility of running similar pipelines on edge devices for interactive view switching during streams.
  • Extensions might test whether adding more input views further improves quality while keeping the same runtime profile.

Load-bearing premise

The Camera Pose Predictor can reliably estimate accurate poses and intrinsics directly from RGB images without ground-truth calibration, and the overall model generalizes to unseen dynamic scenes while maintaining temporal consistency.

What would settle it

A test on a previously unseen dynamic scene captured with two unposed synchronized cameras that shows either visible temporal artifacts in the output video or reconstruction times substantially above 0.07 seconds per frame would falsify the central performance claim.
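The runtime half of that test is easy to operationalize. The sketch below assumes a hypothetical reconstruct_frame callable standing in for the released model (the paper does not specify its API); the temporal-consistency half would still require visual inspection or a dedicated video metric.

```python
# Hypothetical falsification harness for the runtime half of the claim: time per-frame
# reconstruction on an unseen two-camera capture and compare against the 0.07 s budget.
import time
import numpy as np

def reconstruct_frame(views: np.ndarray) -> np.ndarray:
    """Placeholder for the released model's per-frame call (not the real API)."""
    return np.zeros((768, 1024, 3), dtype=np.float32)

def runtime_check(keyframes, budget_s: float = 0.07) -> bool:
    latencies = []
    for views in keyframes:             # views: [2, 768, 1024, 3], unposed and synchronized
        t0 = time.perf_counter()
        _ = reconstruct_frame(views)
        latencies.append(time.perf_counter() - t0)
    mean_s = sum(latencies) / len(latencies)
    print(f"mean {mean_s * 1000:.1f} ms/frame vs budget {budget_s * 1000:.0f} ms")
    return mean_s <= budget_s           # False would undercut the runtime claim

clip = [np.random.rand(2, 768, 1024, 3).astype(np.float32) for _ in range(10)]
print("within budget:", runtime_check(clip))
```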

Figures

Figures reproduced from arXiv: 2604.06740 by Egor Bondarev, Erkut Akdag, Pedro Quesado, Willem Menu, Yasaman Kashefbahrami.

Figure 1: Illustration of the proposed LiveStre4m method, a feed…
Figure 2: Overview of the LiveStre4m model architecture. The model receives multi-view video keyframes; the first such keyframe is used…
Figure 3: Representation of the Spatial Module architecture. The module leverages ViTs to model scene 3D geometry and appearance…
Figure 4: Qualitative results produced with LiveStre4m, which…
Figure 6: Qualitative results comparing different time steps of…
Original abstract

Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations ($\approx 2.67$s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real-time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of $ 0.07$s per-frame at $ 1024 \times 768$ resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: https://github.com/pedro-quesado/LiveStre4m

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents LiveStre4m, a feed-forward model for real-time novel view synthesis and live streaming from unposed sparse multi-view video. It comprises a Camera Pose Predictor that estimates poses and intrinsics directly from RGB images, a multi-view vision transformer for keyframe 3D scene reconstruction, and a diffusion-transformer interpolation module to enforce temporal consistency. The central claim is that the system enables temporally consistent novel-view video streaming in real time (0.07 s per frame at 1024×768) from as few as two synchronized unposed streams, outperforming optimization-based dynamic scene methods by orders of magnitude in runtime.

Significance. If the runtime and quality claims are substantiated, the work would mark a meaningful advance toward deployable live NVS systems by removing reliance on ground-truth calibration and per-scene optimization, potentially enabling practical applications in telepresence, AR, and broadcast.

major comments (2)
  1. [Abstract] The claim of 0.07 s per-frame reconstruction at 1024×768 and orders-of-magnitude speedup over optimization-based methods is presented without any quantitative runtime tables, quality metrics (PSNR, SSIM, LPIPS), baseline comparisons, or error bars, which are load-bearing for the central performance assertion.
  2. [Abstract] The Camera Pose Predictor is positioned as the key enabler for unposed inputs, yet the text supplies no pose-error metrics (rotation/translation errors), no ablation on pose noise sensitivity, and no comparison against methods that use ground-truth poses, leaving the generalization and real-time claims without supporting evidence.
minor comments (1)
  1. The abstract states that code is available at the cited GitHub link but provides no details on training data, model checkpoints, or evaluation protocols needed for reproducibility.
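For context, the image-quality metrics requested above are standard to compute on rendered versus held-out views. A minimal NumPy/scikit-image sketch follows; LPIPS would additionally need a learned perceptual model such as the lpips package, and none of this reflects the authors' evaluation code.

```python
# Minimal sketch of the quality metrics the referee asks for (PSNR and SSIM);
# LPIPS additionally requires a learned perceptual model (e.g. the lpips package).
import numpy as np
from skimage.metrics import structural_similarity

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

pred = np.random.rand(768, 1024, 3)                               # rendered novel view
gt = np.clip(pred + 0.02 * np.random.randn(*pred.shape), 0, 1)    # held-out ground truth

print("PSNR:", psnr(pred, gt))
print("SSIM:", structural_similarity(pred, gt, channel_axis=-1, data_range=1.0))
```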

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to better substantiate the claims.

Point-by-point responses
  1. Referee: [Abstract] The claim of 0.07 s per-frame reconstruction at 1024×768 and orders-of-magnitude speedup over optimization-based methods is presented without any quantitative runtime tables, quality metrics (PSNR, SSIM, LPIPS), baseline comparisons, or error bars, which are load-bearing for the central performance assertion.

    Authors: We agree that the abstract's performance claims would be strengthened by explicit quantitative support. The current manuscript body includes experimental results, but we will add a dedicated runtime table with comparisons to optimization-based baselines, report PSNR/SSIM/LPIPS values with error bars, and revise the abstract to reference these tables and key numerical results. revision: yes

  2. Referee: [Abstract] The Camera Pose Predictor is positioned as the key enabler for unposed inputs, yet the text supplies no pose-error metrics (rotation/translation errors), no ablation on pose noise sensitivity, and no comparison against methods that use ground-truth poses, leaving the generalization and real-time claims without supporting evidence.

    Authors: We acknowledge that the current text lacks explicit quantitative evaluation of the Camera Pose Predictor. We will add rotation/translation error metrics, an ablation study on pose noise sensitivity, and direct comparisons against ground-truth pose inputs in the experiments section, along with an update to the abstract referencing these results. revision: yes
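For reference, the rotation/translation errors the rebuttal commits to are conventionally computed as below. This is a generic sketch under standard relative-pose conventions, not the authors' evaluation protocol.

```python
# Standard pose-error metrics (assumed conventions, not the authors' protocol):
# rotation error as the geodesic angle between predicted and ground-truth rotations,
# translation error as the Euclidean distance between camera centers.
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    return float(np.linalg.norm(t_pred - t_gt))

# Toy example: predicted pose differs from ground truth by a 2-degree yaw and a small offset.
theta = np.radians(2.0)
R_gt = np.eye(3)
R_pred = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
print(rotation_error_deg(R_pred, R_gt), "deg,",
      translation_error(np.array([0.01, 0.0, 0.0]), np.zeros(3)), "units")
```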

Circularity Check

0 steps flagged

No circularity: empirical feed-forward model evaluated on external metrics

full rationale

The paper describes a trained feed-forward pipeline (multi-view ViT reconstruction + diffusion-transformer interpolation + Camera Pose Predictor) whose central claims are runtime (0.07 s/frame) and qualitative/quantitative superiority over optimization baselines. These are measured against independent test data and external timing, not derived from the model's own fitted outputs or self-referential definitions. No equations, uniqueness theorems, or self-citations are invoked in the abstract or description to force the performance results by construction. The pose predictor is an architectural component whose accuracy is an empirical assumption, not a definitional tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of three learned modules whose internal parameters are fitted during training; the abstract provides no explicit list of free parameters or additional axioms beyond standard neural-network assumptions.

free parameters (1)
  • neural network weights
    All transformer and diffusion model parameters are learned from data and central to the reported speed and quality.
axioms (1)
  • domain assumption: The diffusion-transformer interpolation module produces temporally consistent frames
    Invoked to ensure stable streaming but not derived from first principles in the abstract.

pith-pipeline@v0.9.0 · 5579 in / 1422 out tokens · 41815 ms · 2026-05-10T18:51:17.107059+00:00 · methodology

discussion (0)

