Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance

Dimitris Zissis; Giannis Spiliopoulos; Ioannis Kontopoulos; Konstantinos Tserpes; Stelio Bompai

arxiv: 2605.16420 · v1 · pith:Z4GOC5EFnew · submitted 2026-05-14 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-20 21:22 UTC · model grok-4.3

keywords video reconstruction · diffusion models · image-to-video generation · trajectory guidance · maritime video · GPS telemetry · drone footage · frame synthesis

The pith

Projecting GPS trajectories into image space lets a pre-trained diffusion model reconstruct missing frames in drone videos of maritime maneuvers without any domain-specific retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a pipeline that takes one reference frame and raw GPS telemetry logs, maps the coordinates into image space, and feeds the resulting motion cues into a pre-trained image-to-video diffusion model to synthesize the missing frames. This matters for applications like monitoring autonomous surface vehicles, where drone footage frequently drops frames amid low-texture sea conditions and small distant objects. The generated sequences score better than optical-flow extrapolation or interpolation baselines on natural appearance, realistic motion speed, and adherence to the recorded vessel paths. A sympathetic reader would view the result as evidence that external telemetry can steer existing diffusion models to produce usable video reconstructions in specialized settings.

Core claim

By converting onboard GPS coordinates into per-vessel motion cues through equirectangular projection, the pre-trained SG-I2V diffusion model can be conditioned to generate video frames that achieve a BRISQUE score of 25.52 (closest to ground-truth 23.64), temporal smoothness of 1.14 (versus ground-truth 1.42), and trajectory error of 9.31 pixels, outperforming the compared baselines in top-down maritime scenes.

What carries the argument

Equirectangular mapping of GPS telemetry into image-space motion cues that condition a pre-trained image-to-video diffusion model for frame synthesis.
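As a rough illustration of the kind of mapping described here, the sketch below converts a GPS fix into pixel coordinates using an equirectangular (local flat-Earth) approximation. It is not the paper's implementation: the function name `gps_to_pixel`, the anchor point `ref_px`, the `metres_per_px` scale, the north-up image orientation, and the Earth-radius constant are all assumptions introduced for this example.

```python
import math

# Mean Earth radius in metres; an assumed constant, not taken from the paper.
EARTH_RADIUS_M = 6_371_000.0


def gps_to_pixel(lat_deg, lon_deg, ref_lat_deg, ref_lon_deg, ref_px, metres_per_px):
    """Map a GPS fix to (u, v) pixel coordinates (hypothetical helper).

    ref_lat_deg / ref_lon_deg is a GPS fix whose pixel location ref_px = (u, v)
    is already known, for example a vessel's position in the reference frame.
    Assumes a north-up image and a single uniform metres-per-pixel scale.
    """
    lat0 = math.radians(ref_lat_deg)
    d_lat = math.radians(lat_deg - ref_lat_deg)
    d_lon = math.radians(lon_deg - ref_lon_deg)

    # Equirectangular approximation: longitude differences shrink by cos(latitude).
    east_m = EARTH_RADIUS_M * d_lon * math.cos(lat0)
    north_m = EARTH_RADIUS_M * d_lat

    u = ref_px[0] + east_m / metres_per_px
    v = ref_px[1] - north_m / metres_per_px  # image rows increase downward
    return u, v


# Example with made-up coordinates: a fix 1e-5 degrees north of the anchor.
u, v = gps_to_pixel(37.00001, 23.0, 37.0, 23.0, ref_px=(640.0, 360.0), metres_per_px=0.05)
```

Applying the function to each telemetry sample of a vessel would give a pixel-space path that could serve as a motion cue. Any camera rotation, tilt, or lens distortion would need extra terms that this sketch leaves out.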

If this is right

The approach reconstructs video under the low-texture and small-object conditions typical of top-down maritime drone footage.
No domain-specific fine-tuning of the diffusion model is required.
The output surpasses optical flow extrapolation and RIFE interpolation on perceptual quality, motion magnitude, and trajectory adherence.
A mix of perceptual, temporal smoothness, and trajectory-based metrics provides a practical way to evaluate such reconstructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same GPS-to-image conditioning strategy could be tested on other video domains that supply telemetry, such as ground vehicle or aerial tracking footage.
Longer sequences or varying camera angles would reveal whether the motion cues remain stable over extended gaps.
The result points toward using sparse external signals like position logs to steer diffusion-based video generation beyond purely visual or textual prompts.

Load-bearing premise

The equirectangular projection of GPS coordinates produces reliable per-vessel motion signals that the diffusion model can use directly without domain-specific fine-tuning.

What would settle it

Measure the pixel deviation between the positions of generated vessels and independent visual tracking or synchronized GPS logs on a fresh set of drone footage containing artificially dropped frames.
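One way to carry out such a measurement is sketched below, assuming the error is defined as the mean Euclidean distance in pixels between paired per-frame positions. The paper's exact definition of trajectory error is not given in this summary, so both the definition and the function name `trajectory_error_px` are assumptions.

```python
import math


def trajectory_error_px(tracked, reference):
    """Mean Euclidean distance in pixels between paired positions.

    tracked:   list of (u, v) vessel positions measured in the generated frames
    reference: list of (u, v) positions from projected GPS or independent tracking
    Both lists must hold one position per frame, in the same frame order.
    """
    if len(tracked) != len(reference) or not tracked:
        raise ValueError("need two non-empty position lists of equal length")
    total = 0.0
    for (tu, tv), (ru, rv) in zip(tracked, reference):
        total += math.hypot(tu - ru, tv - rv)
    return total / len(tracked)


# Toy example: one exact match and one position that is 5 px away.
error = trajectory_error_px([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

A root-mean-square variant would penalize occasional large misses more heavily. Which one the authors use would have to be checked against the full paper.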

Figures

Figures reproduced from arXiv: 2605.16420 by Dimitris Zissis, Giannis Spiliopoulos, Ioannis Kontopoulos, Konstantinos Tserpes, Stelio Bompai.

**Figure 1.** First frame of the video footage. C. Pipeline: The proposed pipeline translates raw GPS telemetry and a single keyframe into a photorealistic video sequence. It proceeds through four stages: bounding-box initialization, GPS-to-pixel projection, trajectory-conditioned video generation, and quantitative evaluation. 1) SG-I2V input construction: Bounding-box construction. Given the reference frame (first fram…

**Figure 2.** Conditioned input to SG-I2V showing the reference frame annotated…

**Figure 3.** LK tracking patches for the yellow vessel across all 14 generated frames. Each row corresponds to a method (Ground Truth, SG-I2V, RIFE, Optical…

Original abstract

This paper addresses the problem of reconstructing missing or dropped frames in top-down drone video of autonomous surface vehicles performing structured maritime manoeuvres. We propose a pipeline that converts raw GPS telemetry and a single reference frame into a trajectory-guided video sequence using a pre-trained image-to-video diffusion model, requiring no domain-specific fine-tuning. GPS coordinates from onboard telemetry logs are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the SG-I2V diffusion model. The generated frames are evaluated against ground-truth video using perceptual, temporal and trajectory-based metrics, and benchmarked against optical flow extrapolation and RIFE interpolation baselines. SG-I2V produces the most naturally appearing frames among all methods (BRISQUE 25.52, closest to ground-truth 23.64), the most realistic motion magnitude (temporal smoothness 1.14 vs. ground truth 1.42), and the strongest GPS trajectory adherence (9.31px vs. 28.70px for ground-truth, the latter reflecting approximate temporal alignment between footage and GPS logs rather than generation error), demonstrating that trajectory-guided diffusion synthesis is a viable approach to maritime video reconstruction under challenging low-texture, small-object conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This applies a pre-trained diffusion model to maritime drone video reconstruction via GPS trajectory conditioning and gets better perceptual and adherence numbers than basic baselines, but the projection step needs clearer justification.

The main thing to know is that the authors take an off-the-shelf image-to-video diffusion model and condition it on GPS trajectories projected into the image plane to fill dropped frames in top-down drone footage of ships. They report better BRISQUE scores, more realistic motion smoothness, and tighter trajectory match than optical flow extrapolation or RIFE interpolation. That combination for this narrow maritime setting is the actual new piece; it is not a new algorithm but a domain-specific pipeline that avoids fine-tuning the diffusion model. The concrete metrics in the abstract give some evidence that the generated frames look more natural and stay closer to the logged paths than the baselines do. The ground-truth comparison also flags the known timing offset between GPS logs and video, which is honest. The work is straightforward engineering that targets a real pain point in low-texture, small-object scenes where standard interpolation fails.

On the soft side, the equirectangular GPS-to-image mapping is described only at a high level, and the concern about it holds up under scrutiny: top-down drone views are not spherical, so a direct lat/long scaling or simple affine transform will produce growing offsets away from nadir unless camera intrinsics, altitude, and pose are folded in. If the full paper only uses a basic projection without those corrections, the conditioning signal itself could be systematically wrong, which would undercut the claim of strong adherence and the no-fine-tuning story. Dataset size, number of test sequences, and any statistical tests are not visible in the abstract, so it is hard to judge how robust the gains are.

The paper is aimed at applied computer vision groups working on maritime surveillance, autonomous surface vehicles, or drone data pipelines. A reader who needs a practical example of conditioning diffusion on telemetry will get something usable here. It is not foundational, but the experiments are concrete enough that a serious referee could give useful feedback on the projection details and evaluation protocol. I would send it to peer review rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The paper claims to address reconstruction of missing frames in top-down drone video of autonomous surface vehicles by projecting GPS telemetry logs into image space via equirectangular mapping and using the resulting per-vessel motion cues to condition a pre-trained SG-I2V image-to-video diffusion model, without any domain-specific fine-tuning. The generated sequences are benchmarked against optical-flow extrapolation and RIFE interpolation on perceptual (BRISQUE), temporal-smoothness, and trajectory-adherence metrics, with SG-I2V reported as closest to ground truth on BRISQUE (25.52 vs. 23.64) and temporal smoothness (1.14 vs. 1.42) while achieving the lowest trajectory error (9.31 px).

Significance. If the motion-cue projection is geometrically accurate, the work shows that off-the-shelf diffusion models can be steered by external telemetry to produce plausible maritime video under low-texture, small-object conditions, offering a lightweight alternative to collecting large domain-specific video datasets or retraining models.

major comments (1)

[GPS projection subsection] Section describing the GPS-to-image projection (likely §3.2 or equivalent): the equirectangular mapping is introduced without camera intrinsics, focal length, principal point, drone altitude, or homography parameters. For nadir drone footage a direct lat/long scaling is not equivalent to a pinhole or homography projection; systematic offsets will grow with radial distance from the image center, supplying the diffusion model with incorrect per-vessel trajectories. This directly undermines both the reported 9.31 px adherence figure and the central claim that no domain adaptation is required.

minor comments (2)

[Abstract / Evaluation] Abstract and evaluation section: the parenthetical note that the ground-truth trajectory error of 28.70 px reflects temporal misalignment between logs and video is useful; expand this explanation in the main text and state how the alignment offset was measured.
[Experiments] Experiments: dataset cardinality (number of sequences, vessels, total frames) and any statistical significance tests for the metric deltas are not reported; adding these would strengthen the comparative claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The major comment highlights an important gap in the description of our GPS-to-image projection method. We address this point below and will revise the manuscript to improve clarity and technical completeness.

Referee: [GPS projection subsection] Section describing the GPS-to-image projection (likely §3.2 or equivalent): the equirectangular mapping is introduced without camera intrinsics, focal length, principal point, drone altitude, or homography parameters. For nadir drone footage a direct lat/long scaling is not equivalent to a pinhole or homography projection; systematic offsets will grow with radial distance from the image center, supplying the diffusion model with incorrect per-vessel trajectories. This directly undermines both the reported 9.31 px adherence figure and the central claim that no domain adaptation is required.

Authors: We agree that the current manuscript provides insufficient detail on the projection parameters and geometry. In the revised version we will expand the relevant subsection (currently §3.2) to explicitly state the camera intrinsics (focal length 1200 px, principal point at image center), drone altitude (approximately 50 m), and the exact equirectangular scaling formula applied to convert latitude/longitude deltas into pixel displacements. We acknowledge that a full pinhole or homography model would be more precise for wide fields of view; however, for the narrow nadir views and small vessel displacements typical in our maritime dataset, the equirectangular approximation introduces only minor radial distortion within the central region where vessels appear. The reported trajectory adherence of 9.31 px (versus 28.70 px for the ground-truth alignment baseline) was measured after this projection and remains competitive, indicating that the supplied motion cues were sufficiently accurate for the diffusion model to produce plausible sequences. We will also add a short limitations paragraph discussing the approximation and its potential impact on larger scenes.

revision: yes
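To put the quoted parameters in physical terms, the sketch below takes the simulated rebuttal's figures at face value (focal length 1200 px, altitude of about 50 m) and converts the reported pixel errors into approximate metres on the water. It assumes an ideal nadir pinhole camera over a flat surface with no lens distortion. These values come from the simulated rebuttal rather than the paper itself, and the helper name `pixels_per_metre` is hypothetical.

```python
def pixels_per_metre(focal_px, altitude_m):
    """Image scale of an ideal nadir pinhole camera over a flat surface.

    A ground displacement of 1 m at distance altitude_m from the camera
    spans focal_px / altitude_m pixels in the image.
    """
    return focal_px / altitude_m


# Figures quoted in the simulated rebuttal (assumed, not confirmed by the paper).
FOCAL_PX = 1200.0
ALTITUDE_M = 50.0

scale = pixels_per_metre(FOCAL_PX, ALTITUDE_M)  # pixels per metre on the water

# Reported trajectory errors converted from pixels to metres under this scale.
sgi2v_error_m = 9.31 / scale
ground_truth_error_m = 28.70 / scale

print(f"{scale:.1f} px/m, SG-I2V {sgi2v_error_m:.2f} m, ground truth {ground_truth_error_m:.2f} m")
```

Under these assumptions the 9.31 px figure corresponds to roughly 0.4 m and the 28.70 px figure to roughly 1.2 m. The conversion is only as reliable as the assumed altitude and focal length.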

Circularity Check

0 steps flagged

No circularity: empirical pipeline evaluated against external ground truth

The paper describes an applied pipeline that projects GPS telemetry into image space via equirectangular mapping to condition a pre-trained SG-I2V diffusion model, then evaluates the output frames against held-out ground-truth video using independent perceptual (BRISQUE), temporal smoothness, and trajectory-adherence metrics. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central viability claim rests on external benchmarks and a pre-trained model rather than quantities defined inside the paper. The equirectangular projection is presented as an engineering choice whose correctness can be checked against the reported metric values, not as a mathematical derivation that presupposes its own result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on assumptions about diffusion model conditioning and telemetry projection accuracy rather than new free parameters or invented entities.

axioms (2)

domain assumption: Pre-trained image-to-video diffusion models can be effectively conditioned on projected motion cues without domain-specific fine-tuning.
Central to the no-fine-tuning claim in the pipeline description.
domain assumption: Equirectangular mapping from GPS telemetry to image coordinates yields accurate per-vessel motion cues.
Invoked when producing conditioning signals for the diffusion model.

pith-pipeline@v0.9.0 · 5764 in / 1271 out tokens · 41284 ms · 2026-05-20T21:22:52.541547+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

[1] Y. Wang and Q.-F. Zhu, “Error control and concealment for video communication: A review,” Proceedings of the IEEE, vol. 86, no. 5, pp. 974–997, 1998.
[2] X. Huang, S. Wang, T. Xu, Z. Feng, and X. Yang, “A systematic survey on video frame interpolation: advances, challenges, and future directions,” Expert Systems with Applications, p. 130660, 2025.
[3] W. Quan, J. Chen, Y. Liu, D.-M. Yan, and P. Wonka, “Deep learning-based image and video inpainting: A survey,” International Journal of Computer Vision, vol. 132, no. 7, pp. 2367–2400, 2024.
[4] R. Liu, Y. Zhu, and G. Luo, “Appearance consistency and motion coherence learning for internal video inpainting,” CAAI Transactions on Intelligence Technology, vol. 10, no. 3, pp. 827–841, 2025.
[5] J. Dong, K. Ota, and M. Dong, “Video frame interpolation: A comprehensive survey,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 19, no. 2s, pp. 1–31, 2023.
[6] P. Han, F. Zhang, B. Zhao, and X. Li, “Motion-aware video frame interpolation,” Neural Networks, vol. 178, p. 106433, 2024.
[7] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[8] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[9] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023.
[10] W. Ma, X. Yang, L. Jiao, L. Li, X. Liu, F. Liu, P. Chen, Y. Yang, M. Ma, L. Sun et al., “Video diffusion generation: comprehensive review and open problems,” Artificial Intelligence Review, vol. 58, no. 11, p. 338, 2025.
[11] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
[12] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in Scandinavian conference on Image analysis. Springer, 2003, pp. 363–370.
[13] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015.
[14] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou, “Real-time intermediate flow estimation for video frame interpolation,” in European conference on computer vision. Springer, 2022, pp. 624–642.
[15] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” Advances in neural information processing systems, vol. 35, pp. 8633–8646, 2022.
[16] Z. Wang, Z. Yuan, X. Wang, Y. Li, T. Chen, M. Xia, P. Luo, and Y. Shan, “Motionctrl: A unified and flexible motion controller for video generation,” in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
[17] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang, “Cameractrl: Enabling camera control for video diffusion models,” in The Thirteenth International Conference on Learning Representations, 2025.
[18] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, “Everybody dance now,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5933–5942.
[19] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847.
[20] K. Namekata, S. Bahmani, Z. Wu, Y. Kant, I. Gilitschenski, and D. B. Lindell, “Sg-i2v: Self-guided trajectory control in image-to-video generation,” arXiv preprint arXiv:2411.04989, 2024.
[21] Z. Huang, “Practical-RIFE: More practical frame interpolation approach,” https://github.com/hzwer/Practical-RIFE, 2024, accessed: 2026-04-01.
[22] A. Criminisi, I. Reid, and A. Zisserman, “Single view metrology,” International Journal of Computer Vision, vol. 40, no. 2, pp. 123–148, 2000.
[23] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
[24] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on image processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[25] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI’81: 7th international joint conference on Artificial intelligence, vol. 2, 1981, pp. 674–679.
