pith. sign in

arxiv: 2605.18328 · v1 · pith:SKZQIEGSnew · submitted 2026-05-18 · 💻 cs.CV

CineMatte: Background Matting for Virtual Production and Beyond

Pith reviewed 2026-05-20 11:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords background mattingvirtual productioncross-attentionvision transformerforeground extractionLED volume4K datasetimage-guided upsampler
0
0 comments X

The pith

CineMatte encodes input frames and backgrounds separately with a frozen DINOv3 ViT then uses cross-attention to predict foreground mattes for virtual production.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops CineMatte to address the difficulty of removing backgrounds from LED virtual production footage so that new backgrounds can be inserted during post-production without labor-intensive manual work. Instead of feeding the input and background together through concatenation, the method runs a shared-weight frozen DINOv3 Vision Transformer on each stream independently and then applies cross-attention between the resulting features to identify the foreground. This design keeps the semantic knowledge learned during pretraining and makes the model more tolerant to changes in the background image. The authors also swap the usual convolutional detail branch for a pretrained image-guided feature upsampler to cut down on boundary artifacts that arise from semantic misalignment. They support the claims with a new real 4K HDR dataset captured on a professional LED stage using green-screen insertion and tracked camera motion, plus tests on public benchmarks that show good generalization to ordinary real-world video.

Core claim

CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional detail branch to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, and

What carries the argument

A Siamese frozen DINOv3 Vision Transformer with shared weights that encodes the input frame and background separately, followed by a cross-attention module to compare the streams and predict the foreground, plus a pretrained image-guided feature upsampler for detail recovery.

If this is right

  • Foreground mattes stay accurate even when the final background differs from the one shown on the LED volume during capture.
  • Boundary artifacts decrease because the pretrained upsampler avoids semantic misalignment with the ViT backbone.
  • Tracked camera trajectories in the dataset let new backgrounds be rendered with correct parallax during later compositing.
  • Performance on VideoMatte240K and YouTubeMatte indicates the same pipeline works on ordinary real-world footage outside virtual production.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The frozen ViT approach could reduce the amount of task-specific labeled data needed when adapting matting to other controlled environments.
  • The separate-encoding pattern may transfer to related tasks such as video object segmentation where background variability is also high.
  • If inference can be made faster, the method could support live virtual production pipelines that require on-set matting.

Load-bearing premise

That encoding the input and background separately with a frozen shared-weight DINOv3 ViT and comparing them via cross-attention will reliably preserve semantics and outperform concatenation-based approaches without introducing misalignment artifacts in real VP footage.

What would settle it

A side-by-side evaluation on the CineMatte-4K test set in which a simple concatenation baseline matches or exceeds CineMatte on standard matting metrics such as SAD, MSE, or gradient error would undermine the claim that separate encoding plus cross-attention is necessary for robustness.

Figures

Figures reproduced from arXiv: 2605.18328 by Chen Zhang, Fasheng Chen, Jiangbo Cao, Yuanjian He.

Figure 1
Figure 1. Figure 1: We propose CineMatte, a background matting method for virtual production and beyond. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Creation of the CineMatte-4K dataset tween the target scene and a green screen; the scene frame serves as the input, and the ground-truth alpha is obtained by manually matting the corresponding green-screen frame. This yields a non-synthetic dataset for background matting and virtual production. For videos, we record tracked cam￾era trajectories together with green-screen foregrounds, en￾abling later rende… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of CineMatte. A frozen Siamese DINOv3 [ [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Real-world results: our method yields crisp boundaries and the most complete human matte, while baselines misclassify the [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: JAFAR-style feature upsampler. It recovers high [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: More samples on the CineMatte-4K dataset. Faces are blurred for anonymity. [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The effect of connecting a high-resolution shortcut from [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CineMatte, a background matting framework for LED Virtual Production. It uses a Siamese frozen DINOv3 ViT with shared weights to encode the input frame and captured background separately, applies cross-attention to predict the foreground while preserving pretrained semantics, and replaces the conventional detail branch with a pretrained image-guided feature upsampler to reduce boundary artifacts. The work also contributes the CineMatte-4K dataset of 4K HDR non-synthetic images (via green-screen) and videos with tracked camera motion for parallax-correct rendering. Claims include superior performance on CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte) with robust generalization to real-world footage.

Significance. If the results hold, the work offers a practical advance for VP post-production by enabling easier background changes. The non-synthetic VP-specific dataset fills a clear resource gap and supports future work on camera-motion-aware matting. The design choice of frozen Siamese DINOv3 plus cross-attention, together with the pretrained upsampler, provides a clean way to leverage semantic priors without introducing misalignment-prone detail branches. These elements, if validated by ablations and metrics, constitute a solid applied contribution to computer vision for media production.

major comments (2)
  1. [Method] Method section (cross-attention design): The central claim that separate encoding with frozen shared-weight DINOv3 ViT followed by cross-attention reliably preserves semantics and improves robustness to background shifts (versus concatenation) is load-bearing. However, the manuscript provides no ablation that isolates the cross-attention module from the pretrained upsampler or the new CineMatte-4K dataset, particularly under LED-induced color or moiré shifts that could misalign the feature spaces. Without this, attribution of gains to the proposed design remains unclear.
  2. [Experiments] Experiments section: The abstract states that CineMatte excels across CineMatte-4K and public benchmarks with robust generalization, yet the description supplies no quantitative metrics, error analysis, or specific comparisons (e.g., boundary error or foreground accuracy under controlled background shifts). This absence prevents verification that the architectural choices deliver the claimed improvements over prior ViT-based matting models.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., matting accuracy or generalization gap) to support the performance claims.
  2. [Method] Notation for the cross-attention module and the image-guided upsampler should be defined explicitly with equations or a diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and positive review of our manuscript. We address each major comment below and indicate the revisions we will incorporate to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Method] Method section (cross-attention design): The central claim that separate encoding with frozen shared-weight DINOv3 ViT followed by cross-attention reliably preserves semantics and improves robustness to background shifts (versus concatenation) is load-bearing. However, the manuscript provides no ablation that isolates the cross-attention module from the pretrained upsampler or the new CineMatte-4K dataset, particularly under LED-induced color or moiré shifts that could misalign the feature spaces. Without this, attribution of gains to the proposed design remains unclear.

    Authors: We appreciate the referee's emphasis on isolating the contribution of the cross-attention module. Our existing comparisons to concatenation-based baselines provide supporting evidence for the design choice, yet we agree that a targeted ablation—specifically evaluating the cross-attention component independently of the upsampler and under controlled LED-induced color and moiré shifts—would strengthen attribution. We will add this ablation study, including quantitative results on feature alignment and robustness, to the revised manuscript. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract states that CineMatte excels across CineMatte-4K and public benchmarks with robust generalization, yet the description supplies no quantitative metrics, error analysis, or specific comparisons (e.g., boundary error or foreground accuracy under controlled background shifts). This absence prevents verification that the architectural choices deliver the claimed improvements over prior ViT-based matting models.

    Authors: We thank the referee for noting this presentation issue. The experiments section contains tables with quantitative metrics (SAD, MSE, gradient, and connectivity errors) and direct comparisons to prior ViT-based matting models on CineMatte-4K, VideoMatte240K, and YouTubeMatte. To address the concern directly, we will expand the section with a dedicated error analysis subsection that includes boundary-specific metrics and results under controlled background shifts, ensuring all claims are explicitly linked to the reported numbers. revision: yes

Circularity Check

0 steps flagged

No circularity detected in architectural or empirical claims

full rationale

The paper describes an architectural proposal (Siamese frozen DINOv3 encoders with cross-attention replacing concatenation, plus a pretrained image-guided upsampler replacing a convolutional detail branch) and introduces the CineMatte-4K dataset, then reports performance on that dataset plus public benchmarks. No equations, parameter fits, or derivations are shown that reduce the claimed predictions or robustness improvements to the inputs by construction. The central design choices are presented as independent innovations whose benefits are asserted via external empirical results rather than self-referential definitions or self-citation chains. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach depends on the assumption that a frozen pretrained DINOv3 model retains useful semantics for foreground-background separation and that the new real-world VP dataset is representative; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption A frozen DINOv3 Vision Transformer preserves semantic features that are transferable to background matting without fine-tuning
    The method freezes the model to retain pretrained semantics and relies on this for robustness.

pith-pipeline@v0.9.0 · 5806 in / 1375 out tokens · 51851 ms · 2026-05-20T11:30:15.554902+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 6 internal anchors

  1. [1]

    Transmat- ting: Enhancing transparent objects matting with transform- ers

    Huanqia Cai, Fanglei Xue, Lele Xu, and Lili Guo. Transmat- ting: Enhancing transparent objects matting with transform- ers. InProc. Eur. Conf. on Computer Vision (ECCV), 2022. 3

  2. [2]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2, 3

  3. [3]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation.arXiv preprint arXiv:1706.05587,

  4. [4]

    Knn mat- ting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175–2188, 2013

    Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. Knn mat- ting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175–2188, 2013. 2, 3, 7

  5. [5]

    Jafar: Jack up any feature at any resolution.arXiv preprint arXiv:2506.11136, 2025

    Paul Couairon, Lo ¨ıck Chambon, Louis Serrano, Jean- Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. Jafar: Jack up any feature at any resolution.arXiv preprint arXiv:2506.11136, 2025. 3, 4, 5

  6. [6]

    Learning affinity- aware upsampling for deep image matting

    Yutong Dai, Hao Lu, and Chunhua Shen. Learning affinity- aware upsampling for deep image matting. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 6841–6850, 2021. 3

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 7

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 3, 4

  9. [9]

    Perceptually motivated benchmark for video matting

    Mikhail Erofeev, Yury Gitman, Dmitriy S Vatolin, Alexey Fedorov, and Jue Wang. Perceptually motivated benchmark for video matting. InBMVC, page 2, 2015. 8

  10. [10]

    Featup: A model- agnostic framework for features at any resolution.arXiv preprint arXiv:2403.10516, 2024

    Stephanie Fu, Mark Hamilton, Laura Brandt, Axel Feldman, Zhoutong Zhang, and William T Freeman. Featup: A model- agnostic framework for features at any resolution.arXiv preprint arXiv:2403.10516, 2024. 3

  11. [11]

    Eduardo S. L. Gastal and Manuel M. Oliveira. Shared sam- pling for real-time alpha matting.Computer Graphics Fo- rum, 29(2):575–584, 2010. 3

  12. [12]

    A global sampling method for alpha matting

    Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun. A global sampling method for alpha matting. InProc. IEEE Conf. on Computer Vision and Pat- tern Recognition (CVPR), pages 2049–2056, 2011. 3

  13. [13]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2, 3

  14. [14]

    End-to-end video matting with trimap propagation

    Wei-Lun Huang and Ming-Sui Lee. End-to-end video matting with trimap propagation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14337–14347, 2023. 3

  15. [15]

    Maggie: Masked guided gradual human in- stance matting

    Chuong Huynh, Seoung Wug Oh, Abhinav Shrivastava, and Joon-Young Lee. Maggie: Masked guided gradual human in- stance matting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3870– 3879, 2024

  16. [16]

    Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Ryn- son W.H. Lau. Modnet: Real-time trimap-free portrait mat- ting via objective decomposition. InProc. AAAI Conf. on Artificial Intelligence (AAAI), 2022. 3, 6, 7

  17. [17]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 3

  18. [18]

    Nonlocal matting

    Philip Lee and Ying Wu. Nonlocal matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2193–2200, 2011. 3

  19. [19]

    A closed-form solution to natural image matting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242,

    Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242,

  20. [20]

    Maybank, and Dacheng Tao

    Jizhizi Li, Jing Zhang, Stephen J. Maybank, and Dacheng Tao. Bridging composite and real: Towards end-to-end deep image matting.arXiv preprint arXiv:2010.16188, 2020. 3, 6, 8

  21. [21]

    Privacy- preserving portrait matting

    Jizhizi Li, Sihan Ma, Jing Zhang, and Dacheng Tao. Privacy- preserving portrait matting. InProceedings of the 29th ACM international conference on multimedia, pages 3501–3509,

  22. [22]

    Deep automatic natural image matting.arXiv preprint arXiv:2107.07235,

    Jizhizi Li, Jing Zhang, and Dacheng Tao. Deep automatic natural image matting.arXiv preprint arXiv:2107.07235,

  23. [23]

    Vmformer: End-to-end video matting with transformer

    Jiachen Li, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Yunchao Wei, and Humphrey Shi. Vmformer: End-to-end video matting with transformer. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6678–6687, 2024. 3

  24. [24]

    Natural image matting via guided contextual attention

    Yaoyi Li and Hongtao Lu. Natural image matting via guided contextual attention. InProc. AAAI Conf. on Artificial Intel- ligence (AAAI), pages 11450–11457, 2020. 3

  25. [25]

    Exploring plain vision transformer backbones for object de- tection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object de- tection. InEuropean conference on computer vision, pages 280–296. Springer, 2022. 3

  26. [26]

    Refinenet: Multi-path refinement networks for high- resolution semantic segmentation

    Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high- resolution semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1925–1934, 2017. 3

  27. [27]

    Seitz, and Ira Kemelmacher- Shlizerman

    Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steven M. Seitz, and Ira Kemelmacher- Shlizerman. Real-time high-resolution background matting. arXiv preprint arXiv:2012.07810, 2020. 2, 3, 6, 7, 8

  28. [28]

    Robust high-resolution video matting with tempo- ral guidance

    Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with tempo- ral guidance. InProc. IEEE/CVF Winter Conf. on Applica- tions of Computer Vision (WACV), pages 238–247, 2022. 3, 7 9

  29. [29]

    In- dices matter: Learning to index for deep image matting

    Hao Lu, Yutong Dai, Chunhua Shen, and Songcen Xu. In- dices matter: Learning to index for deep image matting. InProc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), pages 3266–3275, 2019. 3

  30. [30]

    & Yadav, S

    Achleshwar Luthra, Harsh Sulakhe, Tanish Mittal, Abhishek Iyer, and Santosh Yadav. Eformer: Edge enhancement based transformer for medical image denoising.arXiv preprint arXiv:2109.08044, 2021. 3

  31. [31]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e J´egou, Julien Mairal, P...

  32. [32]

    Matteformer: Transformer-based image mat- ting via prior-tokens

    GyuTae Park, SungJoon Son, JaeYoung Yoo, SeHo Kim, and Nojun Kwak. Matteformer: Transformer-based image mat- ting via prior-tokens. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 11696–11706, 2022. 2, 3

  33. [33]

    How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022

    Namuk Park and Songkuk Kim. How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022. 2, 5

  34. [34]

    A survey on vir- tual production and the future of compositing technologies

    Filipe Pires, Rui Silva, and Rui Raposo. A survey on vir- tual production and the future of compositing technologies. Avanca Cinema Journal, 21(692-9), 2022. 2

  35. [35]

    Attention-guided hi- erarchical structure aggregation for image matting

    Yu Qiao, Yuhao Liu, Xin Yang, Dongsheng Zhou, Mingliang Xu, Qiang Zhang, and Xiaopeng Wei. Attention-guided hi- erarchical structure aggregation for image matting. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 13676–13685, 2020. 6

  36. [36]

    Vi- sion transformers for dense prediction

    Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 3, 5

  37. [37]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 8

  38. [38]

    High resolution matting via interactive trimap segmentation

    Christoph Rhemann, Carsten Rother, Alex Rav-Acha, and Toby Sharp. High resolution matting via interactive trimap segmentation. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. 3

  39. [39]

    A perceptu- ally motivated online benchmark for image matting

    Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and Pamela Rott. A perceptu- ally motivated online benchmark for image matting. In2009 IEEE conference on computer vision and pattern recogni- tion, pages 1826–1833. IEEE, 2009. 8

  40. [40]

    A spatially varying psf-based prior for al- pha matting

    Christoph Rhemann, Carsten Rother, Pushmeet Kohli, and Margrit Gelautz. A spatially varying psf-based prior for al- pha matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2149–2156, 2010. 3

  41. [41]

    Ruzon and Carlo Tomasi

    Mark A. Ruzon and Carlo Tomasi. Alpha estimation in nat- ural images. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 18–25, 2000. 3

  42. [42]

    Seitz, and Ira Kemelmacher-Shlizerman

    Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Back- ground matting: The world is your green screen. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 2288–2297, 2020. 2, 3

  43. [43]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 3, 4, 8

  44. [44]

    High-Resolution Representations for Labeling Pixels and Regions

    Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions.arXiv preprint arXiv:1904.04514, 2019. 3

  45. [45]

    Lift: A surprisingly simple lightweight feature transform for dense vit descriptors

    Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. Lift: A surprisingly simple lightweight feature transform for dense vit descriptors. InEuropean Conference on Computer Vision, pages 110–128. Springer, 2024. 3

  46. [46]

    The emergence of virtual production–a research agenda.Convergence, 30(5):1557– 1574, 2024

    Jon Swords and Nina Willment. The emergence of virtual production–a research agenda.Convergence, 30(5):1557– 1574, 2024. 2

  47. [47]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2

  48. [48]

    Jue Wang and Michael F. Cohen. An iterative optimization approach for unified image segmentation and matting. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), pages 936–943, 2005. 3

  49. [49]

    Jue Wang and Michael F. Cohen. Optimized color sampling for robust matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007. 3

  50. [50]

    Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Processing Systems, 35:3502–3516, 2022

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Ro- main Br´egier, Yohann Cabon, Vaibhav Arora, Leonid Ants- feld, Boris Chidlovskii, Gabriela Csurka, and J ´erˆome Re- vaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Processing Systems, 35:3502–3516, 2022. 3

  51. [51]

    Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow

    Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Br ´egier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and J ´erˆome Revaud. Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 17969–17...

  52. [52]

    Segformer: Simple and efficient design for semantic segmentation with transform- ers.Advances in neural information processing systems, 34: 12077–12090, 2021

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers.Advances in neural information processing systems, 34: 12077–12090, 2021. 3

  53. [53]

    Aggregated residual transformations for deep neural networks

    Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500,

  54. [54]

    Deep image matting

    Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. InProc. IEEE Conf. on Computer 10 Vision and Pattern Recognition (CVPR), pages 2970–2979,

  55. [55]

    Matanyone: Stable video mat- ting with consistent memory propagation.arXiv preprint arXiv:2501.14677, 2025

    Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, and Chen Change Loy. Matanyone: Stable video mat- ting with consistent memory propagation.arXiv preprint arXiv:2501.14677, 2025. 2, 3, 6, 7, 8

  56. [56]

    Vitmatte: Boosting image matting with pre- trained plain vision transformers.Information Fusion, 103: 102091, 2024

    Jingfeng Yao, Xinggang Wang, Shusheng Yang, and Baoyuan Wang. Vitmatte: Boosting image matting with pre- trained plain vision transformers.Information Fusion, 103: 102091, 2024. 2, 3, 5, 7, 8

  57. [57]

    Mask guided matting via progressive refinement network

    Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, and Alan Yuille. Mask guided matting via progressive refinement network. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 1154–1163, 2021. 2, 3, 7

  58. [58]

    Pyramid scene parsing network

    Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 3

  59. [59]

    A refreshed similarity-based upsampler for di- rect high-ratio feature upsampling.arXiv preprint arXiv:2407.02283, 2024

    Minghao Zhou, Hong Wang, Yefeng Zheng, and Deyu Meng. A refreshed similarity-based upsampler for di- rect high-ratio feature upsampling.arXiv preprint arXiv:2407.02283, 2024. 3 11