CineMatte: Background Matting for Virtual Production and Beyond

Chen Zhang; Fasheng Chen; Jiangbo Cao; Yuanjian He

arxiv: 2605.18328 · v1 · pith:SKZQIEGS · new · submitted 2026-05-18 · 💻 cs.CV


Pith reviewed 2026-05-20 11:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords background matting · virtual production · cross-attention · vision transformer · foreground extraction · LED volume · 4K dataset · image-guided upsampler


The pith

CineMatte encodes input frames and backgrounds separately with a frozen DINOv3 ViT, then uses cross-attention to predict foreground mattes for virtual production.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops CineMatte to address the difficulty of removing backgrounds from LED virtual production footage so that new backgrounds can be inserted during post-production without labor-intensive manual work. Instead of feeding the input and background together through concatenation, the method runs a shared-weight frozen DINOv3 Vision Transformer on each stream independently and then applies cross-attention between the resulting features to identify the foreground. This design keeps the semantic knowledge learned during pretraining and makes the model more tolerant to changes in the background image. The authors also swap the usual convolutional detail branch for a pretrained image-guided feature upsampler to cut down on boundary artifacts that arise from semantic misalignment. They support the claims with a new real 4K HDR dataset captured on a professional LED stage using green-screen insertion and tracked camera motion, plus tests on public benchmarks that show good generalization to ordinary real-world video.
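The separate-encoding and cross-attention pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `encode` is a hypothetical stand-in for the frozen DINOv3 encoder, the attention is single-head with no learned projections, and taking queries from the input frame and keys/values from the background is an assumption.

```python
import math

def encode(patch):
    """Hypothetical stand-in for the frozen shared-weight encoder.
    The same function is applied to both streams (Siamese); here it
    only L2-normalises the patch vector."""
    norm = math.sqrt(sum(x * x for x in patch)) or 1.0
    return [x / norm for x in patch]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.
    Returns one attended vector per query plus the attention weights."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy data: the first two frame patches match the background,
# the third is a foreground patch with no exact counterpart.
frame_patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
background_patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

frame_tokens = [encode(p) for p in frame_patches]
background_tokens = [encode(p) for p in background_patches]

# Assumption: frame tokens query the background tokens.
attended, weights = cross_attention(frame_tokens, background_tokens, background_tokens)
```

Because both streams pass through the same `encode` function, matching patches land on matching tokens, and the attention weights for a background-like frame token peak on its counterpart. A real model would feed the attended features to a decoder that predicts alpha.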

Core claim

CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional detail branch to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, and

What carries the argument

A Siamese frozen DINOv3 Vision Transformer with shared weights that encodes the input frame and background separately, followed by a cross-attention module to compare the streams and predict the foreground, plus a pretrained image-guided feature upsampler for detail recovery.
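To give a feel for what "image-guided" upsampling means, here is a small sketch that uses joint-bilateral-style weighting as a stand-in. It is not JAFAR and not the paper's pretrained upsampler; the function `guided_upsample`, its parameters `sigma_color` and `sigma_space`, and the 1-D toy signal are all assumptions made for illustration.

```python
import math

def guided_upsample(low_feats, guide_hi, factor, sigma_color=0.1, sigma_space=1.0):
    """Upsample 1-D low-res features to len(guide_hi) using a high-res guide.

    low_feats: one float per low-res cell
    guide_hi:  high-res guide values, len == len(low_feats) * factor
    """
    # Low-res view of the guide: mean of the guide over each cell.
    guide_lo = [sum(guide_hi[j * factor:(j + 1) * factor]) / factor
                for j in range(len(low_feats))]
    out = []
    for i, g in enumerate(guide_hi):
        pos = (i + 0.5) / factor - 0.5  # pixel i in low-res coordinates
        ws = []
        for j, gl in enumerate(guide_lo):
            # Weight is high when the cell looks like this pixel and is nearby.
            w_color = math.exp(-((g - gl) ** 2) / (2 * sigma_color ** 2))
            w_space = math.exp(-((pos - j) ** 2) / (2 * sigma_space ** 2))
            ws.append(w_color * w_space)
        total = sum(ws)  # assumed non-zero for this toy input
        out.append(sum(w * f for w, f in zip(ws, low_feats)) / total)
    return out

# The guide has a sharp edge that falls inside the second low-res cell.
guide = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
low = [0.0, 0.4, 1.0, 1.0]
up = guided_upsample(low, guide, factor=2)
```

In this toy case the upsampled signal switches from roughly 0 to roughly 1 exactly where the guide does, instead of blurring across the cell boundary. That is the general behaviour an image-guided upsampler aims for at matte boundaries.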

If this is right

Foreground mattes stay accurate even when the final background differs from the one shown on the LED volume during capture.
Boundary artifacts decrease because the pretrained upsampler avoids semantic misalignment with the ViT backbone.
Tracked camera trajectories in the dataset let new backgrounds be rendered with correct parallax during later compositing.
Performance on VideoMatte240K and YouTubeMatte indicates the same pipeline works on ordinary real-world footage outside virtual production.
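The background swap these points rely on is the standard compositing equation, out = alpha * F + (1 - alpha) * B, applied per pixel. The sketch below assumes a foreground colour estimate is already available alongside the matte; the function name `composite` and the toy pixel values are illustrative only.

```python
def composite(alpha, foreground, background):
    """Per-pixel compositing: out = alpha * F + (1 - alpha) * B.

    alpha:      list of floats in [0, 1]
    foreground: list of (r, g, b) tuples
    background: list of (r, g, b) tuples for the new background
    """
    return [
        tuple(a * f + (1.0 - a) * b for f, b in zip(fg, bg))
        for a, fg, bg in zip(alpha, foreground, background)
    ]

alpha = [1.0, 0.5, 0.0]                # opaque, half-transparent, empty
fg = [(1.0, 0.0, 0.0)] * 3             # red foreground
new_bg = [(0.0, 0.0, 1.0)] * 3         # blue replacement background
result = composite(alpha, fg, new_bg)
# result == [(1.0, 0.0, 0.0), (0.5, 0.0, 0.5), (0.0, 0.0, 1.0)]
```

Errors in alpha show up directly in the composite, which is why matte accuracy at boundaries matters once the new background differs from the one on the LED wall.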

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The frozen ViT approach could reduce the amount of task-specific labeled data needed when adapting matting to other controlled environments.
The separate-encoding pattern may transfer to related tasks such as video object segmentation where background variability is also high.
If inference can be made faster, the method could support live virtual production pipelines that require on-set matting.

Load-bearing premise

That encoding the input and background separately with a frozen shared-weight DINOv3 ViT and comparing them via cross-attention will reliably preserve semantics and outperform concatenation-based approaches without introducing misalignment artifacts in real VP footage.

What would settle it

A side-by-side evaluation on the CineMatte-4K test set in which a simple concatenation baseline matches or exceeds CineMatte on standard matting metrics such as SAD, MSE, or gradient error would undermine the claim that separate encoding plus cross-attention is necessary for robustness.
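For reference, two of the metrics named here have simple definitions. The sketch below computes them on flattened alpha mattes; the function names are illustrative. Published tables often rescale these numbers (for example SAD in thousands), and since the scaling used in the paper is not stated here, raw values are reported. Gradient error is omitted because it depends on a filter choice not given in this summary.

```python
def sad(pred, gt):
    """Sum of absolute differences between predicted and ground-truth alpha."""
    return sum(abs(p - g) for p, g in zip(pred, gt))

def mse(pred, gt):
    """Mean squared error between predicted and ground-truth alpha."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)

pred_alpha = [1.0, 0.5, 0.0, 0.0]
gt_alpha = [1.0, 1.0, 0.0, 0.0]

print(sad(pred_alpha, gt_alpha))  # 0.5
print(mse(pred_alpha, gt_alpha))  # 0.0625
```

A concatenation baseline and CineMatte would each be scored this way on the same test frames, and the comparison of those scores is what the settling experiment comes down to.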

Figures

Figures reproduced from arXiv: 2605.18328 by Chen Zhang, Fasheng Chen, Jiangbo Cao, Yuanjian He.

**Figure 2.** Creation of the CineMatte-4K dataset […] between the target scene and a green screen; the scene frame serves as the input, and the ground-truth alpha is obtained by manually matting the corresponding green-screen frame. This yields a non-synthetic dataset for background matting and virtual production. For videos, we record tracked camera trajectories together with green-screen foregrounds, enabling later rende…

**Figure 3.** Overview of CineMatte. A frozen Siamese DINOv3 …

**Figure 4.** Real-world results: our method yields crisp boundaries and the most complete human matte, while baselines misclassify the …

**Figure 5.** JAFAR-style feature upsampler. It recovers high …

**Figure 6.** More samples on the CineMatte-4K dataset. Faces are blurred for anonymity.

**Figure 7.** The effect of connecting a high-resolution shortcut from …

Original abstract

LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CineMatte adds a real VP matting dataset with tracked motion and tries cross-attention on frozen DINOv3 encoders, but the architecture's edge over concatenation still needs ablations to hold up.


The main things here are a new non-synthetic 4K dataset captured on actual LED stages and an attempt to handle background shifts without direct concatenation. The dataset stands out because it includes green-screen ground truth for images and camera-tracked video sequences that support parallax when backgrounds are swapped later. That matches real virtual production needs better than most existing matting collections, which tend to be synthetic or lack VP-specific lighting and motion.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CineMatte, a background matting framework for LED Virtual Production. It uses a Siamese frozen DINOv3 ViT with shared weights to encode the input frame and captured background separately, applies cross-attention to predict the foreground while preserving pretrained semantics, and replaces the conventional detail branch with a pretrained image-guided feature upsampler to reduce boundary artifacts. The work also contributes the CineMatte-4K dataset of 4K HDR non-synthetic images (via green-screen) and videos with tracked camera motion for parallax-correct rendering. Claims include superior performance on CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte) with robust generalization to real-world footage.

Significance. If the results hold, the work offers a practical advance for VP post-production by enabling easier background changes. The non-synthetic VP-specific dataset fills a clear resource gap and supports future work on camera-motion-aware matting. The design choice of frozen Siamese DINOv3 plus cross-attention, together with the pretrained upsampler, provides a clean way to leverage semantic priors without introducing misalignment-prone detail branches. These elements, if validated by ablations and metrics, constitute a solid applied contribution to computer vision for media production.

major comments (2)

[Method] Method section (cross-attention design): The central claim that separate encoding with frozen shared-weight DINOv3 ViT followed by cross-attention reliably preserves semantics and improves robustness to background shifts (versus concatenation) is load-bearing. However, the manuscript provides no ablation that isolates the cross-attention module from the pretrained upsampler or the new CineMatte-4K dataset, particularly under LED-induced color or moiré shifts that could misalign the feature spaces. Without this, attribution of gains to the proposed design remains unclear.
[Experiments] Experiments section: The abstract states that CineMatte excels across CineMatte-4K and public benchmarks with robust generalization, yet the description supplies no quantitative metrics, error analysis, or specific comparisons (e.g., boundary error or foreground accuracy under controlled background shifts). This absence prevents verification that the architectural choices deliver the claimed improvements over prior ViT-based matting models.
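The controlled background shifts requested in these comments could be simulated along the following lines. This is a hypothetical harness, not something described in the manuscript: `shift_background`, `difference_matte` and `sweep` are illustrative names, the per-channel gain is only a crude proxy for LED-induced colour shift, and the difference matte is a deliberately naive baseline.

```python
def shift_background(pixels, gain=(1.0, 1.0, 1.0), offset=(0.0, 0.0, 0.0)):
    """Apply a per-channel gain and offset to RGB pixels, clamped to [0, 1]."""
    return [
        tuple(min(1.0, max(0.0, c * g + o)) for c, g, o in zip(px, gain, offset))
        for px in pixels
    ]

def difference_matte(frame, background, threshold=0.1):
    """Naive baseline: a pixel is foreground if it differs from the background."""
    return [
        1.0 if max(abs(f - b) for f, b in zip(fp, bp)) > threshold else 0.0
        for fp, bp in zip(frame, background)
    ]

def sweep(matte_fn, frame, background, gt_alpha, gains):
    """Score a matting function (by SAD) under each background gain."""
    results = {}
    for gain in gains:
        shifted = shift_background(background, gain=gain)
        pred = matte_fn(frame, shifted)
        results[gain] = sum(abs(p - g) for p, g in zip(pred, gt_alpha))
    return results

# Pixel 0 shows the background, pixel 1 shows the foreground.
frame = [(0.2, 0.4, 0.6), (0.9, 0.1, 0.1)]
background = [(0.2, 0.4, 0.6), (0.2, 0.4, 0.6)]
gt_alpha = [0.0, 1.0]

results = sweep(difference_matte, frame, background, gt_alpha,
                gains=[(1.0, 1.0, 1.0), (1.5, 1.5, 1.5)])
# results[(1.0, 1.0, 1.0)] == 0.0 and results[(1.5, 1.5, 1.5)] == 1.0
```

The naive baseline is perfect with an unshifted background and mislabels the background pixel once the gain changes. Running the same sweep over CineMatte and a concatenation variant would show whether the proposed design degrades more gracefully.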

minor comments (2)

[Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., matting accuracy or generalization gap) to support the performance claims.
[Method] Notation for the cross-attention module and the image-guided upsampler should be defined explicitly with equations or a diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and positive review of our manuscript. We address each major comment below and indicate the revisions we will incorporate to improve clarity and rigor.


Referee: [Method] Method section (cross-attention design): The central claim that separate encoding with frozen shared-weight DINOv3 ViT followed by cross-attention reliably preserves semantics and improves robustness to background shifts (versus concatenation) is load-bearing. However, the manuscript provides no ablation that isolates the cross-attention module from the pretrained upsampler or the new CineMatte-4K dataset, particularly under LED-induced color or moiré shifts that could misalign the feature spaces. Without this, attribution of gains to the proposed design remains unclear.

Authors: We appreciate the referee's emphasis on isolating the contribution of the cross-attention module. Our existing comparisons to concatenation-based baselines provide supporting evidence for the design choice, yet we agree that a targeted ablation, specifically one evaluating the cross-attention component independently of the upsampler and under controlled LED-induced color and moiré shifts, would strengthen attribution. We will add this ablation study, including quantitative results on feature alignment and robustness, to the revised manuscript.

revision: yes

Referee: [Experiments] Experiments section: The abstract states that CineMatte excels across CineMatte-4K and public benchmarks with robust generalization, yet the description supplies no quantitative metrics, error analysis, or specific comparisons (e.g., boundary error or foreground accuracy under controlled background shifts). This absence prevents verification that the architectural choices deliver the claimed improvements over prior ViT-based matting models.

Authors: We thank the referee for noting this presentation issue. The experiments section contains tables with quantitative metrics (SAD, MSE, gradient, and connectivity errors) and direct comparisons to prior ViT-based matting models on CineMatte-4K, VideoMatte240K, and YouTubeMatte. To address the concern directly, we will expand the section with a dedicated error analysis subsection that includes boundary-specific metrics and results under controlled background shifts, ensuring all claims are explicitly linked to the reported numbers.

revision: yes

Circularity Check

0 steps flagged

No circularity detected in architectural or empirical claims


The paper describes an architectural proposal (Siamese frozen DINOv3 encoders with cross-attention replacing concatenation, plus a pretrained image-guided upsampler replacing a convolutional detail branch) and introduces the CineMatte-4K dataset, then reports performance on that dataset plus public benchmarks. No equations, parameter fits, or derivations are shown that reduce the claimed predictions or robustness improvements to the inputs by construction. The central design choices are presented as independent innovations whose benefits are asserted via external empirical results rather than self-referential definitions or self-citation chains. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the assumption that a frozen pretrained DINOv3 model retains useful semantics for foreground-background separation and that the new real-world VP dataset is representative; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: A frozen DINOv3 Vision Transformer preserves semantic features that are transferable to background matting without fine-tuning.
The method freezes the model to retain pretrained semantics and relies on this for robustness.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Passage: "CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Passage: "We instead replace it with a pretrained, image-guided feature upsampler"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 6 internal anchors

[1]

Transmat- ting: Enhancing transparent objects matting with transform- ers

Huanqia Cai, Fanglei Xue, Lele Xu, and Lili Guo. Transmat- ting: Enhancing transparent objects matting with transform- ers. InProc. Eur. Conf. on Computer Vision (ECCV), 2022. 3

work page 2022
[2]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2, 3

work page 2021
[3]

Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation.arXiv preprint arXiv:1706.05587,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Knn mat- ting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175–2188, 2013

Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. Knn mat- ting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175–2188, 2013. 2, 3, 7

work page 2013
[5]

Jafar: Jack up any feature at any resolution.arXiv preprint arXiv:2506.11136, 2025

Paul Couairon, Lo ¨ıck Chambon, Louis Serrano, Jean- Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. Jafar: Jack up any feature at any resolution.arXiv preprint arXiv:2506.11136, 2025. 3, 4, 5

work page arXiv 2025
[6]

Learning affinity- aware upsampling for deep image matting

Yutong Dai, Hao Lu, and Chunhua Shen. Learning affinity- aware upsampling for deep image matting. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 6841–6850, 2021. 3

work page 2021
[7]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 7

work page 2009
[8]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2010
[9]

Perceptually motivated benchmark for video matting

Mikhail Erofeev, Yury Gitman, Dmitriy S Vatolin, Alexey Fedorov, and Jue Wang. Perceptually motivated benchmark for video matting. InBMVC, page 2, 2015. 8

work page 2015
[10]

Featup: A model- agnostic framework for features at any resolution.arXiv preprint arXiv:2403.10516, 2024

Stephanie Fu, Mark Hamilton, Laura Brandt, Axel Feldman, Zhoutong Zhang, and William T Freeman. Featup: A model- agnostic framework for features at any resolution.arXiv preprint arXiv:2403.10516, 2024. 3

work page arXiv 2024
[11]

Eduardo S. L. Gastal and Manuel M. Oliveira. Shared sam- pling for real-time alpha matting.Computer Graphics Fo- rum, 29(2):575–584, 2010. 3

work page 2010
[12]

A global sampling method for alpha matting

Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun. A global sampling method for alpha matting. InProc. IEEE Conf. on Computer Vision and Pat- tern Recognition (CVPR), pages 2049–2056, 2011. 3

work page 2049
[13]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2, 3

work page 2016
[14]

End-to-end video matting with trimap propagation

Wei-Lun Huang and Ming-Sui Lee. End-to-end video matting with trimap propagation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14337–14347, 2023. 3

work page 2023
[15]

Maggie: Masked guided gradual human in- stance matting

Chuong Huynh, Seoung Wug Oh, Abhinav Shrivastava, and Joon-Young Lee. Maggie: Masked guided gradual human in- stance matting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3870– 3879, 2024

work page 2024
[16]

Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Ryn- son W.H. Lau. Modnet: Real-time trimap-free portrait mat- ting via objective decomposition. InProc. AAAI Conf. on Artificial Intelligence (AAAI), 2022. 3, 6, 7

work page 2022
[17]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 3

work page 2023
[18]

Nonlocal matting

Philip Lee and Ying Wu. Nonlocal matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2193–2200, 2011. 3

work page 2011
[19]

A closed-form solution to natural image matting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242,

Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242,

work page
[20]

Maybank, and Dacheng Tao

Jizhizi Li, Jing Zhang, Stephen J. Maybank, and Dacheng Tao. Bridging composite and real: Towards end-to-end deep image matting.arXiv preprint arXiv:2010.16188, 2020. 3, 6, 8

work page arXiv 2010
[21]

Privacy- preserving portrait matting

Jizhizi Li, Sihan Ma, Jing Zhang, and Dacheng Tao. Privacy- preserving portrait matting. InProceedings of the 29th ACM international conference on multimedia, pages 3501–3509,

work page
[22]

Deep automatic natural image matting.arXiv preprint arXiv:2107.07235,

Jizhizi Li, Jing Zhang, and Dacheng Tao. Deep automatic natural image matting.arXiv preprint arXiv:2107.07235,

work page arXiv
[23]

Vmformer: End-to-end video matting with transformer

Jiachen Li, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Yunchao Wei, and Humphrey Shi. Vmformer: End-to-end video matting with transformer. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6678–6687, 2024. 3

work page 2024
[24]

Natural image matting via guided contextual attention

Yaoyi Li and Hongtao Lu. Natural image matting via guided contextual attention. InProc. AAAI Conf. on Artificial Intel- ligence (AAAI), pages 11450–11457, 2020. 3

work page 2020
[25]

Exploring plain vision transformer backbones for object de- tection

Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object de- tection. InEuropean conference on computer vision, pages 280–296. Springer, 2022. 3

work page 2022
[26]

Refinenet: Multi-path refinement networks for high- resolution semantic segmentation

Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high- resolution semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1925–1934, 2017. 3

work page 1925
[27]

Seitz, and Ira Kemelmacher- Shlizerman

Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steven M. Seitz, and Ira Kemelmacher- Shlizerman. Real-time high-resolution background matting. arXiv preprint arXiv:2012.07810, 2020. 2, 3, 6, 7, 8

work page arXiv 2012
[28]

Robust high-resolution video matting with tempo- ral guidance

Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with tempo- ral guidance. InProc. IEEE/CVF Winter Conf. on Applica- tions of Computer Vision (WACV), pages 238–247, 2022. 3, 7 9

work page 2022
[29]

In- dices matter: Learning to index for deep image matting

Hao Lu, Yutong Dai, Chunhua Shen, and Songcen Xu. In- dices matter: Learning to index for deep image matting. InProc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), pages 3266–3275, 2019. 3

work page 2019
[30]

& Yadav, S

Achleshwar Luthra, Harsh Sulakhe, Tanish Mittal, Abhishek Iyer, and Santosh Yadav. Eformer: Edge enhancement based transformer for medical image denoising.arXiv preprint arXiv:2109.08044, 2021. 3

work page arXiv 2021
[31]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e J´egou, Julien Mairal, P...

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Matteformer: Transformer-based image mat- ting via prior-tokens

GyuTae Park, SungJoon Son, JaeYoung Yoo, SeHo Kim, and Nojun Kwak. Matteformer: Transformer-based image mat- ting via prior-tokens. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 11696–11706, 2022. 2, 3

work page 2022
[33]

How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022

Namuk Park and Songkuk Kim. How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022. 2, 5

work page arXiv 2022
[34]

A survey on vir- tual production and the future of compositing technologies

Filipe Pires, Rui Silva, and Rui Raposo. A survey on vir- tual production and the future of compositing technologies. Avanca Cinema Journal, 21(692-9), 2022. 2

work page 2022
[35]

Attention-guided hi- erarchical structure aggregation for image matting

Yu Qiao, Yuhao Liu, Xin Yang, Dongsheng Zhou, Mingliang Xu, Qiang Zhang, and Xiaopeng Wei. Attention-guided hi- erarchical structure aggregation for image matting. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 13676–13685, 2020. 6

work page 2020
[36]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 3, 5

work page 2021
[37]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

High resolution matting via interactive trimap segmentation

Christoph Rhemann, Carsten Rother, Alex Rav-Acha, and Toby Sharp. High resolution matting via interactive trimap segmentation. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. 3

work page 2008
[39]

A perceptu- ally motivated online benchmark for image matting

Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and Pamela Rott. A perceptu- ally motivated online benchmark for image matting. In2009 IEEE conference on computer vision and pattern recogni- tion, pages 1826–1833. IEEE, 2009. 8

work page 2009
[40]

A spatially varying psf-based prior for al- pha matting

Christoph Rhemann, Carsten Rother, Pushmeet Kohli, and Margrit Gelautz. A spatially varying psf-based prior for al- pha matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2149–2156, 2010. 3

work page 2010
[41]

Ruzon and Carlo Tomasi

Mark A. Ruzon and Carlo Tomasi. Alpha estimation in nat- ural images. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 18–25, 2000. 3

work page 2000
[42]

Seitz, and Ira Kemelmacher-Shlizerman

Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Back- ground matting: The world is your green screen. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 2288–2297, 2020. 2, 3

work page 2020
[43]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 3, 4, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

High-Resolution Representations for Labeling Pixels and Regions

Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions.arXiv preprint arXiv:1904.04514, 2019. 3

work page internal anchor Pith review Pith/arXiv arXiv 1904
[45]

Lift: A surprisingly simple lightweight feature transform for dense vit descriptors

Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. Lift: A surprisingly simple lightweight feature transform for dense vit descriptors. InEuropean Conference on Computer Vision, pages 110–128. Springer, 2024. 3

work page 2024
[46]

The emergence of virtual production–a research agenda.Convergence, 30(5):1557– 1574, 2024

Jon Swords and Nina Willment. The emergence of virtual production–a research agenda.Convergence, 30(5):1557– 1574, 2024. 2

work page 2024
[47]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2

work page 2017
[48]

Jue Wang and Michael F. Cohen. An iterative optimization approach for unified image segmentation and matting. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), pages 936–943, 2005. 3

work page 2005
[49]

Jue Wang and Michael F. Cohen. Optimized color sampling for robust matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007. 3

work page 2007
[50]

Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Processing Systems, 35:3502–3516, 2022

Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Ro- main Br´egier, Yohann Cabon, Vaibhav Arora, Leonid Ants- feld, Boris Chidlovskii, Gabriela Csurka, and J ´erˆome Re- vaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Processing Systems, 35:3502–3516, 2022. 3

work page 2022
[51]

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme Revaud. Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17969–17...

[52]

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021. 3

[53]

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500,

[54]

Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2970–2979,

[55]

Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, and Chen Change Loy. Matanyone: Stable video matting with consistent memory propagation. arXiv preprint arXiv:2501.14677, 2025. 2, 3, 6, 7, 8

[56]

Jingfeng Yao, Xinggang Wang, Shusheng Yang, and Baoyuan Wang. Vitmatte: Boosting image matting with pre-trained plain vision transformers. Information Fusion, 103:102091, 2024. 2, 3, 5, 7, 8

[57]

Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, and Alan Yuille. Mask guided matting via progressive refinement network. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1154–1163, 2021. 2, 3, 7

[58]

Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 3

[59]

Minghao Zhou, Hong Wang, Yefeng Zheng, and Deyu Meng. A refreshed similarity-based upsampler for direct high-ratio feature upsampling. arXiv preprint arXiv:2407.02283, 2024. 3
