CineMatte: Background Matting for Virtual Production and Beyond
Pith reviewed 2026-05-20 11:30 UTC · model grok-4.3
The pith
CineMatte encodes input frames and backgrounds separately with a frozen DINOv3 ViT then uses cross-attention to predict foreground mattes for virtual production.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional detail branch to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, and
What carries the argument
A Siamese frozen DINOv3 Vision Transformer with shared weights that encodes the input frame and background separately, followed by a cross-attention module to compare the streams and predict the foreground, plus a pretrained image-guided feature upsampler for detail recovery.
If this is right
- Foreground mattes stay accurate even when the final background differs from the one shown on the LED volume during capture.
- Boundary artifacts decrease because the pretrained upsampler avoids semantic misalignment with the ViT backbone.
- Tracked camera trajectories in the dataset let new backgrounds be rendered with correct parallax during later compositing.
- Performance on VideoMatte240K and YouTubeMatte indicates the same pipeline works on ordinary real-world footage outside virtual production.
Where Pith is reading between the lines
- The frozen ViT approach could reduce the amount of task-specific labeled data needed when adapting matting to other controlled environments.
- The separate-encoding pattern may transfer to related tasks such as video object segmentation where background variability is also high.
- If inference can be made faster, the method could support live virtual production pipelines that require on-set matting.
Load-bearing premise
That encoding the input and background separately with a frozen shared-weight DINOv3 ViT and comparing them via cross-attention will reliably preserve semantics and outperform concatenation-based approaches without introducing misalignment artifacts in real VP footage.
What would settle it
A side-by-side evaluation on the CineMatte-4K test set in which a simple concatenation baseline matches or exceeds CineMatte on standard matting metrics such as SAD, MSE, or gradient error would undermine the claim that separate encoding plus cross-attention is necessary for robustness.
Figures
read the original abstract
LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CineMatte, a background matting framework for LED Virtual Production. It uses a Siamese frozen DINOv3 ViT with shared weights to encode the input frame and captured background separately, applies cross-attention to predict the foreground while preserving pretrained semantics, and replaces the conventional detail branch with a pretrained image-guided feature upsampler to reduce boundary artifacts. The work also contributes the CineMatte-4K dataset of 4K HDR non-synthetic images (via green-screen) and videos with tracked camera motion for parallax-correct rendering. Claims include superior performance on CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte) with robust generalization to real-world footage.
Significance. If the results hold, the work offers a practical advance for VP post-production by enabling easier background changes. The non-synthetic VP-specific dataset fills a clear resource gap and supports future work on camera-motion-aware matting. The design choice of frozen Siamese DINOv3 plus cross-attention, together with the pretrained upsampler, provides a clean way to leverage semantic priors without introducing misalignment-prone detail branches. These elements, if validated by ablations and metrics, constitute a solid applied contribution to computer vision for media production.
major comments (2)
- [Method] Method section (cross-attention design): The central claim that separate encoding with frozen shared-weight DINOv3 ViT followed by cross-attention reliably preserves semantics and improves robustness to background shifts (versus concatenation) is load-bearing. However, the manuscript provides no ablation that isolates the cross-attention module from the pretrained upsampler or the new CineMatte-4K dataset, particularly under LED-induced color or moiré shifts that could misalign the feature spaces. Without this, attribution of gains to the proposed design remains unclear.
- [Experiments] Experiments section: The abstract states that CineMatte excels across CineMatte-4K and public benchmarks with robust generalization, yet the description supplies no quantitative metrics, error analysis, or specific comparisons (e.g., boundary error or foreground accuracy under controlled background shifts). This absence prevents verification that the architectural choices deliver the claimed improvements over prior ViT-based matting models.
minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., matting accuracy or generalization gap) to support the performance claims.
- [Method] Notation for the cross-attention module and the image-guided upsampler should be defined explicitly with equations or a diagram for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and positive review of our manuscript. We address each major comment below and indicate the revisions we will incorporate to improve clarity and rigor.
read point-by-point responses
-
Referee: [Method] Method section (cross-attention design): The central claim that separate encoding with frozen shared-weight DINOv3 ViT followed by cross-attention reliably preserves semantics and improves robustness to background shifts (versus concatenation) is load-bearing. However, the manuscript provides no ablation that isolates the cross-attention module from the pretrained upsampler or the new CineMatte-4K dataset, particularly under LED-induced color or moiré shifts that could misalign the feature spaces. Without this, attribution of gains to the proposed design remains unclear.
Authors: We appreciate the referee's emphasis on isolating the contribution of the cross-attention module. Our existing comparisons to concatenation-based baselines provide supporting evidence for the design choice, yet we agree that a targeted ablation—specifically evaluating the cross-attention component independently of the upsampler and under controlled LED-induced color and moiré shifts—would strengthen attribution. We will add this ablation study, including quantitative results on feature alignment and robustness, to the revised manuscript. revision: yes
-
Referee: [Experiments] Experiments section: The abstract states that CineMatte excels across CineMatte-4K and public benchmarks with robust generalization, yet the description supplies no quantitative metrics, error analysis, or specific comparisons (e.g., boundary error or foreground accuracy under controlled background shifts). This absence prevents verification that the architectural choices deliver the claimed improvements over prior ViT-based matting models.
Authors: We thank the referee for noting this presentation issue. The experiments section contains tables with quantitative metrics (SAD, MSE, gradient, and connectivity errors) and direct comparisons to prior ViT-based matting models on CineMatte-4K, VideoMatte240K, and YouTubeMatte. To address the concern directly, we will expand the section with a dedicated error analysis subsection that includes boundary-specific metrics and results under controlled background shifts, ensuring all claims are explicitly linked to the reported numbers. revision: yes
Circularity Check
No circularity detected in architectural or empirical claims
full rationale
The paper describes an architectural proposal (Siamese frozen DINOv3 encoders with cross-attention replacing concatenation, plus a pretrained image-guided upsampler replacing a convolutional detail branch) and introduces the CineMatte-4K dataset, then reports performance on that dataset plus public benchmarks. No equations, parameter fits, or derivations are shown that reduce the claimed predictions or robustness improvements to the inputs by construction. The central design choices are presented as independent innovations whose benefits are asserted via external empirical results rather than self-referential definitions or self-citation chains. The derivation chain is therefore self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A frozen DINOv3 Vision Transformer preserves semantic features that are transferable to background matting without fine-tuning
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We instead replace it with a pretrained, image-guided feature upsampler
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Transmat- ting: Enhancing transparent objects matting with transform- ers
Huanqia Cai, Fanglei Xue, Lele Xu, and Lili Guo. Transmat- ting: Enhancing transparent objects matting with transform- ers. InProc. Eur. Conf. on Computer Vision (ECCV), 2022. 3
work page 2022
-
[2]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2, 3
work page 2021
-
[3]
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation.arXiv preprint arXiv:1706.05587,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Knn mat- ting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175–2188, 2013
Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. Knn mat- ting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175–2188, 2013. 2, 3, 7
work page 2013
-
[5]
Jafar: Jack up any feature at any resolution.arXiv preprint arXiv:2506.11136, 2025
Paul Couairon, Lo ¨ıck Chambon, Louis Serrano, Jean- Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. Jafar: Jack up any feature at any resolution.arXiv preprint arXiv:2506.11136, 2025. 3, 4, 5
-
[6]
Learning affinity- aware upsampling for deep image matting
Yutong Dai, Hao Lu, and Chunhua Shen. Learning affinity- aware upsampling for deep image matting. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 6841–6850, 2021. 3
work page 2021
-
[7]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 7
work page 2009
-
[8]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[9]
Perceptually motivated benchmark for video matting
Mikhail Erofeev, Yury Gitman, Dmitriy S Vatolin, Alexey Fedorov, and Jue Wang. Perceptually motivated benchmark for video matting. InBMVC, page 2, 2015. 8
work page 2015
-
[10]
Stephanie Fu, Mark Hamilton, Laura Brandt, Axel Feldman, Zhoutong Zhang, and William T Freeman. Featup: A model- agnostic framework for features at any resolution.arXiv preprint arXiv:2403.10516, 2024. 3
-
[11]
Eduardo S. L. Gastal and Manuel M. Oliveira. Shared sam- pling for real-time alpha matting.Computer Graphics Fo- rum, 29(2):575–584, 2010. 3
work page 2010
-
[12]
A global sampling method for alpha matting
Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun. A global sampling method for alpha matting. InProc. IEEE Conf. on Computer Vision and Pat- tern Recognition (CVPR), pages 2049–2056, 2011. 3
work page 2049
-
[13]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2, 3
work page 2016
-
[14]
End-to-end video matting with trimap propagation
Wei-Lun Huang and Ming-Sui Lee. End-to-end video matting with trimap propagation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14337–14347, 2023. 3
work page 2023
-
[15]
Maggie: Masked guided gradual human in- stance matting
Chuong Huynh, Seoung Wug Oh, Abhinav Shrivastava, and Joon-Young Lee. Maggie: Masked guided gradual human in- stance matting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3870– 3879, 2024
work page 2024
-
[16]
Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Ryn- son W.H. Lau. Modnet: Real-time trimap-free portrait mat- ting via objective decomposition. InProc. AAAI Conf. on Artificial Intelligence (AAAI), 2022. 3, 6, 7
work page 2022
-
[17]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 3
work page 2023
-
[18]
Philip Lee and Ying Wu. Nonlocal matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2193–2200, 2011. 3
work page 2011
-
[19]
Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242,
-
[20]
Jizhizi Li, Jing Zhang, Stephen J. Maybank, and Dacheng Tao. Bridging composite and real: Towards end-to-end deep image matting.arXiv preprint arXiv:2010.16188, 2020. 3, 6, 8
-
[21]
Privacy- preserving portrait matting
Jizhizi Li, Sihan Ma, Jing Zhang, and Dacheng Tao. Privacy- preserving portrait matting. InProceedings of the 29th ACM international conference on multimedia, pages 3501–3509,
-
[22]
Deep automatic natural image matting.arXiv preprint arXiv:2107.07235,
Jizhizi Li, Jing Zhang, and Dacheng Tao. Deep automatic natural image matting.arXiv preprint arXiv:2107.07235,
-
[23]
Vmformer: End-to-end video matting with transformer
Jiachen Li, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Yunchao Wei, and Humphrey Shi. Vmformer: End-to-end video matting with transformer. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6678–6687, 2024. 3
work page 2024
-
[24]
Natural image matting via guided contextual attention
Yaoyi Li and Hongtao Lu. Natural image matting via guided contextual attention. InProc. AAAI Conf. on Artificial Intel- ligence (AAAI), pages 11450–11457, 2020. 3
work page 2020
-
[25]
Exploring plain vision transformer backbones for object de- tection
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object de- tection. InEuropean conference on computer vision, pages 280–296. Springer, 2022. 3
work page 2022
-
[26]
Refinenet: Multi-path refinement networks for high- resolution semantic segmentation
Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high- resolution semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1925–1934, 2017. 3
work page 1925
-
[27]
Seitz, and Ira Kemelmacher- Shlizerman
Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steven M. Seitz, and Ira Kemelmacher- Shlizerman. Real-time high-resolution background matting. arXiv preprint arXiv:2012.07810, 2020. 2, 3, 6, 7, 8
-
[28]
Robust high-resolution video matting with tempo- ral guidance
Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with tempo- ral guidance. InProc. IEEE/CVF Winter Conf. on Applica- tions of Computer Vision (WACV), pages 238–247, 2022. 3, 7 9
work page 2022
-
[29]
In- dices matter: Learning to index for deep image matting
Hao Lu, Yutong Dai, Chunhua Shen, and Songcen Xu. In- dices matter: Learning to index for deep image matting. InProc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), pages 3266–3275, 2019. 3
work page 2019
-
[30]
Achleshwar Luthra, Harsh Sulakhe, Tanish Mittal, Abhishek Iyer, and Santosh Yadav. Eformer: Edge enhancement based transformer for medical image denoising.arXiv preprint arXiv:2109.08044, 2021. 3
-
[31]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e J´egou, Julien Mairal, P...
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Matteformer: Transformer-based image mat- ting via prior-tokens
GyuTae Park, SungJoon Son, JaeYoung Yoo, SeHo Kim, and Nojun Kwak. Matteformer: Transformer-based image mat- ting via prior-tokens. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 11696–11706, 2022. 2, 3
work page 2022
-
[33]
How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022
Namuk Park and Songkuk Kim. How do vision transformers work?arXiv preprint arXiv:2202.06709, 2022. 2, 5
-
[34]
A survey on vir- tual production and the future of compositing technologies
Filipe Pires, Rui Silva, and Rui Raposo. A survey on vir- tual production and the future of compositing technologies. Avanca Cinema Journal, 21(692-9), 2022. 2
work page 2022
-
[35]
Attention-guided hi- erarchical structure aggregation for image matting
Yu Qiao, Yuhao Liu, Xin Yang, Dongsheng Zhou, Mingliang Xu, Qiang Zhang, and Xiaopeng Wei. Attention-guided hi- erarchical structure aggregation for image matting. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 13676–13685, 2020. 6
work page 2020
-
[36]
Vi- sion transformers for dense prediction
Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 3, 5
work page 2021
-
[37]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
High resolution matting via interactive trimap segmentation
Christoph Rhemann, Carsten Rother, Alex Rav-Acha, and Toby Sharp. High resolution matting via interactive trimap segmentation. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. 3
work page 2008
-
[39]
A perceptu- ally motivated online benchmark for image matting
Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and Pamela Rott. A perceptu- ally motivated online benchmark for image matting. In2009 IEEE conference on computer vision and pattern recogni- tion, pages 1826–1833. IEEE, 2009. 8
work page 2009
-
[40]
A spatially varying psf-based prior for al- pha matting
Christoph Rhemann, Carsten Rother, Pushmeet Kohli, and Margrit Gelautz. A spatially varying psf-based prior for al- pha matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2149–2156, 2010. 3
work page 2010
-
[41]
Mark A. Ruzon and Carlo Tomasi. Alpha estimation in nat- ural images. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 18–25, 2000. 3
work page 2000
-
[42]
Seitz, and Ira Kemelmacher-Shlizerman
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Back- ground matting: The world is your green screen. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 2288–2297, 2020. 2, 3
work page 2020
-
[43]
Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 3, 4, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
High-Resolution Representations for Labeling Pixels and Regions
Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions.arXiv preprint arXiv:1904.04514, 2019. 3
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[45]
Lift: A surprisingly simple lightweight feature transform for dense vit descriptors
Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. Lift: A surprisingly simple lightweight feature transform for dense vit descriptors. InEuropean Conference on Computer Vision, pages 110–128. Springer, 2024. 3
work page 2024
-
[46]
The emergence of virtual production–a research agenda.Convergence, 30(5):1557– 1574, 2024
Jon Swords and Nina Willment. The emergence of virtual production–a research agenda.Convergence, 30(5):1557– 1574, 2024. 2
work page 2024
-
[47]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2
work page 2017
-
[48]
Jue Wang and Michael F. Cohen. An iterative optimization approach for unified image segmentation and matting. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), pages 936–943, 2005. 3
work page 2005
-
[49]
Jue Wang and Michael F. Cohen. Optimized color sampling for robust matting. InProc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007. 3
work page 2007
-
[50]
Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Ro- main Br´egier, Yohann Cabon, Vaibhav Arora, Leonid Ants- feld, Boris Chidlovskii, Gabriela Csurka, and J ´erˆome Re- vaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Advances in Neural Information Processing Systems, 35:3502–3516, 2022. 3
work page 2022
-
[51]
Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow
Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Br ´egier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and J ´erˆome Revaud. Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 17969–17...
work page 2023
-
[52]
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers.Advances in neural information processing systems, 34: 12077–12090, 2021. 3
work page 2021
-
[53]
Aggregated residual transformations for deep neural networks
Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500,
-
[54]
Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. InProc. IEEE Conf. on Computer 10 Vision and Pattern Recognition (CVPR), pages 2970–2979,
-
[55]
Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, and Chen Change Loy. Matanyone: Stable video mat- ting with consistent memory propagation.arXiv preprint arXiv:2501.14677, 2025. 2, 3, 6, 7, 8
-
[56]
Jingfeng Yao, Xinggang Wang, Shusheng Yang, and Baoyuan Wang. Vitmatte: Boosting image matting with pre- trained plain vision transformers.Information Fusion, 103: 102091, 2024. 2, 3, 5, 7, 8
work page 2024
-
[57]
Mask guided matting via progressive refinement network
Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, and Alan Yuille. Mask guided matting via progressive refinement network. InProc. IEEE/CVF Conf. on Computer Vision and Pattern Recogni- tion (CVPR), pages 1154–1163, 2021. 2, 3, 7
work page 2021
-
[58]
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 3
work page 2017
-
[59]
Minghao Zhou, Hong Wang, Yefeng Zheng, and Deyu Meng. A refreshed similarity-based upsampler for di- rect high-ratio feature upsampling.arXiv preprint arXiv:2407.02283, 2024. 3 11
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.