arxiv: 1907.11544 · v1 · pith:F5IPKHFNnew · submitted 2019-07-25 · 💻 cs.CV · eess.IV

Learning Transparent Object Matting

Pith reviewed 2026-05-24 16:07 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords transparent object matting · refractive flow · deep learning · synthetic dataset · image matting · encoder-decoder network · attenuation mask · TOM-Net

The pith

Transparent object matting reduces to estimating refractive flow, which a deep network can recover from one image in a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the slow, multi-shot capture methods used for matting transparent objects by recasting the task as refractive flow estimation. It introduces TOM-Net, a network with a multi-scale encoder-decoder stage for coarse output followed by a residual refinement stage. The network accepts one photograph and directly produces the three components of the matte: an object mask, an attenuation mask, and the refractive flow field. Training relies on a new collection of 178K synthetic images created by rendering transparent objects over COCO backgrounds, supplemented by 876 real captured samples for evaluation. The same architecture is shown to accept an optional trimap or background image when those are supplied.

Core claim

The authors formulate transparent object matting as a refractive flow estimation problem and propose TOM-Net, a deep learning framework comprising a multi-scale encoder-decoder network for coarse prediction and a residual network for refinement. At test time the model ingests a single image and outputs an object mask, an attenuation mask, and a refractive flow field in one feed-forward pass. A synthetic dataset of 178K images is generated by placing rendered transparent objects in front of Microsoft COCO backgrounds, and a real dataset of 876 samples is captured with 14 objects and 60 backgrounds. The framework is also shown to accept a trimap or background image as additional input.

What carries the argument

TOM-Net, the two-part network (multi-scale encoder-decoder plus residual refinement) that learns to predict refractive flow for transparent objects.

If this is right

  • A single photograph suffices to produce the full matte without any additional capture steps.
  • The output matte consists of an object mask, attenuation mask, and refractive flow field.
  • The same network architecture accepts an optional trimap or background image when those are supplied.
  • Processing occurs in one fast feed-forward pass rather than iterative optimization.
  • Results are demonstrated on both the 178K-image synthetic set and the 876-sample real set.
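How the three matte components could be used for compositing can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a compositing model in which each pixel inside the object shows an attenuated copy of the background sampled at a flow-displaced location, and each pixel outside the object shows the background unchanged. The function name `composite_with_matte`, the (dx, dy) pixel-offset convention for the flow, and the bilinear sampling are all assumptions made for the example.

```python
import numpy as np

def composite_with_matte(background, mask, attenuation, flow):
    """Composite a transparent object over `background` from a three-part matte.

    Assumed conventions (hypothetical, chosen for this sketch):
      background:  (H, W, 3) float array
      mask:        (H, W) array, 1 inside the object, 0 outside
      attenuation: (H, W) float array, fraction of light transmitted
      flow:        (H, W, 2) per-pixel (dx, dy) offset into the background
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Background location that each pixel "sees" through the object,
    # clamped to the image bounds.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)

    # Bilinear sampling of the background at (sx, sy).
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]
    top = (1 - wx) * background[y0, x0] + wx * background[y0, x1]
    bottom = (1 - wx) * background[y1, x0] + wx * background[y1, x1]
    refracted = (1 - wy) * top + wy * bottom

    # Outside the object: plain background.
    # Inside the object: attenuated, refracted background.
    m = np.asarray(mask, dtype=np.float64)[..., None]
    rho = np.asarray(attenuation, dtype=np.float64)[..., None]
    return (1 - m) * background + m * rho * refracted
```

With zero flow and full transmission the function returns the background unchanged, which is a quick sanity check on any flow convention substituted for the one assumed here.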

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the refractive-flow formulation holds, the same network structure could be tested on other refractive effects such as caustics or underwater distortion without changing the output representation.
  • The single-image design opens the possibility of applying the matte estimation to video frames by adding a temporal smoothness term between consecutive predictions.
  • Because the method separates the flow field from the mask, downstream tasks such as compositing or relighting could use the flow component independently.
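The temporal smoothness idea in the second bullet could take many forms. The sketch below shows one of the simplest, a mean squared difference between the flow fields predicted for two consecutive frames. It is purely illustrative: the function name `temporal_smoothness` is hypothetical, and the penalty ignores object and camera motion, which a practical video method would need to compensate for.

```python
import numpy as np

def temporal_smoothness(flow_prev, flow_next):
    """Mean squared change between two (H, W, 2) refractive flow fields.

    Hypothetical penalty for illustration: it compares the fields pixel by
    pixel and does not account for motion between the two frames.
    """
    prev = np.asarray(flow_prev, dtype=np.float64)
    nxt = np.asarray(flow_next, dtype=np.float64)
    diff = nxt - prev
    # Squared length of the per-pixel change vector, averaged over the image.
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```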

Load-bearing premise

That a large collection of synthetic renderings of transparent objects over COCO backgrounds will let the network generalize to the lighting, reflections, and distortions present in real photographs.

What would settle it

Measure the accuracy of predicted masks, attenuation values, and flow fields when the trained network is applied to new real photographs of the same objects against previously unseen backgrounds, compared against manual ground truth.
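One way to score such a test is with the three metrics named in the referee report: mask IoU, attenuation MSE, and flow endpoint error. The sketch below uses the standard definitions of those metrics; the function names and the optional `region` argument (restricting the average to, for example, the ground-truth object area) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def attenuation_mse(pred, gt, region=None):
    """Mean squared error between attenuation maps, optionally within `region`."""
    err = (np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)) ** 2
    if region is not None:
        err = err[np.asarray(region, dtype=bool)]
    return float(err.mean())

def flow_epe(pred, gt, region=None):
    """Average endpoint error between two (H, W, 2) flow fields."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)
    err = np.linalg.norm(diff, axis=-1)  # per-pixel Euclidean distance
    if region is not None:
        err = err[np.asarray(region, dtype=bool)]
    return float(err.mean())
```

Restricting the attenuation and flow errors to the object region avoids diluting them with background pixels, where both quantities are trivially correct.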

Original abstract

This paper addresses the problem of image matting for transparent objects. Existing approaches often require tedious capturing procedures and long processing time, which limit their practical use. In this paper, we formulate transparent object matting as a refractive flow estimation problem, and propose a deep learning framework, called TOM-Net, for learning the refractive flow. Our framework comprises two parts, namely a multi-scale encoder-decoder network for producing a coarse prediction, and a residual network for refinement. At test time, TOM-Net takes a single image as input, and outputs a matte (consisting of an object mask, an attenuation mask and a refractive flow field) in a fast feed-forward pass. As no off-the-shelf dataset is available for transparent object matting, we create a large-scale synthetic dataset consisting of $178K$ images of transparent objects rendered in front of images sampled from the Microsoft COCO dataset. We also capture a real dataset consisting of $876$ samples using $14$ transparent objects and $60$ background images. Besides, we show that our method can be easily extended to handle the cases where a trimap or a background image is available. Promising experimental results have been achieved on both synthetic and real data, which clearly demonstrate the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TOM-Net, a deep learning framework for transparent object matting formulated as refractive flow estimation. It consists of a multi-scale encoder-decoder network for coarse prediction followed by a residual refinement network. At test time the model takes a single RGB image and outputs a matte comprising an object mask, an attenuation mask, and a refractive flow field. A new synthetic dataset of 178K rendered images (transparent objects composited over COCO backgrounds) is introduced for training; a real dataset of 876 captures from 14 objects and 60 backgrounds is collected for evaluation. The method is also shown to accept optional trimap or background-image inputs. Promising results are reported on both synthetic and real data.

Significance. If the central claim of reliable single-image generalization holds, the work would be a meaningful advance in computer vision by replacing multi-view or controlled-capture pipelines with a fast feed-forward network for a notoriously difficult material class. The release of both a large synthetic training set and a real evaluation set constitutes a concrete resource contribution. The refractive-flow formulation is a clean conceptual step. These strengths are offset by the need for stronger evidence that the synthetic renderer reproduces the joint statistics of refraction, caustics, and Fresnel effects that appear in real photographs.

major comments (2)
  1. [Abstract / Experiments] The central claim that a network trained exclusively on the 178K synthetic renderings produces usable mattes on the 876 real captures is load-bearing, yet the manuscript reports only that “promising experimental results have been achieved on … real data” without quantitative metrics (e.g., mask IoU, attenuation MSE, flow endpoint error), cross-domain ablation tables, or error analysis that would allow assessment of the synthetic-to-real gap.
  2. [Dataset Creation] The description of the synthetic renderer does not specify whether Fresnel reflectance, wavelength-dependent refraction, or sensor noise are modeled; if these phenomena are omitted or approximated, the learned mapping cannot be expected to transfer, directly undermining the generalization result that the paper asserts.
minor comments (2)
  1. [Abstract] The abstract states that the method “can be easily extended” to trimap or background inputs but provides no architectural diagram or loss-function description for these modes.
  2. [Method] Notation for the three output channels (mask, attenuation, flow) is introduced without an explicit equation relating them to the final compositing formula.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to provide stronger evidence on generalization.

point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim that a network trained exclusively on the 178K synthetic renderings produces usable mattes on the 876 real captures is load-bearing, yet the manuscript reports only that “promising experimental results have been achieved on … real data” without quantitative metrics (e.g., mask IoU, attenuation MSE, flow endpoint error), cross-domain ablation tables, or error analysis that would allow assessment of the synthetic-to-real gap.

    Authors: We agree that quantitative metrics on real data are essential to support the generalization claim. In the revised manuscript we have added quantitative results on the 876 real captures (mask IoU, attenuation MSE, refractive flow endpoint error), cross-domain ablation tables, and error analysis comparing synthetic and real performance. These additions appear in the Experiments section and strengthen the central claim. revision: yes

  2. Referee: [Dataset Creation] The description of the synthetic renderer does not specify whether Fresnel reflectance, wavelength-dependent refraction, or sensor noise are modeled; if these phenomena are omitted or approximated, the learned mapping cannot be expected to transfer, directly undermining the generalization result that the paper asserts.

    Authors: We have expanded the Dataset Creation section with a detailed description of the renderer. Fresnel reflectance and refraction are modeled using physically based techniques; wavelength-dependent refraction is approximated through RGB channels, and sensor noise is not explicitly simulated. We now explicitly discuss these modeling choices and their implications for synthetic-to-real transfer as a limitation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a standard supervised deep learning pipeline: a multi-scale encoder-decoder plus residual refinement network is trained on an independently generated synthetic dataset (178K renderings over COCO backgrounds) and evaluated on held-out synthetic images plus a separately captured real dataset (876 samples). The formulation of matting as refractive flow estimation is an explicit modeling choice, not a self-definition. No equations reduce a claimed prediction to a fitted parameter by construction, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. The central empirical claim (feed-forward inference on real photos) is falsifiable against external real captures and does not collapse to the training inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that refractive flow adequately models transparency and that the synthetic dataset distribution matches real conditions sufficiently for generalization.

free parameters (1)
  • Network weights and hyperparameters
    The encoder-decoder and residual network parameters are learned from the synthetic training data.
axioms (1)
  • domain assumption: Transparent object matting can be formulated as a refractive flow estimation problem
    Stated directly in the abstract as the problem formulation.


