
arxiv: 2209.08455 · v1 · pith:XRNJZPGAnew · submitted 2022-09-18 · 💻 cs.CV · cs.RO

TODE-Trans: Transparent Object Depth Estimation with Transformer

keywords: depth, transparent, better, estimation, model, object, objects, gain

Transparent objects are widely used in industrial automation and daily life, yet robust visual recognition and perception of them remain a major challenge. Most commercial-grade depth cameras still struggle to sense the surfaces of transparent objects because of the refraction and reflection of light. In this work, we present a transformer-based approach to transparent object depth estimation from a single RGB-D input. We observe that the global characteristics of the transformer make it easier to extract the contextual information needed for depth estimation in transparent areas. In addition, to enhance fine-grained features, a feature fusion module (FFM) is designed to assist coherent prediction. Our experiments show significant improvements on recent popular datasets, e.g., a 25% gain in RMSE and a 21% gain in REL over previous state-of-the-art convolution-based counterparts on the ClearGrasp dataset. Extensive results show that our transformer-based model better aggregates the object's RGB and inaccurate depth information to obtain a better depth representation. Our code and the pre-trained model will be available at https://github.com/yuchendoudou/TODE.
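For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how RMSE and REL (mean absolute relative error) are commonly computed for depth maps. It is a minimal illustration, not the authors' evaluation code: the function names `depth_errors` and `relative_gain`, the flat-list input format, the validity rule (ground-truth depth greater than zero), and the reading of "gain" as a relative reduction in error are all assumptions made here.

```python
import math


def depth_errors(pred, gt, mask=None):
    """Return (rmse, rel) over pixels that are masked in and have gt > 0.

    pred, gt: flat sequences of depth values in the same unit (e.g. metres).
    mask: optional flat sequence of booleans (hypothetical convention:
          True marks pixels to evaluate, such as the transparent region).
    """
    if mask is None:
        mask = [True] * len(gt)
    # Keep only pixels that are selected and have a valid ground-truth depth.
    pairs = [(p, g) for p, g, m in zip(pred, gt, mask) if m and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    # RMSE: square root of the mean squared depth error.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # REL: mean of the absolute error divided by the ground-truth depth.
    rel = sum(abs(p - g) / g for p, g in pairs) / n
    return rmse, rel


def relative_gain(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (assumed meaning of 'gain')."""
    return (baseline_error - new_error) / baseline_error


# Illustrative values only; these are not results from the paper.
pred = [1.1, 2.0, 0.9, 5.0]
gt = [1.0, 2.0, 1.0, 0.0]  # the last pixel has no valid ground truth
rmse, rel = depth_errors(pred, gt)
```

Both metrics are lower-is-better, so under the assumption above a hypothetical baseline error of 0.04 reduced to 0.03 would count as a 25% gain.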

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Trans2Occ: Voxel Occupancy Estimation and Grasp for Transparent Objects from Simulation to Reality

    cs.RO 2026-06 unverdicted novelty 4.0

    A simulation-trained model predicts voxel occupancy from single RGB views for transparent object grasping and transfers to real robotic setups without fine-tuning.