Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

Adem Karakurt; André Sers; Jörg Krüger; Mohamad Zaher Ziadeh; Paul Hofmann; Paul Koch; Vivek Chavan

arxiv: 2606.20272 · v1 · pith:EISZKCSBnew · submitted 2026-06-18 · 💻 cs.RO · cs.CV

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

Pith reviewed 2026-06-26 16:54 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: domain gap, synthetic data, cognitive robotics, computer vision, training data generation, pose estimation, grasping, simulation to real

The pith

Linking real scenes to synthetic data bridges domain gaps in robotic vision training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines limits in precision and scalability for AI vision models used in cognitive robotics, attributing them to domain gaps between simulations and reality. It reviews state-of-the-art approaches in semantic analysis, 6D pose estimation, and grasping that rely on large training datasets. The authors describe their work in progress that generates training data by explicitly linking real scenes with synthetic data to close those gaps. This linkage is presented as a route to more effective synergy between data generation and AI architectures for both industrial and household uses. Readers would care because successful linkage could allow vision models to move beyond current performance ceilings without requiring entirely new data collection at scale.

Core claim

The domain gap between simulation and real-world data limits the precision and scalability of AI models for tasks such as 6D pose estimation and grasping; linking real scenes directly with synthetic data generation during training data creation provides a practical way to bridge that gap.

What carries the argument

The linking mechanism that combines real scenes with synthetic data generation to produce training datasets.

If this is right

AI architectures can reach higher precision in 6D pose estimation and grasping when trained on linked data.
Training data generation can scale more efficiently for both industrial and household robotics scenarios.
Synergies between data-generation methods and model architectures become usable to address current limits.
Domain-gap problems in semantic environment analysis can be reduced without collecting exhaustive real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linking approach might extend to other robotics perception tasks such as navigation or object manipulation.
Implementation details of the linking step would need to be tested to confirm they work at scale.
Related simulation-reality transfer problems in non-robotics computer vision could benefit from similar linkage methods.

Load-bearing premise

That connecting real scenes to synthetic data will be enough to overcome domain gaps even though no specific linking technique or performance result is shown.

What would settle it

Train the same pose-estimation model on three datasets (purely real, purely synthetic, and linked real-synthetic) and compare accuracy and robustness on held-out real scenes. If the linked version shows no measurable gain over the other two, the premise fails.
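The scoring step of such a comparison can be sketched as follows. This is a minimal example that assumes the ADD measure (mean distance between model points under the ground-truth and the estimated pose) with the commonly used 10%-of-diameter threshold; the paper does not name a metric, and the function names and the toy cube model are hypothetical.

```python
import math

def transform(points, R, t):
    """Apply a 3x3 rotation matrix R (nested lists) and a translation t to 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_error(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean Euclidean distance between model points under the two poses."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)

def add_recall(errors, diameter, fraction=0.1):
    """Share of test poses whose ADD error is below fraction * object diameter."""
    threshold = fraction * diameter
    return sum(e < threshold for e in errors) / len(errors)

# Toy object model (assumption): the eight corners of a unit cube.
MODEL = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
DIAMETER = math.sqrt(3.0)  # largest distance between two cube corners

if __name__ == "__main__":
    # An estimate that is off by 0.1 along x, with the correct rotation.
    err = add_error(MODEL, IDENTITY, (0.0, 0.0, 0.0), IDENTITY, (0.1, 0.0, 0.0))
    print(f"ADD error: {err:.3f}")
```

Running `add_recall` on the per-image errors of each of the three trained models, over the same held-out real scenes, would give one number per training regime to compare.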

Figures

Figures reproduced from arXiv: 2606.20272 by Adem Karakurt, André Sers, Jörg Krüger, Mohamad Zaher Ziadeh, Paul Hofmann, Paul Koch, Vivek Chavan.

**Figure 1.** Linking real robot workspace scenes with simulations: In a continuous loop we propose to scan real scenes (1) and transform them into simulations (2). Here we can conduct many experiments, find grasping candidates, train control policies and annotate training data (3). Eventually, we can train further AI methods to help transform the annotations and control policies back into the real scenery (4).…

**Figure 2.** With NeRFs [54] we can create good-looking 3D representations (with some shading issues), but we can't yet export high-quality masked 3D metric assets and textures. (2) Generate Simulations: From the 3D assets we can now build novel simulated scenarios with randomly placed objects. However, the scene generation needs to incorporate prior knowledge about the possible constellations of things, such that e.g.…
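One simple kind of placement prior is that objects rest on the table and do not interpenetrate. The sketch below illustrates random placement under that constraint. It assumes circular object footprints on a rectangular table and uses plain rejection sampling; it is not the authors' scene generator, and `place_objects` and its parameters are hypothetical.

```python
import random

def place_objects(radii, table_w, table_h, max_tries=1000, seed=None):
    """Sample non-overlapping (x, y, r) positions on a table_w x table_h table.

    Each object is approximated by a circle of the given radius (assumption).
    """
    rng = random.Random(seed)
    placed = []
    for r in radii:
        for _ in range(max_tries):
            # Keep the whole footprint on the table surface.
            x = rng.uniform(r, table_w - r)
            y = rng.uniform(r, table_h - r)
            # Accept only if the circle does not overlap any placed object.
            if all((x - px) ** 2 + (y - py) ** 2 >= (r + pr) ** 2
                   for px, py, pr in placed):
                placed.append((x, y, r))
                break
        else:
            raise RuntimeError("could not place object without overlap")
    return placed

if __name__ == "__main__":
    for x, y, r in place_objects([0.05, 0.08, 0.10], 1.0, 0.6, seed=0):
        print(f"object radius {r:.2f} at ({x:.3f}, {y:.3f})")
```

Richer priors (stacking, objects inside containers, typical neighbourhoods) would replace the acceptance test with a more specific rule.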

**Figure 3.** 3D Assets with Textures from Nvdiffrec […

**Figure 4.** Image augmentation with Perfusion [57]. First experiments for natural-language-based editing of the visual appearance of assets (before or after simulation). (4) Train AI-Helpers: Rather than training AI models to solve a given downstream task from the simulation data alone, it has been found to be very beneficial to incorporate real data alongside the synthetic data, in order to close the sim2real…
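Mixing real and synthetic samples during training can be sketched as a batch sampler. This is a minimal illustration that assumes a fixed share of real samples per batch and sampling with replacement; the 25% default, the function name, and the data layout are assumptions, not values from the paper.

```python
import random

def mixed_batches(real, synthetic, batch_size, real_fraction=0.25, seed=None):
    """Yield batches with a fixed number of real samples, the rest synthetic."""
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_syn = batch_size - n_real
    while True:
        # Sample with replacement so a small real set can be reused freely.
        batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_syn)
        rng.shuffle(batch)
        yield batch

if __name__ == "__main__":
    real = [("real", i) for i in range(5)]
    synthetic = [("syn", i) for i in range(100)]
    batch = next(mixed_batches(real, synthetic, batch_size=8, seed=1))
    print(batch)
```

Keeping the real share explicit makes it easy to sweep it as a hyperparameter when checking how much real data is needed.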

read the original abstract

AI vision models are a driving factor for the potential use case scenarios of cognitive robotics within industry and household applications. A large array of methods, from semantic environment analysis to 6D and grasping pose estimation, has been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods w.r.t. training data and AI architectures, which are capable in synergy of tackling current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that are challenging them. Further, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking those in the training data generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a short discussion note on sim-to-real challenges in robotics vision with no new methods, results, or evidence presented.

The main point to take away is that this paper does not deliver a technical contribution. It summarizes known limits in AI vision for cognitive robotics and states that the authors are working on linking real scenes to synthetic data generation, but it stops there.

What the paper does reasonably is to connect the dots between training data needs, domain gaps, and practical uses in industry and household robotics. It notes that methods for semantic analysis, pose estimation, and grasping all run into precision and scalability problems when moving from simulation to reality, and it flags the need for better synergy between data and model design.

The soft spots are substantial and central. No mechanism for the proposed linking is described, no experiments or datasets are shown, and no comparison to existing domain adaptation techniques appears. The text remains at the level of identifying the problem and expressing intent. This makes it impossible to judge whether their approach has any chance of addressing the issues they raise.

The paper is aimed at readers who want a brief status report on these challenges rather than solutions. Specialists already familiar with sim-to-real work will find nothing new, while newcomers might get a high-level sense of the landscape but no actionable details.

Because there are no claims, derivations, or findings that can be evaluated, this does not merit sending out for serious peer review. It might fit as a workshop position piece or extended abstract, but not in a standard research track.

Referee Report

1 major / 0 minor

Summary. The manuscript discusses current limits and trends in state-of-the-art AI vision methods for cognitive robotics applications, such as semantic analysis, 6D pose estimation, and grasping. It highlights challenges in training data, AI architectures, precision, scalability, and domain gaps between simulation and reality. The paper further describes the authors' ongoing work-in-progress on bridging these domain gaps by linking real scenes with synthetic data generation during training data creation.

Significance. The general topic of domain gap reduction via mixed real-synthetic training data is relevant to scalable robotic vision systems. However, because the manuscript contains no specific methods, algorithms, datasets, experiments, or quantitative results, its potential significance cannot be assessed beyond a high-level overview of challenges and intent. No machine-checked proofs, reproducible code, or falsifiable predictions are provided.

major comments (1)

[Abstract] The manuscript positions itself as a discussion of ongoing work without presenting any concrete mechanism, equation, algorithm, or preliminary validation for 'linking real scenes with synthetic data generation.' This absence means the central claim of addressing domain gaps cannot be evaluated for correctness or novelty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The manuscript is positioned as a discussion of challenges and work-in-progress rather than a complete technical contribution with algorithms or results. We address the major comment below.

Referee: [Abstract] The manuscript positions itself as a discussion of ongoing work without presenting any concrete mechanism, equation, algorithm, or preliminary validation for 'linking real scenes with synthetic data generation.' This absence means the central claim of addressing domain gaps cannot be evaluated for correctness or novelty.

Authors: We agree that the paper presents no concrete mechanisms, equations, algorithms, datasets, or validation results. It is explicitly a discussion paper reviewing limits in AI vision for robotics (semantic analysis, 6D pose estimation, grasping) and describing ongoing work-in-progress on linking real scenes with synthetic data generation to address domain gaps. The abstract and introduction state this scope directly. No claim is made to a novel evaluated method; the contribution is the overview of trends and the high-level intent of the linking approach. Such discussion papers can usefully frame open problems even without quantitative results.

revision: no

Circularity Check

0 steps flagged

No significant circularity; purely descriptive discussion with no derivations or fitted claims

The manuscript is explicitly positioned as a discussion of state-of-the-art limits plus ongoing work-in-progress on linking real scenes to synthetic data generation. No concrete mechanism, algorithm, equation, dataset, or result is asserted as solved or demonstrated. Consequently there are no load-bearing technical assumptions, predictions, self-citations, or derivations whose failure would falsify a central claim, and no steps reduce to inputs by construction. The text contains no mathematical content at all.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the document is a high-level discussion based on the abstract.

pith-pipeline@v0.9.1-grok · 5678 in / 942 out tokens · 20107 ms · 2026-06-26T16:54:54.603663+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 18 linked inside Pith

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds.), vol. 25, Curran Associates, Inc., 2012.
[2] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.
[3] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” CoRR, vol. abs/1506.02640, 2015.
[4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” CoRR, vol. abs/2005.12872, 2020.
[5] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: deformable transformers for end-to-end object detection,” CoRR, vol. abs/2010.04159, 2020.
[6] Z. Zong, G. Song, and Y. Liu, “Detrs with collaborative hybrid assignments training,” 2023.
[7] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” CoRR, vol. abs/2112.01527,
[8] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” CoRR, vol. abs/1711.00199, 2017.
[9] M. Rünz and L. Agapito, “Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects,” CoRR, vol. abs/1804.09194, 2018.
[10] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” CoRR, vol. abs/1901.04780, 2019.
[11] Y. Su, M. Saleh, T. Fetzer, J. Rambach, N. Navab, B. Busam, D. Stricker, and F. Tombari, “Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation,” 2022.
[12] H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y. Xie, and C. Lu, “Anygrasp: Robust and efficient grasp perception in spatial and temporal domains,” IEEE Transactions on Robotics (T-RO), 2023.
[13] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without superv...
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” CoRR, vol. abs/2103.00020,
[15] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. ...
[16] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Computer Vision – ACCV 2012 (K. M. Lee, Y. Matsushita, J. M. Rehg, and Z. Hu, eds.), (Berlin, Heidelberg), pp. 548–562, Springer Berlin Heidelberg, 2013.
[17] H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: A large-scale benchmark for general object grasping,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11444–11453, 2020.
[18] H. Cao, H.-S. Fang, W. Liu, and C. Lu, “Suctionnet-1billion: A large-scale benchmark for suction grasping,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8718–8725, 2021.
[19] M. Gou, H. Pan, H.-S. Fang, Z. Liu, C. Lu, and P. Tan, “Unseen object 6d pose estimation: A benchmark and baselines,” arXiv preprint, 2022.
[20] H. Fang, H.-S. Fang, S. Xu, and C. Lu, “Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline,” IEEE Robotics and Automation Letters, pp. 1–8, 2022.
[21] X. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl, and D. Fox, “Self-supervised 6d object pose estimation for robot manipulation,” CoRR, vol. abs/1909.10159, 2019.
[22] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” CoRR, vol. abs/1703.09312, 2017.
[23] A. Mousavian, C. Eppner, and D. Fox, “6-dof graspnet: Variational grasp generation for object manipulation,” CoRR, vol. abs/1905.10520, 2019.
[24] A. Depierre, E. Dellandréa, and L. Chen, “Jacquard: A large scale dataset for robotic grasp detection,” CoRR, vol. abs/1803.11469, 2018.
[25] X. Yan, J. Hsu, M. Khansari, Y. Bai, A. Pathak, A. Gupta, J. Davidson, and H. Lee, “Learning 6-dof grasping interaction via deep geometry-aware 3d representations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3766–3773, 2018.
[26] P. Koch, M. Schlüter, and J. Krüger, “With synthetic data towards part recognition generalized beyond the training instances,” AIP Conference Proceedings, vol. 2989, p. 020007, 01 2024.
[27] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” CoRR, vol. abs/1509.06825, 2015.
[28] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” CoRR, vol. abs/1603.02199, 2016.
[29] P. Koch, M. Schlüter, S. Thill, and J. Krüger, “Towards robot-assisted data generation with minimal user interaction for autonomously training 6d pose estimation in operational environments,” Procedia CIRP, vol. 120, pp. 249–254, 2023. 56th CIRP International Conference on Manufacturing Systems 2023.
[30] N. Jakobi, P. Husbands, and I. Harvey, “Noise and the reality gap: The use of simulation in evolutionary robotics,” vol. 929, pp. 704–720, 01 1995.
[31] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,”
[32] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” 2023.
[33] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” CoRR, vol. abs/1910.10897, 2019.
[34] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” CoRR, vol. abs/2001.06782, 2020.
[35] S. Sodhani, A. Zhang, and J. Pineau, “Multi-task reinforcement learning with context-based representations,” 2021.
[36] J. Hejna, R. Rafailov, H. Sikchi, C. Finn, S. Niekum, W. B. Knox, and D. Sadigh, “Contrastive preference learning: Learning from human feedback without rl,” 2023.
[37] C. Wang, H.-S. Fang, M. Gou, H. Fang, J. Gao, and C. Lu, “Graspness discovery in clutters for fast and accurate grasp detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15964–15973, October
[38] J. Liu, R. Zhang, H.-S. Fang, M. Gou, H. Fang, C. Wang, S. Xu, H. Yan, and C. Lu, “Target-referenced reactive grasping for dynamic objects,” pp. 8824–8833, June 2023.
[39] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, vol. 4, no. 26, p. eaau4984, 2019.
[40] Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from rgbd images: Learning using a new rectangle representation,” in 2011 IEEE International Conference on Robotics and Automation, pp. 3304–3311, 2011.
[41] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
[42] H. Zhang, X. Lan, X. Zhou, and N. Zheng, “Roi-based robotic grasp detection in object overlapping scenes using convolutional neural network,” CoRR, vol. abs/1808.10313, 2018.
[43] P. Ardón, È. Pairet, R. P. A. Petrick, S. Ramamoorthy, and K. S. Lohan, “Learning grasp affordance reasoning through semantic relations,” CoRR, vol. abs/1906.09836, 2019.
[44] L. Bamford, N. Klassen, and J. Karl, “Faster recognition of graspable targets defined by orientation in a visual search task,” Experimental Brain Research, vol. 238, 04 2020.
[45] I. Huang, Y. Narang, C. Eppner, B. Sundaralingam, M. Macklin, R. Bajcsy, T. Hermans, and D. Fox, “Defgraspsim: Physics-based simulation of grasp outcomes for 3d deformable objects,” IEEE Robotics and Automation Letters, vol. 7, pp. 6274–6281, July 2022.
[46] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[47] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014.
[48] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” CoRR, vol. abs/1604.01685, 2016.
[49] R. Kaskman, S. Zakharov, I. Shugurov, and S. Ilic, “Homebreweddb: RGB-D dataset for 6d pose estimation of 3d objects,” CoRR, vol. abs/1904.03167, 2019.
[50] M. Sundermeyer, T. Hodan, Y. Labbe, G. Wang, E. Brachmann, B. Drost, C. Rother, and J. Matas, “Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects,” 2023.
[51] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” CoRR, vol. abs/1703.06907, 2017.
[52] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” CoRR, vol. abs/1508.06576, 2015.
[53] Y. Zhang, F. Tang, W. Dong, H. Huang, C. Ma, T.-Y. Lee, and C. Xu, “Domain enhanced arbitrary image style transfer via contrastive learning,” in Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings, SIGGRAPH ’22, ACM, Aug. 2022.
[54] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” CoRR, vol. abs/2201.05989, 2022.
[55] J. Munkberg, J. Hasselgren, T. Shen, J. Gao, W. Chen, A. Evans, T. Müller, and S. Fidler, “Extracting triangular 3d models, materials, and lighting from images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8280–8290, June 2022.
[56] J. Hasselgren, N. Hofmann, and J. Munkberg, “Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising,” arXiv:2206.03380, 2022.
[57] Y. Tewel, R. Gal, G. Chechik, and Y. Atzmon, “Key-locked rank one editing for text-to-image personalization,” 2023.
[58] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” 2023.

[1] [1]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” inAdvances in Neural Information Processing Sys- tems(F. Pereira, C. Burges, L. Bottou, and K. Weinberger, eds.), vol. 25, Curran Associates, Inc., 2012. 1

2012

[2] [2]

Faster R-CNN: towards real-time object detection with region proposal networks,

S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,”CoRR, vol. abs/1506.01497, 2015. 1

Pith/arXiv arXiv 2015

[3] [3]

You only look once: Unified, real-time object detection,

J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,”CoRR, vol. abs/1506.02640, 2015. 1

Pith/arXiv arXiv 2015

[4] [4]

End- to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End- to-end object detection with transformers,”CoRR, vol. abs/2005.12872, 2020. 1

arXiv 2005

[5] [5]

Deformable DETR: deformable transformers for end-to-end object detection,

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: deformable transformers for end-to-end object detection,”CoRR, vol. abs/2010.04159, 2020. 1

Pith/arXiv arXiv 2010

[6] [6]

Detrs with collaborative hybrid assignments train- ing,

Z. Zong, G. Song, and Y. Liu, “Detrs with collaborative hybrid assignments train- ing,” 2023. 1

2023

[7] [7]

Masked-attention mask transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,”CoRR, vol. abs/2112.01527,

arXiv

[8] [8]

Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,

Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,”CoRR, vol. abs/1711.00199, 2017. 1, 2, 3

Pith/arXiv arXiv 2017

[9] [9]

Maskfusion: Real-time recognition, tracking and re- construction of multiple moving objects,

M. R¨ unz and L. Agapito, “Maskfusion: Real-time recognition, tracking and re- construction of multiple moving objects,”CoRR, vol. abs/1804.09194, 2018. 1, 2 8 Koch et al

Pith/arXiv arXiv 2018

[10] [10]

Densefusion: 6d object pose estimation by iterative dense fusion,

C. Wang, D. Xu, Y. Zhu, R. Mart´ ın-Mart´ ın, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,”CoRR, vol. abs/1901.04780, 2019. 1, 2

Pith/arXiv arXiv 1901

[11] [11]

Zebrapose: Coarse to fine surface encoding for 6dof object pose esti- mation,

Y. Su, M. Saleh, T. Fetzer, J. Rambach, N. Navab, B. Busam, D. Stricker, and F. Tombari, “Zebrapose: Coarse to fine surface encoding for 6dof object pose esti- mation,” 2022. 1, 2

2022

[12] [12]

Anygrasp: Robust and efficient grasp perception in spatial and temporal domains,

H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y. Xie, and C. Lu, “Anygrasp: Robust and efficient grasp perception in spatial and temporal domains,”IEEE Transactions on Robotics (T-RO), 2023. 1, 2

2023

[13] [13]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without superv...

2023

[14] [14]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” CoRR, vol. abs/2103.00020,

Pith/arXiv arXiv

[15] [15]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. ...

Pith/arXiv arXiv 2005

[16] [16]

Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,

S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Computer Vision – ACCV 2012 (K. M. Lee, Y. Matsushita, J. M. Rehg, and Z. Hu, eds.), (Berlin, Heidelberg), pp. 548–562, Springer Berlin Heidelberg, 2013. 2, 3

2012

[17] [17]

Graspnet-1billion: A large-scale benchmark for general object grasping,

H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: A large-scale benchmark for general object grasping,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11444–11453, 2020. 2, 3, 4

2020

[18] [18]

Suctionnet-1billion: A large-scale benchmark for suction grasping,

H. Cao, H.-S. Fang, W. Liu, and C. Lu, “Suctionnet-1billion: A large-scale benchmark for suction grasping,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8718–8725, 2021. 2

2021

[19] [19]

Unseen object 6d pose estimation: A benchmark and baselines,

M. Gou, H. Pan, H.-S. Fang, Z. Liu, C. Lu, and P. Tan, “Unseen object 6d pose estimation: A benchmark and baselines,” arXiv preprint, 2022. 2

2022

[20] [20]

Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline,

H. Fang, H.-S. Fang, S. Xu, and C. Lu, “Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline,” IEEE Robotics and Automation Letters, pp. 1–8, 2022. 2

2022

[21] [21]

Self-supervised 6d object pose estimation for robot manipulation,

X. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl, and D. Fox, “Self-supervised 6d object pose estimation for robot manipulation,” CoRR, vol. abs/1909.10159, 2019. 2, 3, 6

arXiv 1909

[22] [22]

Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,

J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” CoRR, vol. abs/1703.09312, 2017. 2

Pith/arXiv arXiv 2017

[23] [23]

6-dof graspnet: Variational grasp generation for object manipulation,

A. Mousavian, C. Eppner, and D. Fox, “6-dof graspnet: Variational grasp generation for object manipulation,” CoRR, vol. abs/1905.10520, 2019. 2

arXiv 1905

[24] [24]

Jacquard: A large scale dataset for robotic grasp detection,

A. Depierre, E. Dellandréa, and L. Chen, “Jacquard: A large scale dataset for robotic grasp detection,” CoRR, vol. abs/1803.11469, 2018. 2

Pith/arXiv arXiv 2018

[25] [25]

Learning 6-dof grasping interaction via deep geometry-aware 3d representations,

X. Yan, J. Hsu, M. Khansari, Y. Bai, A. Pathak, A. Gupta, J. Davidson, and H. Lee, “Learning 6-dof grasping interaction via deep geometry-aware 3d representations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3766–3773, 2018. 2

2018

[26] [26]

With synthetic data towards part recognition generalized beyond the training instances,

P. Koch, M. Schlüter, and J. Krüger, “With synthetic data towards part recognition generalized beyond the training instances,” AIP Conference Proceedings, vol. 2989, p. 020007, 01 2024. 2, 3, 6

2024

[27] [27]

Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,

L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” CoRR, vol. abs/1509.06825, 2015. 2, 3

Pith/arXiv arXiv 2015

[28] [28]

Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,

S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” CoRR, vol. abs/1603.02199, 2016. 2

Pith/arXiv arXiv 2016

[29] [29]

Towards robot-assisted data generation with minimal user interaction for autonomously training 6d pose estimation in operational environments,

P. Koch, M. Schlüter, S. Thill, and J. Krüger, “Towards robot-assisted data generation with minimal user interaction for autonomously training 6d pose estimation in operational environments,” Procedia CIRP, vol. 120, pp. 249–254, 2023. 56th CIRP International Conference on Manufacturing Systems 2023. 2, 3

2023

[30] [30]

Noise and the reality gap: The use of simulation in evolutionary robotics,

N. Jakobi, P. Husbands, and I. Harvey, “Noise and the reality gap: The use of simulation in evolutionary robotics,” vol. 929, pp. 704–720, 01 1995. 2, 3

1995

[31] [31]

Palm-e: An embodied multimodal language model,

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,”

[32] [32]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” 2023. 2, 3, 4

2023

[33] [33]

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,

T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” CoRR, vol. abs/1910.10897, 2019. 2, 4

arXiv 1910

[34] [34]

Gradient surgery for multi-task learning,

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” CoRR, vol. abs/2001.06782, 2020. 2, 4

arXiv 2001

[35] [35]

Multi-task reinforcement learning with context-based representations,

S. Sodhani, A. Zhang, and J. Pineau, “Multi-task reinforcement learning with context-based representations,” 2021. 2, 4

2021

[36] [36]

Contrastive preference learning: Learning from human feedback without rl,

J. Hejna, R. Rafailov, H. Sikchi, C. Finn, S. Niekum, W. B. Knox, and D. Sadigh, “Contrastive preference learning: Learning from human feedback without rl,” 2023. 2, 4

2023

[37] [37]

Graspness discovery in clutters for fast and accurate grasp detection,

C. Wang, H.-S. Fang, M. Gou, H. Fang, J. Gao, and C. Lu, “Graspness discovery in clutters for fast and accurate grasp detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15964–15973, October

[38] [38]

Target-referenced reactive grasping for dynamic objects,

J. Liu, R. Zhang, H.-S. Fang, M. Gou, H. Fang, C. Wang, S. Xu, H. Yan, and C. Lu, “Target-referenced reactive grasping for dynamic objects,” pp. 8824–8833, June 2023. 2

2023

[39] [39]

Learning ambidextrous robot grasping policies,

J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, vol. 4, no. 26, p. eaau4984, 2019. 2

2019

[40] [40]

Efficient grasping from rgbd images: Learning using a new rectangle representation,

Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from rgbd images: Learning using a new rectangle representation,” in 2011 IEEE International Conference on Robotics and Automation, pp. 3304–3311, 2011. 2

2011

[41] [41]

Deep learning for detecting robotic grasps,

I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015. 2

2015

[42] [42]

Roi-based robotic grasp detection in object overlapping scenes using convolutional neural network,

H. Zhang, X. Lan, X. Zhou, and N. Zheng, “Roi-based robotic grasp detection in object overlapping scenes using convolutional neural network,” CoRR, vol. abs/1808.10313, 2018. 2

Pith/arXiv arXiv 2018

[43] [43]

Learning grasp affordance reasoning through semantic relations,

P. Ardón, È. Pairet, R. P. A. Petrick, S. Ramamoorthy, and K. S. Lohan, “Learning grasp affordance reasoning through semantic relations,” CoRR, vol. abs/1906.09836, 2019. 2, 3

Pith/arXiv arXiv 1906

[44] [44]

Faster recognition of graspable targets defined by orientation in a visual search task,

L. Bamford, N. Klassen, and J. Karl, “Faster recognition of graspable targets defined by orientation in a visual search task,” Experimental Brain Research, vol. 238, 04 2020. 2, 3

2020

[45] [45]

Defgraspsim: Physics-based simulation of grasp outcomes for 3d deformable objects,

I. Huang, Y. Narang, C. Eppner, B. Sundaralingam, M. Macklin, R. Bajcsy, T. Hermans, and D. Fox, “Defgraspsim: Physics-based simulation of grasp outcomes for 3d deformable objects,” IEEE Robotics and Automation Letters, vol. 7, p. 6274–6281, July 2022. 2

2022

[46] [46]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. 3

2009

[47] [47]

Microsoft COCO: common objects in context,

T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014. 3

Pith/arXiv arXiv 2014

[48] [48]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” CoRR, vol. abs/1604.01685, 2016. 3

Pith/arXiv arXiv 2016

[49] [49]

Homebreweddb: RGB-D dataset for 6d pose estimation of 3d objects,

R. Kaskman, S. Zakharov, I. Shugurov, and S. Ilic, “Homebreweddb: RGB-D dataset for 6d pose estimation of 3d objects,” CoRR, vol. abs/1904.03167, 2019. 3

arXiv 1904

[50] [50]

Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects,

M. Sundermeyer, T. Hodan, Y. Labbe, G. Wang, E. Brachmann, B. Drost, C. Rother, and J. Matas, “Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects,” 2023. 3

2022

[51] [51]

Domain randomization for transferring deep neural networks from simulation to the real world,

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” CoRR, vol. abs/1703.06907, 2017. 3

Pith/arXiv arXiv 2017

[52] [52]

A neural algorithm of artistic style,

L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” CoRR, vol. abs/1508.06576, 2015. 3, 7

Pith/arXiv arXiv 2015

[53] [53]

Domain enhanced arbitrary image style transfer via contrastive learning,

Y. Zhang, F. Tang, W. Dong, H. Huang, C. Ma, T.-Y. Lee, and C. Xu, “Domain enhanced arbitrary image style transfer via contrastive learning,” in Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings, SIGGRAPH ’22, ACM, Aug. 2022. 3

2022

[54] [54]

Instant neural graphics primitives with a multiresolution hash encoding,

T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” CoRR, vol. abs/2201.05989, 2022. 4, 5

arXiv 2022

[55] [55]

Extracting triangular 3d models, materials, and lighting from images,

J. Munkberg, J. Hasselgren, T. Shen, J. Gao, W. Chen, A. Evans, T. Müller, and S. Fidler, “Extracting triangular 3d models, materials, and lighting from images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8280–8290, June 2022. 4, 5

2022

[56] [56]

Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising,

J. Hasselgren, N. Hofmann, and J. Munkberg, “Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising,” arXiv:2206.03380, 2022. 4

arXiv 2022

[57] [57]

Key-locked rank one editing for text-to-image personalization,

Y. Tewel, R. Gal, G. Chechik, and Y. Atzmon, “Key-locked rank one editing for text-to-image personalization,” 2023. 6

2023

[58] [58]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” 2023. 7

2023