Pickalo: Leveraging 6D Pose Estimation for Low-Cost Industrial Bin Picking

Alessandro Tarsi; Matteo Mastrogiuseppe; Saverio Taliani; Simone Cortinovis; Ugo Pattacini

arxiv: 2604.04690 · v1 · submitted 2026-04-06 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-10 19:19 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords bin picking · 6D pose estimation · synthetic data training · industrial robotics · grasp planning · multi-view fusion · low-cost sensing · collision checking

The pith

Pickalo shows that a low-cost RGB-D pipeline, pairing synthetic-trained segmentation with zero-shot 6D pose estimation, can reach up to 600 mean picks per hour and 96-99% grasp success in dense industrial bins.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pickalo is a complete bin-picking pipeline that mounts a standard RGB-D camera on a robot wrist to gather multiple scene views. It refines depth maps, segments objects with a Mask-RCNN model trained only on synthetic images, and localizes them with a zero-shot 6D pose estimator. A buffer fuses the estimates across views to cut noise and resolve symmetries before selecting grasps from precomputed candidates. The authors report the system running on a UR5e arm with a parallel-jaw gripper and RealSense camera, sustaining high throughput over 30-minute trials in filled euroboxes. This matters because it removes the need for costly 3D sensors or large real-world datasets in factory settings.

Core claim

The Pickalo pipeline, built on multi-view exploration, synthetic-trained segmentation, zero-shot 6D pose estimation, and a pose buffer for temporal fusion, achieves up to 600 mean picks per hour with 96-99% grasp success when deployed on a UR5e robot with parallel-jaw gripper and Intel RealSense D435i in densely filled euroboxes, with ablation studies confirming gains from depth refinement and pose buffering.
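
As a rough illustration of the two headline metrics, the sketch below computes mean picks per hour (MPPH) and success rate (SR) from the counts of a single run. The function name, the choice to count only successful picks toward MPPH, and the example numbers are assumptions made for illustration; they are not taken from the paper. Note that 600 picks per hour corresponds to one pick every 6 seconds on average.

```python
def throughput_metrics(successes: int, attempts: int, duration_s: float) -> dict:
    """Return MPPH, success rate, and mean cycle time for one run.

    Assumption: MPPH counts successful picks only. The source page does not
    say whether failed attempts are included in the throughput figure.
    """
    if attempts <= 0 or duration_s <= 0:
        raise ValueError("attempts and duration_s must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie between 0 and attempts")
    hours = duration_s / 3600.0
    return {
        "mpph": successes / hours,              # successful picks per hour
        "success_rate": successes / attempts,   # fraction of attempts that succeed
        "mean_cycle_s": duration_s / attempts,  # average seconds per attempt
    }


# Hypothetical 30-minute run; these counts are illustrative, not paper data.
metrics = throughput_metrics(successes=300, attempts=306, duration_s=1800.0)
print(metrics)
```

Reporting the mean cycle time next to MPPH makes it easier to see how much of the budget perception and planning may consume per pick.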

What carries the argument

The pose buffer module that fuses multi-view 6D observations over time to reduce noise and handle symmetries, combined with offline antipodal grasp candidate generation and online utility-based ranking with collision checking.
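
One way to picture the fusion step is the minimal sketch below, which averages several observations of the same object: translations by arithmetic mean, rotations by the eigenvector-based quaternion average of Markley et al. (listed in the reference graph). Whether Pickalo's buffer uses exactly this scheme is not stated on this page, so treat it as an assumption. Data association, outlier rejection, and symmetry handling are deliberately left out, and all function names are hypothetical.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [c / n for c in v]


def average_quaternions(quats, iters=200):
    """Average unit quaternions (w, x, y, z) as the dominant eigenvector of
    M = sum_i q_i q_i^T. Since q q^T is unchanged when q flips sign, q and -q
    (the same rotation) contribute identically."""
    if not quats:
        raise ValueError("need at least one quaternion")
    unit = [_normalize(q) for q in quats]
    m = [[sum(q[i] * q[j] for q in unit) for j in range(4)] for i in range(4)]
    # Power iteration. Assumes the start vector is not exactly orthogonal to
    # the dominant eigenvector, which holds for clustered observations.
    v = unit[0]
    for _ in range(iters):
        v = _normalize([sum(m[i][j] * v[j] for j in range(4)) for i in range(4)])
    # Return a canonical sign (non-negative scalar part).
    return tuple(v) if v[0] >= 0 else tuple(-c for c in v)


def fuse_poses(observations):
    """Fuse (translation_xyz, quaternion_wxyz) observations of ONE object."""
    if not observations:
        raise ValueError("need at least one observation")
    n = len(observations)
    t = tuple(sum(obs[0][k] for obs in observations) / n for k in range(3))
    q = average_quaternions([obs[1] for obs in observations])
    return t, q


# Three hypothetical views of one part: +10 and -10 degrees about z, plus an
# identity rotation stored with a flipped quaternion sign.
a = math.radians(5.0)
views = [
    ((0.100, 0.200, 0.050), (math.cos(a), 0.0, 0.0, math.sin(a))),
    ((0.102, 0.198, 0.052), (math.cos(a), 0.0, 0.0, -math.sin(a))),
    ((0.101, 0.199, 0.051), (-1.0, 0.0, 0.0, 0.0)),
]
t_fused, q_fused = fuse_poses(views)
print(t_fused, q_fused)
```

The outer-product form is chosen here because a naive component-wise mean breaks when the estimator returns `q` in one view and `-q` in another.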

If this is right

The system sustains performance across 30-minute continuous runs in realistic dense clutter.
Refined depth maps from BridgeDepth improve collision reasoning and raise overall throughput.
Active multi-view capture from the wrist camera overcomes occlusions that single-view methods cannot handle.
Precomputed grasp sets queried online enable fast planning once poses are buffered.
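
The online query over precomputed grasps can be sketched as a rank-then-check loop: score every candidate, sort by score, and return the first one that passes the collision check. The paper's text names a pose-confidence score and a stacking-height score; the weighted-sum form, the weights, the `GraspCandidate` type, and the stand-in collision predicate below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class GraspCandidate:
    object_id: int
    grasp_id: int
    s_conf: float    # normalized pose confidence in [0, 1]
    s_height: float  # normalized stacking height in [0, 1]


def utility(g: GraspCandidate, w_conf: float = 0.5, w_height: float = 0.5) -> float:
    # Hypothetical weighted sum; the real utility may use other terms or weights.
    return w_conf * g.s_conf + w_height * g.s_height


def select_grasp(
    candidates: Iterable[GraspCandidate],
    in_collision: Callable[[GraspCandidate], bool],
) -> Optional[GraspCandidate]:
    """Walk candidates in descending utility; return the first feasible one."""
    for g in sorted(candidates, key=utility, reverse=True):
        if not in_collision(g):
            return g
    return None  # no feasible grasp in this scene state


# Illustrative candidates; the collision check is a stand-in lookup.
candidates = [
    GraspCandidate(object_id=1, grasp_id=0, s_conf=0.90, s_height=0.80),
    GraspCandidate(object_id=2, grasp_id=0, s_conf=0.70, s_height=0.90),
    GraspCandidate(object_id=3, grasp_id=0, s_conf=0.95, s_height=0.20),
]
blocked = {(1, 0)}
best = select_grasp(candidates, lambda g: (g.object_id, g.grasp_id) in blocked)
print(best)
```

Checking collisions lazily in rank order means the expensive test runs only until the first feasible grasp is found, rather than for every candidate.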

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The synthetic-data approach could cut data-collection costs for other robotic tasks that currently require real-world labels.
Factories might adapt the pipeline to new part types by updating only the offline grasp database rather than retraining perception models.
Longer deployments or different bin geometries would test whether the temporal fusion continues to stabilize poses under changing lighting or wear.

Load-bearing premise

A segmentation model trained only on photorealistic synthetic data plus a zero-shot 6D pose estimator can produce accurate enough poses for reliable grasping in real cluttered and occluded industrial scenes without any real-world fine-tuning.

What would settle it

A side-by-side test in the same densely filled euroboxes in which replacing the synthetic-trained Mask-RCNN and zero-shot SAM-6D with versions fine-tuned on real images yields substantially lower pose error or substantially higher grasp success rates.

Figures

Figures reproduced from arXiv: 2604.04690 by Alessandro Tarsi, Matteo Mastrogiuseppe, Saverio Taliani, Simone Cortinovis, Ugo Pattacini.

**Figure 1.** The experimental setup consists of a UR5e manipulator with a consumer-grade camera attached to the wrist. The bin is a standard eurobox heavily filled with small metallic objects.

**Figure 2.** Overview of the presented pipeline. A stereo-pair image is acquired and processed by the depth estimation block to obtain an enhanced depth reconstruction. The resulting depth is aligned to the left RGB frame and provided to the 6D Pose Estimation model, together with the object model CAD. The scene state is reconstructed by combining pose estimates across multiple views, occupied voxels, and static object…

**Figure 3.** RealSense on the left. BridgeDepth on the right.

**Figure 5.** The sorted grasp poses are checked following the ranking order until a feasible grasp pose is found. Grasp poses leading to a collision are discarded.

**Figure 7.** Test objects used in the evaluation: (A) Square, (B) Cylindrical, and (C) Complex geometry.

**Figure 6.** The grasping pipeline is parallelized to elaborate the perception and planning blocks for the current iteration, while the robot executes the grasping and releasing trajectories computed at the previous iteration.

**Figure 8.** Object pose estimation error distribution on XYZ-IBD dataset, enabling and disabling the pose buffer module.

**Figure 9.** [full-page image; no caption extracted]

Original abstract

Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask-RCNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, a utility-based ranking and fast collision checking are queried for the grasp planning. Deployed on a UR5e with a parallel-jaw gripper and an Intel RealSense D435i, Pickalo achieves up to 600 mean picks per hour with 96-99% grasp success and robust performance over 30-minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long-term stability and throughput in realistic industrial conditions. Videos are available at https://mesh-iit.github.io/project-jl2-camozzi/
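
For readers unfamiliar with the term, an antipodal grasp is a pair of contacts whose connecting line lies inside the friction cone at both contacts, so a parallel-jaw gripper can squeeze without slipping. The sketch below tests that condition for one contact pair. It is a textbook-style check under stated assumptions (outward unit normals, Coulomb friction, a hypothetical default friction coefficient) and is not claimed to be the criterion used in the paper's offline generator.

```python
import math


def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def _angle(u, v):
    """Angle in radians between two unit vectors."""
    d = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, d)))


def is_antipodal(p1, n1, p2, n2, mu=0.4):
    """Check the antipodal condition for contacts p1, p2.

    n1, n2 are OUTWARD surface normals at the contacts.
    mu is the friction coefficient (0.4 is a hypothetical default).
    """
    axis = _unit(tuple(b - a for a, b in zip(p1, p2)))  # direction p1 -> p2
    inward1 = tuple(-c for c in _unit(n1))
    cone_half_angle = math.atan(mu)
    # Finger 1 pushes along +axis, which must lie within the cone around the
    # inward normal at p1. Finger 2 pushes along -axis, which is equivalent to
    # +axis lying within the cone around the outward normal at p2.
    return (
        _angle(axis, inward1) <= cone_half_angle
        and _angle(axis, _unit(n2)) <= cone_half_angle
    )


# Opposite faces of a 20 mm wide block, contacts directly facing each other.
ok = is_antipodal((-0.01, 0.0, 0.0), (-1, 0, 0), (0.01, 0.0, 0.0), (1, 0, 0))
# Same faces, but the second contact is shifted so the axis is tilted 45 degrees.
skewed = is_antipodal((-0.01, 0.0, 0.0), (-1, 0, 0), (0.01, 0.02, 0.0), (1, 0, 0))
print(ok, skewed)
```

A larger friction coefficient widens the cone and admits more tilted contact pairs, which is one reason candidate sets are usually curated after generation.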

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Pickalo shows a working low-cost bin-picking system hitting up to 600 mean picks per hour on real hardware by gluing together existing vision pieces, but the sim-to-real pose accuracy in clutter is the unverified link.

The main thing to know is that this paper reports a deployed pipeline on a UR5e with a wrist-mounted RealSense D435i that reaches up to 600 mean picks per hour and 96-99% grasp success over 30-minute runs on densely packed euroboxes. It does this without expensive 3D sensors or real-world fine-tuning of the vision stack.

They combine BridgeDepth for refined depth maps, a Mask-RCNN trained only on photorealistic synthetic data for segmentation, zero-shot SAM-6D for 6D poses, a multi-view pose buffer to reduce noise and handle symmetries, and offline antipodal grasp sets with online utility ranking plus collision checks. The ablations on depth refinement and the buffer are a plus because they tie directly to measured throughput and stability gains in the physical setup. That gives the work concrete value for anyone trying to lower the cost of industrial automation.

The soft spot is exactly where the stress test points: the central performance numbers rest on the assumption that a synthetic-trained Mask-RCNN plus zero-shot SAM-6D will deliver poses accurate enough (errors under 5° in rotation or 5 mm in translation) for reliable grasp selection in occluded, cluttered bins under real lighting. The abstract and reported runs do not include per-instance pose error distributions, comparisons against real-data baselines, or detailed failure breakdowns, so it is unclear how much the multi-view buffer is masking initial errors versus truly solving them. If those errors are larger than assumed on a non-trivial fraction of objects, the downstream ranking and collision checks would not sustain the quoted success rates.

This is applied systems work aimed at robotics practitioners and factory automation teams who need practical numbers rather than new theory. A reader looking for evidence that current off-the-shelf models can close the gap to industrial rates will find usable details here. The paper deserves peer review because the hardware deployment and throughput metrics are specific enough to be checked and improved, even with the vision generalization questions left open.

Referee Report

3 major / 2 minor

Summary. Pickalo is a modular 6D pose-based bin-picking pipeline for low-cost industrial settings. It mounts an Intel RealSense D435i on a UR5e wrist, refines depth via BridgeDepth, segments instances with a Mask-RCNN trained exclusively on photorealistic synthetic data, estimates poses zero-shot with SAM-6D, fuses multi-view observations in a pose buffer to mitigate noise and symmetries, and selects from pre-generated antipodal grasps using utility ranking plus fast collision checks. The system reports up to 600 mean picks per hour with 96-99% grasp success over 30-minute runs on densely filled euroboxes, with ablations showing benefits from depth refinement and the pose buffer.

Significance. If the reported throughput and success rates hold under rigorous controls, the work is significant for demonstrating that low-cost RGB-D hardware plus synthetic-only training and zero-shot estimators can reach industrial bin-picking performance. This could lower barriers to automation in manufacturing. The emphasis on multi-view fusion for long-term stability and the open video release are positive for reproducibility.

major comments (3)

[§5.1] §5.1 (Experimental Results): The headline metrics (600 picks/hr, 96-99% grasp success) rest on the assumption that the synthetic-trained Mask-RCNN + zero-shot SAM-6D pipeline produces sufficiently accurate 6D poses in occluded, cluttered euroboxes. No quantitative pose accuracy metrics (e.g., ADD-S, mean rotation/translation error) are reported on the real test scenes, so it is impossible to confirm that downstream grasp ranking and collision checking are operating on reliable inputs rather than tolerating large errors.
[§5.2] §5.2 (Ablation Studies): The ablations credit the pose buffer and BridgeDepth for improved long-term stability and throughput, yet they presuppose that the per-view SAM-6D estimates are already reliable enough for fusion to be effective. Without before/after pose-error statistics or failure-case analysis under varying occlusion and lighting, the contribution of the buffer cannot be isolated from possible scene-selection effects.
[Table 1] Table 1 / Results: The 30-minute run success rates lack error bars, trial counts, or explicit description of scene randomization and post-hoc selection criteria. This makes it difficult to judge whether the 96-99% figure generalizes or reflects favorable conditions.
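
The pose-accuracy metrics named in the first major comment can be stated concretely. The sketch below gives the rotation geodesic error, the translation error, and ADD-S (the mean distance from each ground-truth model point to its nearest estimated model point, which tolerates object symmetries). It is a plain-Python illustration with hypothetical function names; evaluating it on real scenes would still require ground-truth poses, which the page says are unavailable for the test bins.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_est^T R_gt) equals the element-wise product summed over entries.
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(c))


def translation_error(t_est, t_gt):
    """Euclidean distance between translations (same units as the inputs)."""
    return math.dist(t_est, t_gt)


def _transform(R, t, p):
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def add_s(model_points, R_est, t_est, R_gt, t_gt):
    """ADD-S: mean closest-point distance between the two posed models."""
    est = [_transform(R_est, t_est, p) for p in model_points]
    gt = [_transform(R_gt, t_gt, p) for p in model_points]
    return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)


# Hypothetical example: a 4-fold symmetric point set, estimate off by 90 deg.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
square = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
zero = (0.0, 0.0, 0.0)

rot_err = rotation_error_deg(Rz90, I3)
adds = add_s(square, Rz90, zero, I3, zero)
print(rot_err, adds)
```

In this example the rotation error is large while ADD-S is zero, which is why symmetric industrial parts are usually scored with a symmetry-aware metric.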

minor comments (2)

[Abstract] The abstract and §3.2 refer to 'robust performance' without defining the exact failure modes (e.g., dropped objects, collisions) that were counted as unsuccessful grasps.
[Figure 3] Figure captions for the system overview and grasp examples would benefit from explicit scale bars or camera-to-object distance annotations to help readers assess the clutter density.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments correctly identify areas where additional detail and analysis would strengthen the presentation of our results. We address each major comment below and indicate the revisions we will make.

Referee: [§5.1] §5.1 (Experimental Results): The headline metrics (600 picks/hr, 96-99% grasp success) rest on the assumption that the synthetic-trained Mask-RCNN + zero-shot SAM-6D pipeline produces sufficiently accurate 6D poses in occluded, cluttered euroboxes. No quantitative pose accuracy metrics (e.g., ADD-S, mean rotation/translation error) are reported on the real test scenes, so it is impossible to confirm that downstream grasp ranking and collision checking are operating on reliable inputs rather than tolerating large errors.

Authors: We agree that direct quantitative pose accuracy metrics (such as ADD-S or mean rotation/translation error) on real cluttered scenes would provide valuable additional evidence. However, obtaining reliable ground-truth 6D poses for objects in densely occluded industrial scenes is extremely difficult without specialized motion-capture or marker-based setups that would alter the experimental conditions. Our evaluation therefore centers on the end-to-end task metric of grasp success rate, which directly measures whether the estimated poses are sufficiently accurate for the downstream planning and execution steps. In the revised manuscript we will add an explicit discussion of this evaluation choice, including the practical challenges of real-world pose ground truth, and we will include qualitative visualizations of estimated poses on representative real scenes to support the quantitative grasp results. revision: partial
Referee: [§5.2] §5.2 (Ablation Studies): The ablations credit the pose buffer and BridgeDepth for improved long-term stability and throughput, yet they presuppose that the per-view SAM-6D estimates are already reliable enough for fusion to be effective. Without before/after pose-error statistics or failure-case analysis under varying occlusion and lighting, the contribution of the buffer cannot be isolated from possible scene-selection effects.

Authors: The referee is correct that isolating the precise contribution of the pose buffer would be stronger with before/after pose-error statistics. Because ground-truth poses are unavailable on the real test scenes, we cannot compute such statistics. The current ablations instead demonstrate the buffer’s benefit through measurable improvements in long-term throughput and reduced failure accumulation over the 30-minute runs. In the revision we will expand the ablation discussion to include a more detailed qualitative failure-case analysis (e.g., examples of symmetry-induced drift that the buffer corrects) and will clarify how the buffer’s multi-view fusion mitigates per-view noise under the observed occlusion and lighting conditions. revision: partial
Referee: [Table 1] Table 1 / Results: The 30-minute run success rates lack error bars, trial counts, or explicit description of scene randomization and post-hoc selection criteria. This makes it difficult to judge whether the 96-99% figure generalizes or reflects favorable conditions.

Authors: We acknowledge that the current description of the experimental protocol is insufficient for full reproducibility and assessment of generalizability. In the revised manuscript we will (i) report the number of independent 30-minute trials performed, (ii) add error bars or standard deviations to the success-rate figures where appropriate, (iii) provide a detailed description of the scene-randomization procedure (object placement, density variation, and lighting), and (iv) explicitly state any post-hoc selection or reporting criteria. These additions will allow readers to better evaluate the robustness of the reported performance. revision: yes
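
One plausible form for the promised error bars is a binomial confidence interval on the success rate. The sketch below uses the Wilson score interval, which behaves better than the normal approximation when the rate is close to 100%. The choice of interval, the function name, and the counts in the example are assumptions for illustration and do not come from the paper. Picks within a single run may also be correlated, so an interval like this is best read as a rough lower bound on the uncertainty.

```python
import math


def wilson_interval(successes: int, attempts: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie between 0 and attempts")
    p = successes / attempts
    z2 = z * z
    denom = 1.0 + z2 / attempts
    center = (p + z2 / (2.0 * attempts)) / denom
    half = z * math.sqrt(p * (1.0 - p) / attempts + z2 / (4.0 * attempts**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 294 successful grasps out of 300 attempts.
low, high = wilson_interval(successes=294, attempts=300)
print(f"success rate 0.980, 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting the attempt count alongside the interval lets readers judge whether a difference such as 96% versus 99% is larger than the sampling noise.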

Circularity Check

0 steps flagged

No circularity: empirical hardware performance claims with no derivations or self-referential predictions

The paper presents a modular bin-picking system using off-the-shelf components (Mask-RCNN trained on synthetic data, zero-shot SAM-6D, BridgeDepth, multi-view pose buffer) and reports measured throughput and success rates on physical hardware. No equations, fitted parameters, or predictions are defined in terms of the paper's own outputs. Ablations compare empirical variants (depth refinement, pose buffer) against baselines but do not reduce any claimed result to a quantity defined inside the paper by construction. The central claims are falsifiable external measurements (picks per hour, grasp success over 30-minute runs) rather than internal consistency arguments.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The system rests on standard computer-vision and robotics assumptions rather than new mathematical axioms or invented physical entities. No free parameters are introduced in the abstract claims.

axioms (2)

domain assumption: Synthetic photorealistic images are sufficient to train a Mask-RCNN that generalizes to real industrial clutter and lighting.
Stated in the abstract as the training regime for the segmentation model.
domain assumption: Zero-shot SAM-6D pose estimation produces usable 6D poses on real objects without domain-specific fine-tuning.
Invoked when describing object localization in the deployed pipeline.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Object instances are segmented with a Mask-RCNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1]

6d pose estimation of textureless shiny objects using random ferns for bin-picking

Jos ´e Jeronimo Rodrigues, Jun-Sik Kim, Makoto Furukawa, Jo ˜ao Xavier, Pedro Aguiar, and Takeo Kanade. 6d pose estimation of textureless shiny objects using random ferns for bin-picking. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3334–3341, 2012

work page 2012
[2]

Model globally, match locally: Efficient and robust 3d object recognition

Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ili ´c. Model globally, match locally: Efficient and robust 3d object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 998–1005, 2010

work page 2010
[3]

Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes

Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, and Vincent Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In2011 International Conference on Computer Vision, pages 858–865, 2011

work page 2011
[4]

Cad-based recognition of 3d objects in monocular images

Markus Ulrich, Christian Wiedemann, and Carsten Steger. Cad-based recognition of 3d objects in monocular images. In2009 IEEE Inter- national Conference on Robotics and Automation, pages 1191–1198, 2009. 9

work page 2009
[5]

Cad-based pose estimation for random bin-picking of multiple objects using a rgb-d camera

Cheng-Hei Wu, Sin-Yi Jiang, and Kai-Tai Song. Cad-based pose estimation for random bin-picking of multiple objects using a rgb-d camera. In2015 15th International Conference on Control, Automation and Systems (ICCAS), pages 1645–1649, 2015

work page 2015
[6]

A novel robotic grasp detection framework using low-cost rgb-d camera for industrial bin picking.IEEE Trans

Han Sun, Zhuangzhuang Zhang, Haili Wang, Yizhao Wang, and Qixin Cao. A novel robotic grasp detection framework using low-cost rgb-d camera for industrial bin picking.IEEE Trans. on Instrumentation and Measurement, 73:1–12, 2024

work page 2024
[7]

Xungao Zhong, Tao Gong, Junzhi Yu, Jiaguo Luo, Chengxian Zhou, Xunyu Zhong, and Qiang Liu. Region-aware grasping for stacked work- pieces: A 6d-wise label self-generation method and robust evaluation strategy.IEEE Transactions on Automation Science and Engineering, PP:1–1, 01 2025

work page 2025
[8]

Ang, and Gregory S

Peiyuan Ni, Chee Meng Chew, Marcelo H. Ang, and Gregory S. Chirikjian. Reasoning and learning a perceptual metric for self-training of reflective objects in bin-picking with a low-cost camera.IEEE Robotics and Automation Letters, 10(10):10458–10465, October 2025

work page 2025
[9]

Accurate and efficient zero-shot 6d pose estimation with frozen foundation models, 2025

Andrea Caraffa, Davide Boscaini, and Fabio Poiesi. Accurate and efficient zero-shot 6d pose estimation with frozen foundation models, 2025

work page 2025
[10]

Foundationpose: Unified 6d pose estimation and tracking of novel objects, 2024

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects, 2024

work page 2024
[11]

SAM-6D: Segment anything model meets zero-shot 6d object pose estimation.arXiv preprint arXiv:2311.15707, 2023

Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment anything model meets zero-shot 6d object pose estimation.arXiv preprint arXiv:2311.15707, 2023

work page arXiv 2023
[12]

Waslander

Jun Yang, Dong Li, and Steven L. Waslander. Probabilistic multi- view fusion of active stereo depth maps for robotic bin-picking.IEEE Robotics and Automation Letters, 6(3):4472–4479, 2021

work page 2021
[13]

Foundationstereo: Zero-shot stereo matching

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. CVPR, 2025

work page 2025
[14]

DEFOM-Stereo: Depth Foundation Model Based Stereo Matching

Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, and Rui Huang. DEFOM-Stereo: Depth Foundation Model Based Stereo Matching . In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21857–21867, Los Alamitos, CA, USA, June 2025. IEEE Computer Society

work page 2025
[15]

BridgeDepth: Bridging monocular and stereo reasoning with latent alignment

Tongfan Guan, Jiaxin Guo, Chen Wang, and Yun-Hui Liu. BridgeDepth: Bridging monocular and stereo reasoning with latent alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27681–27691, 2025. Highlight paper

work page 2025
[16]

Deep learning based 6-dof antipodal grasp planning from point cloud in random bin- picking task using single-view

Tat Hieu Bui, Yeong Gwang Son, Seung Jae Moon, Quang Huy Nguyen, Issac Rhee, Juyong Hong, and Hyouk Ryeol Choi. Deep learning based 6-dof antipodal grasp planning from point cloud in random bin- picking task using single-view. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

work page 2024
[17]

AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 2023

Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 2023

work page 2023
[18]

PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes

Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes. InProceedings of Robotics: Science and Systems (RSS), 2018

work page 2018
[19]

Sinha, and Pascal Fua

Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. Real-time seamless single shot 6d object pose prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 292–301, 2018

work page 2018
[20]

Megapose: 6d pose estimation of novel objects via render & compare, 2022

Yann Labb ´e, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare, 2022

work page 2022
[21]

A sim-to-real object recognition and localization framework for industrial robotic bin picking.IEEE Robotics and Automation Letters, 7(2):3961–3968, 2022

Xianzhi Li, Rui Cao, Yidan Feng, Kai Chen, Biqi Yang, Chi-Wing Fu, Yichuan Li, Qi Dou, Yun-Hui Liu, and Pheng-Ann Heng. A sim-to-real object recognition and localization framework for industrial robotic bin picking.IEEE Robotics and Automation Letters, 7(2):3961–3968, 2022

work page 2022
[22]

Densefusion: 6d object pose estimation by iterative dense fusion

Chen Wang, Danfei Xu, Yuke Zhu, Roberto Mart ´ın-Mart´ın, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3343–3352, 2019

work page 2019
[23]

Waslander

Jun Yang, Yizhou Gao, Dong Li, and Steven L. Waslander. Robi: A multi-view dataset for reflective objects in robotic bin-picking. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9788–9795, 2021

work page 2021
[24]

A low-cost, high-speed, and robust bin picking system for factory automation enabled by a non-stop, multi-view, and active vision scheme

Xingdou Fu, Lin Miao, Yasuhiro Ohnishi, Yuki Hasegawa, and Masaki Suwa. A low-cost, high-speed, and robust bin picking system for factory automation enabled by a non-stop, multi-view, and active vision scheme. InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11566–11573, 2024

work page 2024
[25]

Cosy- Pose: Consistent multi-view multi-object 6d pose estimation

Yann Labb ´e, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Cosy- Pose: Consistent multi-view multi-object 6d pose estimation. InPro- ceedings of the European Conference on Computer Vision (ECCV), 2020

work page 2020
[26]

Lessons and winning solutions in industrial object detection and pose estimation from the 2025 bin-picking perception challenge

Ziqin Huang, Chengxi Li, Yingyue Li, Xingyu Liu, Chenyangguang Zhang, Ruida Zhang, Bowen Fu, Xinggang Hu, Yun Qu, Mengge Liu, Yixiu Mao, Wendong Huang, Gu Wang, and Xiangyang Ji. Lessons and winning solutions in industrial object detection and pose estimation from the 2025 bin-picking perception challenge. InProceedings of the IEEE/CVF International Confe...

work page 2025
[27]

A low-cost, high-speed, and robust bin picking system for factory automation enabled by a non-stop, multi-view, and active vision scheme

Xingdou Fu, Lin Miao, Yasuhiro Ohnishi, Yuki Hasegawa, and Masaki Suwa. A low-cost, high-speed, and robust bin picking system for factory automation enabled by a non-stop, multi-view, and active vision scheme. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11566–11573, 2024

work page 2024
[28]

Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything, 2023

work page 2023
[29]

Strobl, Matthias Humt, and Rudolph Triebel

Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A procedural pipeline for photorealistic rendering.Journal of Open Source Software, 8(82):4901, 2023

work page 2023
[30]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Girshick. Mask r-cnn. In2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017

work page 2017
[31]

Averaging quaternions.Journal of Guidance, Control, and Dynamics, 30(4):1193–1197, 2007

F Landis Markley, Yang Cheng, John L Crassidis, and Yaakov Oshman. Averaging quaternions.Journal of Guidance, Control, and Dynamics, 30(4):1193–1197, 2007

work page 2007
[32]

On object symmetries and 6d pose estimation from images

Giorgia Pitteri, Micha ¨el Ramamonjisoa, Slobodan Ilic, and Vincent Lepetit. On object symmetries and 6d pose estimation from images. In2019 International conference on 3D vision (3DV), pages 614–622. IEEE, 2019

work page 2019
[33]

Kilian Kleeberger, Florian Roth, Richard Bormann, and Marco F. Huber. Automatic grasp pose generation for parallel jaw grippers, 2021

work page 2021
[34]

Xyz-ibd: A high-precision bin-picking dataset for object 6d pose estimation capturing real-world industrial complexity, 2025

Junwen Huang, Jizhong Liang, Jiaqi Hu, Martin Sundermeyer, Peter KT Yu, Nassir Navab, and Benjamin Busam. Xyz-ibd: A high-precision bin-picking dataset for object 6d pose estimation capturing real-world industrial complexity, 2025. Alessandro Tarsireceived the B.S. and M.S. de- grees in Automation Engineering from the Univer- sity Politecnica delle Marche...

