COFFAIL: A Dataset of Successful and Anomalous Robot Skill Executions in the Context of Coffee Preparation

Alex Mitrevski; Ayush Salunke

arxiv: 2604.18236 · v1 · submitted 2026-04-20 · 💻 cs.RO

Pith reviewed 2026-05-10 04:10 UTC · model grok-4.3

keywords: robot learning, dataset, imitation learning, manipulation, anomalous executions, coffee preparation, bimanual manipulation, kitchen environment

The pith

The COFFAIL dataset supplies robot executions of coffee-preparation skills that include both successes and anomalies to train imitation-learning policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COFFAIL, a collection of robot skill episodes performed while making coffee. Most datasets record only successful runs, but this one also captures anomalous executions in which the robot encounters problems. The episodes were gathered on a physical robot in a kitchen, with a few tasks using two arms at once. The authors demonstrate how the combined data can be fed into imitation learning to produce a robot policy. Readers should care because everyday robot use will involve mistakes, and data that shows those mistakes may help policies recover or avoid them.

Core claim

The COFFAIL dataset comprises successful and anomalous skill execution episodes collected with a physical robot in a kitchen environment for coffee preparation tasks, including a couple performed with bimanual manipulation, and the data is shown to support robot policy learning through imitation learning.

What carries the argument

The COFFAIL dataset of mixed successful and anomalous robot skill executions for coffee preparation.

If this is right

Policies trained on the mixed data can be expected to handle errors that arise during coffee-preparation sequences.
The dataset supplies examples for both single-arm and two-arm manipulation within the same task domain.
Data gathered in a real kitchen setting can serve as a starting point for testing policies outside laboratory conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same approach of recording both successes and failures could be repeated for other household tasks such as cooking or cleaning.
The anomalous episodes might also be used to train separate modules that detect when a skill is going wrong.
Evaluating the learned policy on failures that were never seen during training would test whether the dataset generalizes beyond the collected anomalies.
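The last extension above, testing on failures never seen during training, could be set up as a leave-one-anomaly-type-out split. The following is a minimal sketch under stated assumptions: the episode record layout, the `anomaly` key, and labels such as `failed_grasp` and `spill` are hypothetical and do not describe the actual COFFAIL annotation format.

```python
def leave_one_type_out(episodes, held_out_type):
    """Split episodes so one anomaly type appears only in the test set.

    `episodes` is a list of dicts with an "anomaly" key that is None for
    successful runs and a string label for anomalous ones (assumed layout).
    """
    train = [e for e in episodes if e["anomaly"] != held_out_type]
    test = [e for e in episodes if e["anomaly"] == held_out_type]
    return train, test


# Hypothetical records, not taken from the dataset
episodes = [
    {"id": "ep_000", "anomaly": None},
    {"id": "ep_001", "anomaly": "failed_grasp"},
    {"id": "ep_002", "anomaly": "spill"},
    {"id": "ep_003", "anomaly": "failed_grasp"},
    {"id": "ep_004", "anomaly": None},
]

train, test = leave_one_type_out(episodes, "spill")
print([e["id"] for e in train], [e["id"] for e in test])
```

A policy or anomaly detector trained on `train` never sees the held-out type, so its behaviour on `test` gives a rough indication of generalization beyond the collected anomalies.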

Load-bearing premise

The recorded anomalous executions are representative of the failures a robot would meet in actual use, so that imitation learning on the mixed set yields better policies than success-only training.

What would settle it

Train one imitation-learning policy on only the successful COFFAIL episodes and another on the full set, then measure both policies on the same set of test tasks; if the mixed-data policy shows no higher success rate, the benefit of including anomalies is not demonstrated.
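The comparison described above can be sketched as a test on rollout outcomes. This is an illustrative sketch, not the authors' evaluation protocol: the function name `compare_success_rates`, the rollout counts, and the choice of a one-sided two-proportion z-test are all assumptions, and the normal approximation is rough when only a few rollouts are available.

```python
import math


def compare_success_rates(succ_a, n_a, succ_b, n_b):
    """One-sided two-proportion z-test: is B's success rate higher than A's?

    Returns (rate_a, rate_b, z, p_value), where p_value = P(Z >= z).
    """
    rate_a = succ_a / n_a
    rate_b = succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # All rollouts had the same outcome; no evidence either way
        return rate_a, rate_b, 0.0, 0.5
    z = (rate_b - rate_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return rate_a, rate_b, z, p_value


# Hypothetical rollout counts, NOT results from the paper:
# policy A = trained on successful episodes only, policy B = trained on the full set
rate_a, rate_b, z, p = compare_success_rates(12, 20, 15, 20)
print(f"success-only: {rate_a:.2f}, mixed: {rate_b:.2f}, z={z:.2f}, p={p:.3f}")
```

A large p-value here would correspond to the "no higher success rate" outcome, in which the benefit of including anomalies is not demonstrated.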

Figures

Figures reproduced from arXiv: 2604.18236 by Alex Mitrevski, Ayush Salunke.

**Figure 2.** Illustration of the skill for picking up a cup

**Figure 3.** Illustration of the bimanual pouring skill (the left arm holds the cup and the right arm pours)

**Figure 4.** CNN-based policy network used to illustrate imitation learning on our data

**Figure 5.** Predicted vs. ground-truth actions of the CNN policy

Original abstract

In the context of robot learning for manipulation, curated datasets are an important resource for advancing the state of the art; however, available datasets typically only include successful executions or are focused on one particular type of skill. In this short paper, we briefly describe a dataset of various skills performed in the context of coffee preparation. The dataset, which we call COFFAIL, includes both successful and anomalous skill execution episodes collected with a physical robot in a kitchen environment, a couple of which are performed with bimanual manipulation. In addition to describing the data collection setup and the collected data, the paper illustrates the use of the data in COFFAIL to learn a robot policy using imitation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COFFAIL is a small new dataset of physical robot coffee-prep skills that includes anomalous executions, but the paper stays mostly descriptive with minimal validation.

COFFAIL is a new dataset of robot-executed coffee preparation skills that includes both successful and anomalous runs, collected on physical hardware in a kitchen. A couple of the episodes use bimanual manipulation. The paper does a decent job describing the collection setup and showing one simple use case with imitation learning. Real-world robot data with failures is still relatively rare, so this fills a small gap for people working on robust manipulation policies. The soft spots are that the manuscript is very short and provides almost no quantitative information. There are no numbers on the total number of episodes, how the anomalous ones were identified or labeled, or any metrics showing that training on the mixed data improves performance over successful-only data. The imitation learning illustration is mentioned but not evaluated in any detail. This is the kind of paper that might interest a narrow group of researchers building datasets for robot learning or studying anomaly detection in manipulation. It does not claim to solve a big problem or introduce new methods. I would send it to peer review as a data paper, but the reviewers would likely ask for more details on the data characteristics and at least basic experiments comparing policies trained with and without the anomalous examples.

Referee Report

2 major / 1 minor

Summary. The paper introduces the COFFAIL dataset of successful and anomalous robot skill executions for coffee preparation tasks, collected with a physical robot in a kitchen environment and including some bimanual manipulation episodes. It describes the data collection setup and collected data, and illustrates the dataset's use for learning a robot policy via imitation learning.

Significance. A dataset explicitly containing both successful and anomalous executions fills a notable gap, as most robot manipulation datasets focus solely on successes; this could support research on failure-aware or robust policies. The inclusion of bimanual examples adds value for complex tasks. The basic imitation learning illustration shows one potential use case, though its impact depends on the quality and documentation of the anomalous data.

major comments (2)

[Abstract / data collection setup] The central claim that the dataset includes anomalous skill execution episodes is load-bearing for the paper's contribution, yet the paper gives no details on how anomalies were identified, labeled, or verified (e.g., via human annotation, sensor thresholds, or post-hoc analysis). This leaves reproducibility and representativeness unsupported.
[Imitation learning illustration] The paper states that it illustrates use of the data to learn a robot policy using imitation learning, but it supplies no quantitative results, error metrics, baselines, or comparison of policies trained with vs. without anomalous episodes. This undermines the demonstration of the dataset's utility.

minor comments (1)

Consider adding a summary table of episode counts (successful vs. anomalous) per skill type to improve clarity of the collected data description.
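A summary table of this kind could be generated directly from episode metadata. The sketch below assumes each episode is available as a `(skill, outcome)` pair; that layout, the `summarize` and `format_table` helpers, and every count are hypothetical placeholders. Only the skill names echo skills the paper describes.

```python
from collections import Counter

# Hypothetical episode metadata; the counts are placeholders, not dataset statistics
episodes = [
    ("Cup pickup", "success"),
    ("Cup pickup", "anomalous"),
    ("Pouring", "success"),
    ("Pouring", "success"),
    ("Cup placing", "anomalous"),
]


def summarize(episodes):
    """Return one (skill, n_successful, n_anomalous) row per skill, sorted by name."""
    counts = Counter(episodes)
    skills = sorted({skill for skill, _ in episodes})
    return [(s, counts[(s, "success")], counts[(s, "anomalous")]) for s in skills]


def format_table(rows):
    """Render the rows as a fixed-width plain-text table."""
    lines = [f"{'Skill':<14}{'Successful':>12}{'Anomalous':>12}"]
    for skill, ok, bad in rows:
        lines.append(f"{skill:<14}{ok:>12}{bad:>12}")
    return "\n".join(lines)


rows = summarize(episodes)
print(format_table(rows))
```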

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript describing the COFFAIL dataset. We address each major comment below and outline the revisions we will make.

Referee: Abstract and data collection setup description: the central claim that the dataset includes anomalous skill execution episodes is load-bearing for the paper's contribution, yet no details are provided on how anomalies were identified, labeled, or verified (e.g., via human annotation, sensor thresholds, or post-hoc analysis), leaving reproducibility and representativeness unsupported.

Authors: We agree that providing details on anomaly identification is essential for the dataset's utility and reproducibility. The current manuscript focuses on describing the setup and data but omits this aspect. In the revised manuscript, we will add a subsection under data collection explaining that anomalous episodes were identified through post-hoc review by the researchers, noting deviations such as spills, incorrect placements, or failed grasps based on video recordings and task outcomes.

Revision: yes

Referee: Imitation learning illustration: the paper states it illustrates use of the data to learn a robot policy using imitation learning, but supplies no quantitative results, error metrics, baselines, or comparison of policies trained with vs. without anomalous episodes, which undermines the demonstration of the dataset's utility.

Authors: The imitation learning example serves as a basic illustration of dataset usage for policy learning, consistent with the short paper format. We recognize that quantitative results would better showcase the dataset's value. We will revise to include simple quantitative metrics, such as the success rate of the learned policy on held-out test episodes. However, a full comparison of policies trained with and without anomalous data would necessitate additional training runs and analysis, which we view as extending beyond the illustrative purpose; we will instead add a discussion on how including anomalous data could aid in learning failure-aware policies.

Revision: partial
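The held-out success-rate metric mentioned in this response could be computed along the following lines. This is a sketch under assumptions: the episode IDs, the 20% test fraction, and the `split_episodes` and `success_rate` helpers are hypothetical and are not part of the paper or its data tooling.

```python
import random


def split_episodes(episode_ids, test_fraction=0.2, seed=0):
    """Shuffle episode IDs reproducibly and hold out a fraction for testing.

    Splitting by episode (not by frame) keeps frames of one episode
    from appearing in both the training and the test set.
    """
    ids = sorted(episode_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def success_rate(outcomes):
    """Fraction of rollouts marked successful (True); NaN if there are none."""
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")


episode_ids = [f"ep_{i:03d}" for i in range(10)]  # hypothetical IDs
train_ids, test_ids = split_episodes(episode_ids)
print(len(train_ids), len(test_ids))

# Hypothetical per-rollout outcomes of a learned policy on the held-out tasks
print(success_rate([True, False, True, True]))
```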

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a short data-description paper whose central claim is the release of the COFFAIL dataset (successful plus anomalous executions for coffee-preparation skills, some bimanual) together with a basic illustration of imitation learning on the data. No equations, derivations, fitted parameters, or predictions appear; the only technical step is a straightforward demonstration that the released data can be used for standard imitation learning. No self-citations are load-bearing, no ansatz is smuggled, and no result is renamed or redefined in terms of itself. The argument is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no mathematical models, free parameters, axioms, or invented entities; it is a descriptive dataset paper relying on standard robotics data collection practices.

pith-pipeline@v0.9.0 · 5417 in / 984 out tokens · 28076 ms · 2026-05-10T04:10:25.452349+00:00 · methodology
