Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Benedictus Kent Rachmat; Ihsan Ullah; Isabelle Guyon; Joaquin Vanschoren; Jonathan Lebensold; Leonardo Martins Bianco; Luis Oala; Omar Benjelloun; Peyman Vahidi; Sebastian Lobentanzer

arxiv: 2605.29786 · v1 · pith:QIT3JDI2new · submitted 2026-05-28 · 💻 cs.AI

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren

classification 💻 cs.AI

keywords format, croissant, learning, machine, reproducibility, tasks, automated, brittle

abstract

Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.
