CEDAR: Context Engineering for Agentic Data Science

Chris Hinze; Fabian Kuech; Luzian Hahn; Rishiraj Saha Roy

arxiv: 2601.06606 · v2 · submitted 2026-01-10 · 💻 cs.LG · cs.AI

Rishiraj Saha Roy, Chris Hinze, Luzian Hahn, Fabian Kuech

Pith reviewed 2026-05-16 14:54 UTC · model grok-4.3

keywords: CEDAR · context engineering · agentic data science · LLM agents · Kaggle challenges · plan-code blocks · prompt structuring · automated workflows

The pith

CEDAR automates data science tasks by structuring LLM prompts into interleaved plan and code sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CEDAR to automate data science tasks with an agentic LLM setup. It argues that task complexity, data size, and context restrictions can be addressed through effective context engineering. Structured prompts use DS-specific input fields as instructions, and solutions emerge as enumerated sequences of interleaved plan and code blocks produced by separate agents. Function calls generate these elements while keeping raw data local and injecting only aggregate statistics into prompts; iterative code generation and smart history rendering provide fault tolerance and context control. The approach is demonstrated on canonical Kaggle challenges.

Core claim

CEDAR imposes structure on the initial prompt with DS-specific input fields that serve as instructions for the agentic system. The solution is materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts and the corresponding Python code ensure that data stays local, with only aggregate statistics and associated instructions injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of this agentic data scientist is shown on canonical Kaggle challenges.
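
To make the prompt structure concrete, here is a minimal Python sketch. The field names (`task`, `data_description`, `target`, `metric`), the `TaskSpec` and `Block` classes, and the rendering format are all hypothetical; the summary above does not list CEDAR's actual fields or layout. The sketch only illustrates how DS-specific fields and an enumerated sequence of interleaved plan and code blocks could be rendered into one readable context.

```python
from dataclasses import dataclass


# Hypothetical DS-specific input fields; CEDAR's real field names are not given here.
@dataclass
class TaskSpec:
    task: str
    data_description: str
    target: str
    metric: str


@dataclass
class Block:
    kind: str  # "plan" or "code"
    text: str


def render_context(spec: TaskSpec, blocks: list[Block]) -> str:
    """Render the structured fields, then the enumerated plan/code blocks."""
    lines = [
        f"TASK: {spec.task}",
        f"DATA: {spec.data_description}",
        f"TARGET: {spec.target}",
        f"METRIC: {spec.metric}",
        "",
    ]
    for i, block in enumerate(blocks, start=1):
        lines.append(f"[{i}] {block.kind.upper()}")
        lines.append(block.text)
    return "\n".join(lines)


# Illustrative values only.
spec = TaskSpec(
    task="binary classification",
    data_description="one CSV file with passenger records",
    target="survived",
    metric="accuracy",
)
blocks = [
    Block("plan", "Inspect column types and missing values."),
    Block("code", "print(df.dtypes)"),
]
context = render_context(spec, blocks)
```

Numbering every block gives later steps a stable way to refer back to earlier plans or code, which is presumably part of what makes the context readable at any point in the workflow.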

What carries the argument

Structured DS-specific prompt fields combined with an agentic workflow that generates interleaved plan-code blocks via separate LLM agents and function calls to keep data local.
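
The data-locality idea can be sketched as follows. This is an illustration under stated assumptions, not CEDAR's implementation: `summarize_column` and `build_prompt_fragment` are hypothetical helpers, and the toy CSV is made up. The point is that the function runs locally on the raw rows, and only the resulting aggregate text would ever be placed in a prompt.

```python
import csv
import io
import statistics


def summarize_column(rows: list[dict], column: str) -> dict:
    """Compute aggregate statistics locally; no raw values are returned."""
    values = [float(r[column]) for r in rows if r[column] != ""]
    return {
        "column": column,
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": round(statistics.fmean(values), 3),
        "min": min(values),
        "max": max(values),
    }


def build_prompt_fragment(summary: dict) -> str:
    """Only this aggregate text would be injected into an LLM prompt."""
    return (
        f"Column '{summary['column']}': n={summary['count']}, "
        f"missing={summary['missing']}, mean={summary['mean']}, "
        f"range=[{summary['min']}, {summary['max']}]"
    )


# Toy data standing in for a local file; the values are made up for illustration.
raw_csv = "age,fare\n22,7.25\n38,71.28\n,8.05\n35,53.10\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
summary = summarize_column(rows, "age")
fragment = build_prompt_fragment(summary)
```

The prompt size then depends on the number of columns summarized rather than on the number of rows, which is how aggregates help with both data size and context limits.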

If this is right

Data science solutions gain readable structure from the plan-code sequence at every workflow step.
Raw data never enters LLM prompts, reducing context size through local function calls and aggregate statistics only.
Iterative code generation and smart history rendering add fault tolerance to the automation process.
Canonical Kaggle challenges can be solved with the agentic setup, showing reduced need for constant oversight.
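
The summary does not define "smart history rendering", so the following Python sketch shows one plausible reading, labeled as an assumption: the most recent blocks are rendered in full, while older blocks collapse to their first line. The function name `render_history` and the `keep_full` parameter are hypothetical.

```python
def render_history(blocks: list[tuple[str, str]], keep_full: int = 2) -> str:
    """Render (kind, text) blocks: recent ones in full, older ones as one-line stubs.

    Assumed policy for illustration only; CEDAR's actual rule is not described here.
    """
    rendered = []
    cutoff = len(blocks) - keep_full  # blocks numbered <= cutoff are abridged
    for i, (kind, text) in enumerate(blocks, start=1):
        if i > cutoff:
            rendered.append(f"[{i}] {kind}\n{text}")
        else:
            first_line = text.splitlines()[0] if text else ""
            rendered.append(f"[{i}] {kind} (abridged): {first_line}")
    return "\n".join(rendered)


history = render_history(
    [
        ("PLAN", "Load the data.\nCheck the schema."),
        ("CODE", "import csv\nrows = list(csv.DictReader(open('train.csv')))"),
        ("PLAN", "Summarize missing values."),
        ("CODE", "print(sum(1 for r in rows if r['age'] == ''))"),
    ]
)
```

Because block numbers are preserved even for abridged entries, the enumerated structure stays intact while the context grows more slowly than the full trace.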

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to other agentic domains like code refactoring if similar structure is imposed on prompts.
Performance on proprietary internal datasets might differ from public Kaggle results due to unseen data characteristics.
Adding external knowledge retrieval could further stabilize outputs on tasks requiring domain expertise beyond standard benchmarks.

Load-bearing premise

The structured prompts and agentic workflow reliably produce correct and complete solutions for arbitrary data science tasks without frequent human intervention or hitting unmanageable context limits.

What would settle it

Running CEDAR on a Kaggle challenge involving high-dimensional data and advanced feature engineering, then checking whether the generated code runs to completion and produces valid results without manual fixes.
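
A check of this kind could be automated along the following lines. This is a hedged sketch: `run_generated_code`, `attempts_until_success`, and `stub_generator` are hypothetical names, and the stub stands in for an LLM code agent that is not part of this example. It runs each candidate script in a fresh interpreter, feeds the error text back on failure, and reports how many attempts were needed, a rough proxy for "without manual fixes".

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_generated_code(code: str, timeout_s: int = 60) -> tuple[bool, str]:
    """Run code in a fresh interpreter; return (ran_to_completion, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "generated.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stderr


def attempts_until_success(
    generate: Callable[[Optional[str]], str], max_attempts: int = 3
) -> Optional[int]:
    """Return the attempt number that first ran cleanly, or None if all failed."""
    error = None
    for attempt in range(1, max_attempts + 1):
        ok, stderr = run_generated_code(generate(error))
        if ok:
            return attempt
        error = stderr  # fed back to the generator on the next attempt
    return None


# Stub standing in for an LLM code agent: broken first, fixed once it sees an error.
def stub_generator(error: Optional[str]) -> str:
    return "print(1 / 0)" if error is None else "print('ok')"


attempts = attempts_until_success(stub_generator)
```

A clean exit only shows that the code ran; whether the results are valid would still need a task-specific check, such as scoring the output against the challenge metric.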

Original abstract

We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure into the initial prompt with DS-specific input fields, that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts, and for corresponding Python code, ensure that data stays local, and only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CEDAR outlines a structured agentic workflow for data science but reports no metrics or baselines to show it actually works.

CEDAR describes an agentic system that uses context engineering to automate data science tasks with LLMs. The main contribution is the concrete setup: DS-specific fields in the initial prompt, separate agents producing enumerated interleaved plan and code blocks, and function calls that keep raw data local so only statistics reach the LLM. This targets real pain points like context limits and data size in long workflows, plus iterative generation for fault tolerance and history management. The architecture is a practical engineering response to keeping agent traces readable and manageable.

The soft spot is the evaluation. The paper claims viability on canonical Kaggle challenges but supplies no success rates, failure modes, context-length traces, or comparisons to simpler prompting. Without those numbers the central claim that the workflow reliably produces complete solutions with minimal intervention stays untested.

This paper is for engineers and researchers building LLM agents for routine data analysis work. A reader already working on similar systems might borrow the prompt structuring and interleaving pattern. It deserves peer review so the authors can add the missing experiments and the community can assess the implementation details.

Referee Report

2 major / 2 minor

Summary. The paper presents CEDAR, an agentic LLM-based system for automating data science tasks. It addresses challenges of task complexity, data size, computational limits, and context restrictions through context engineering: DS-specific structured input fields in the initial prompt, separate agents generating interleaved enumerated plan/code blocks, function calls that keep raw data local while injecting only aggregates and instructions, and iterative generation with smart history rendering for fault tolerance. Viability is asserted via demonstrations on canonical Kaggle challenges.

Significance. If the workflow reliably produces correct, complete solutions with limited human intervention, the structured context-management approach could advance practical agentic data science automation and offer a template for handling long-horizon DS workflows under LLM constraints. The emphasis on readable interleaved plans and data isolation is a concrete engineering contribution.

major comments (2)

[Abstract and Kaggle demonstrations section] The central claim that context engineering alleviates the listed challenges and that viability is demonstrated on Kaggle challenges is unsupported by any quantitative evidence. No success rates, error rates, completion statistics, context-length traces, failure-mode analysis, or comparisons to simpler prompting baselines appear in the evaluation section.
[§3 (Workflow and context management)] The weakest assumption—that the interleaved plan/code workflow and function-call isolation produce correct and complete solutions for arbitrary DS tasks without frequent human intervention or context overflow—is stated but never tested or quantified. No ablation of the individual engineering choices (structured fields, separate agents, iterative rendering) is provided.
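
The ablation requested in the second comment amounts to toggling each engineering choice independently. As a minimal sketch, assuming the three choices can be switched on and off as boolean flags (the flag names below are hypothetical, not taken from the paper), the full grid of configurations can be enumerated like this:

```python
from itertools import product

# Hypothetical flag names for the three engineering choices named in the comment.
COMPONENTS = ("structured_fields", "separate_agents", "smart_history")


def ablation_configs() -> list[dict]:
    """Enumerate every on/off combination of the components (2**3 = 8 configs)."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((True, False), repeat=len(COMPONENTS))
    ]


configs = ablation_configs()
full_system = configs[0]    # all components enabled
bare_baseline = configs[-1]  # all components disabled
```

Running each configuration on the same set of challenges would separate the contribution of each choice from the effect of the underlying model.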

minor comments (2)

[Abstract] Clarify the exact Kaggle challenges used, the precise success criteria applied, and whether any human oversight occurred during the demonstrations.
[Introduction] Add a short related-work subsection contrasting CEDAR with prior agentic DS or code-generation frameworks to better situate the context-engineering contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that quantitative metrics and component ablations would strengthen the claims regarding context engineering and will revise the manuscript to include them.

Referee: [Abstract and Kaggle demonstrations section] The central claim that context engineering alleviates the listed challenges and that viability is demonstrated on Kaggle challenges is unsupported by any quantitative evidence. No success rates, error rates, completion statistics, context-length traces, failure-mode analysis, or comparisons to simpler prompting baselines appear in the evaluation section.

Authors: We acknowledge that the current evaluation relies on qualitative demonstrations. In the revision we will add quantitative results to the Kaggle section, including success rates and completion statistics across multiple challenges, context-length traces, failure-mode analysis, and direct comparisons to simpler prompting baselines. These additions will provide empirical support for the alleviation of task complexity, data size, and context restrictions. (Revision: yes)
Referee: [§3 (Workflow and context management)] The weakest assumption—that the interleaved plan/code workflow and function-call isolation produce correct and complete solutions for arbitrary DS tasks without frequent human intervention or context overflow—is stated but never tested or quantified. No ablation of the individual engineering choices (structured fields, separate agents, iterative rendering) is provided.

Authors: We agree that the assumptions in §3 require quantification and that ablations of the individual choices are needed. In the revised manuscript we will include ablations isolating the effects of structured input fields, separate plan/code agents, and iterative smart-history rendering. We will also report statistics on solution correctness, frequency of human intervention, and context-overflow events observed in the Kaggle demonstrations. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: system description with no derivations or self-referential claims

The paper presents CEDAR as an engineering system for agentic data science via structured prompts, interleaved plan/code blocks, function-call isolation, and iterative generation. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described content. The central claim of viability on Kaggle challenges is an empirical demonstration rather than a reduction to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The workflow is self-contained as a descriptive architecture without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs can reliably follow DS-specific structured instructions to produce correct plans and code without external verification at each step.

axioms (1)

Domain assumption: LLMs can follow structured instructions for planning and coding in data science tasks.
Invoked throughout the description of the agentic workflow and prompt structure.

pith-pipeline@v0.9.0 · 5472 in / 1079 out tokens · 24533 ms · 2026-05-16T14:54:12.993587+00:00 · methodology
