Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

Josefa Lia Stoisser; Kaspar Märtens; Marc Boubnovski Martell; Robert Kitchen; Sidsel Boldsen

arxiv: 2605.09698 · v1 · submitted 2026-05-10 · 💻 cs.AI


Pith reviewed 2026-05-12 03:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords: data science agents · task ambiguity · benchmark · target underspecification · objective ambiguity · clarification questions · silent failures · agent evaluation

The pith

Data-science agents fail more by silently choosing unintended task framings than by execution errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs Ambig-DS to measure how data-science agents handle underspecified tasks. It pairs fully specified problems from existing benchmarks with controlled ambiguous versions that differ only in target or objective details. Across multiple agents, ambiguity produces clear performance drops driven by wrong but executable choices rather than broken code. A single clarifying question restores much of the lost ground when agents use it correctly, yet the same agents cannot reliably decide whether to ask. The work concludes that standard evaluations miss the framing-recognition step that now limits agent reliability.

Core claim

The central claim is that task-framing ambiguity, rather than pipeline execution, constitutes the primary bottleneck for data-science agents. Controlled ambiguous variants of prediction-target and evaluation-objective tasks, each verified by a human-and-LLM pipeline to admit multiple decision-relevant interpretations, produce systematic degradation across agents. Failures appear as silent commitments to incorrect targets or metrics instead of execution errors, and the availability of one clarification question recovers substantial performance under idealized conditions while exposing unreliable decisions about when to query.

What carries the argument

Ambig-DS diagnostic suites of paired original and edited ambiguous tasks, scored by the source benchmarks' own evaluators, with a verification pipeline confirming multiple plausible interpretations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world deployments could produce undetected incorrect analyses when agents default to plausible but unintended framings.
Future agent designs may require explicit uncertainty estimation over possible task interpretations rather than immediate pipeline construction.
The same pairing technique could be applied to non-tabular domains such as code generation or scientific workflow tasks to test framing robustness more broadly.
Interactive evaluation protocols that reward both correct framing detection and appropriate clarification requests would better reflect deployment needs.

Load-bearing premise

The controlled edits produce ambiguous variants that genuinely support multiple plausible interpretations with different decision consequences.

What would settle it

Agents would achieve identical success rates on the ambiguous variants as on the original clear tasks, or they would ask clarifying questions exactly on ambiguous tasks without over-asking on clear ones.
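
The two conditions above can be made concrete with a small scoring sketch. It is illustrative only: the `Run` record, its field names, and the toy numbers are assumptions, not the paper's data format, and it assumes a higher-is-better score, which will not hold for every source-benchmark metric.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run on one task variant (hypothetical record layout)."""
    task_id: str
    ambiguous: bool  # True for the edited variant, False for the original task
    score: float     # source-evaluator score; assumed higher-is-better here
    asked: bool      # whether the agent used its one clarifying question


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else float("nan")


def summarize(runs):
    """Score drop under ambiguity plus the two asking rates."""
    clear = [r for r in runs if not r.ambiguous]
    ambig = [r for r in runs if r.ambiguous]
    return {
        # Zero would mean ambiguity cost nothing on average.
        "score_drop": mean(r.score for r in clear) - mean(r.score for r in ambig),
        # Share of ambiguous tasks on which the agent asked (ideal: 1.0).
        "clarification_recall": mean(1.0 if r.asked else 0.0 for r in ambig),
        # Share of fully specified tasks on which it asked anyway (ideal: 0.0).
        "unnecessary_clarification": mean(1.0 if r.asked else 0.0 for r in clear),
    }


# Toy data, invented for illustration.
runs = [
    Run("t1", ambiguous=False, score=0.9, asked=False),
    Run("t1", ambiguous=True, score=0.5, asked=True),
    Run("t2", ambiguous=False, score=0.8, asked=True),
    Run("t2", ambiguous=True, score=0.8, asked=False),
]
summary = summarize(runs)
print(summary)
```

Under this sketch, the settling outcome would be a `score_drop` near zero, or a `clarification_recall` of 1.0 together with an `unnecessary_clarification` of 0.0.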

Figures

Figures reproduced from arXiv: 2605.09698 by Josefa Lia Stoisser, Kaspar Märtens, Marc Boubnovski Martell, Robert Kitchen, Sidsel Boldsen.

**Figure 1.** Ambig-DS construction pipeline. Tasks are transformed through two pathways: target ambiguity (manipulating prompt and data) and objective ambiguity (manipulating the prompt only). Final variants are filtered through human and LLM verification to ensure the ambiguity is decision-relevant and the underlying task is preserved.

**Figure 2.** Ask-policy sensitivity. Each point is one (model, policy) on one suite. Vertical axis: clarification recall on ambiguous tasks (↑). Horizontal axis: unnecessary clarification on fully specified tasks (↓). Perfect calibration is the top-left ⋆. Permissive (filled circle) yields high recall but high false positives; conservative (open square) reduces over-asking but suppresses legitimate asking, especially o…

**Figure 3.** Conceptual overview: from specified task to unflagged misframing. Ambig-DS converts fully specified tasks (left) into ambiguous observations that remain executable (middle). An agent acting without clarification may produce a technically valid pipeline under an unintended framing, resulting in an unflagged misframing (right).

**Figure 4.** Distribution of decoy-quality diagnostics across retained target-ambiguity tasks. Most…

Original abstract

As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Ambig-DS adds two controlled benchmarks that expose silent framing errors in DS agents, but the verification that those errors stem from genuine decision-relevant ambiguity is under-documented.

The paper's main point is straightforward: current DS-agent benchmarks only check if the pipeline executes, not whether the agent correctly read the task. Ambig-DS-Target and Ambig-DS-Objective create paired versions of tasks from DSBench and MLE-bench where the target or the evaluation objective is deliberately underspecified through small edits. The original evaluators stay in place, so scores remain comparable. Across five agents the results show performance drops come mostly from wrong-target or wrong-metric submissions rather than broken code, and one clarifying question recovers much of the loss under ideal conditions. Agents also fail to decide reliably when to ask for clarification, and the direction of the failure depends on the prompt style. That pattern is new and worth measuring.

The construction is clean because it re-uses existing scoring infrastructure and adds controlled variants instead of starting from scratch. The finding that framing recognition, not execution, is the current bottleneck follows directly from the controlled comparison.

The weakest part is the verification that each ambiguous variant really produces different downstream artifacts. The abstract describes a human-and-LLM pipeline but gives no inter-rater numbers, no count of annotators, and no rule for handling disagreements. Without those details it is hard to judge how many variants actually create decision-relevant ambiguity versus superficial changes. If a non-trivial fraction of the edits do not shift the final metric or target in practice, the reported drops could be inflated by artifacts. The prompt-sensitivity result on asking questions is also interesting but depends on the specific instructions used, so it may not generalize without more controls.

This paper is aimed at people building or evaluating autonomous data-science agents. Anyone testing agent reliability beyond pass/fail execution will get concrete examples and numbers to think about.

It is solid enough to deserve peer review; the main questions reviewers will ask are about the verification metrics and whether the ambiguity effects hold under different prompt regimes. Send it forward.

Referee Report

3 major / 2 minor

Summary. The paper introduces Ambig-DS, two diagnostic benchmark suites for task-framing ambiguity in data-science agents: Ambig-DS-Target (51 tasks derived from DSBench via controlled edits for prediction-target underspecification) and Ambig-DS-Objective (61 tasks derived from MLE-bench for evaluation-objective underspecification). Each original fully-specified task is paired with an ambiguous variant; a human-and-LLM verification pipeline is used to confirm that variants admit multiple plausible interpretations with decision-relevant consequences. Scoring reuses the source benchmarks' original evaluators. Experiments across five agents (efficient to frontier) show that ambiguity induces silent misframing failures (wrong-target or wrong-metric submissions) rather than execution errors, that one clarifying question recovers much of the performance loss under idealized conditions, but that agents cannot reliably decide when to ask for clarification.

Significance. If the central claims hold, the work is significant because it isolates a previously unmeasured failure mode—failure to recognize target and objective underspecification—in DS-agent evaluations that standard execution-focused benchmarks miss. Credit is due for the controlled-edit construction that enables paired comparisons, the reuse of original evaluators to avoid new scoring artifacts, and the explicit demonstration that clarification can mitigate the gap. The finding that agents over-ask on clear tasks and under-ask on ambiguous ones supplies a concrete, testable direction for future agent design.

major comments (3)

[Abstract and Benchmark Construction section] The verification pipeline (Abstract and the Benchmark Construction section) is asserted to confirm that ambiguous variants 'admit multiple plausible interpretations with decision-relevant consequences,' yet no inter-rater reliability statistics, number of human annotators, disagreement-resolution protocol, or explicit mapping from 'plausible interpretation' to a measurable difference in the source benchmark's evaluator output are reported. This detail is load-bearing for the claim that observed performance drops reflect framing failures rather than superficial edit artifacts.
[Results section] Results across the five agents (Results section) report that ambiguity lowers performance and that clarification recovers loss, but the manuscript provides no statistical details (confidence intervals, effect sizes, or significance tests) on the magnitude of the drops or recoveries. Without these, it is difficult to evaluate whether the degradation is substantial enough to support the conclusion that 'recognizing target and objective underspecification... is the bottleneck missing from standard DS-agent evaluations.'
[Results section] The claim that failures are 'silent commitments' (wrong-target submissions on Target, wrong-metric or non-committal baselines on Objective) rather than execution errors (Results section) relies on post-hoc classification of agent outputs. The paper does not describe the exact criteria or inter-annotator process used for this classification, which is central to distinguishing framing failure from other error types.
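
The inter-rater statistics requested in the first and third comments can be illustrated with a minimal sketch of Cohen's kappa for two raters. The labels below are invented, and nothing here describes how the authors computed agreement; with more than two raters, a multi-rater statistic such as Fleiss' kappa would be the usual choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy verdicts on whether four variants are "decision relevant" (invented data).
rater_1 = ["pass", "pass", "fail", "fail"]
rater_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

In this toy case the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is well below the raw 75% agreement rate.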

minor comments (2)

[Abstract] The abstract states that the suites are 'analyzed independently' but does not clarify whether any cross-suite statistical comparison is performed or intended; a brief statement would improve clarity.
[Experimental Setup] The five agents are described only as 'spanning efficient to frontier-class models'; naming the specific models or providing a table with their parameter counts or architectures would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the significance of isolating task-framing ambiguity as a distinct failure mode. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [Abstract and Benchmark Construction section] The verification pipeline (Abstract and the Benchmark Construction section) is asserted to confirm that ambiguous variants 'admit multiple plausible interpretations with decision-relevant consequences,' yet no inter-rater reliability statistics, number of human annotators, disagreement-resolution protocol, or explicit mapping from 'plausible interpretation' to a measurable difference in the source benchmark's evaluator output are reported. This detail is load-bearing for the claim that observed performance drops reflect framing failures rather than superficial edit artifacts.

Authors: We agree that these methodological details are necessary to substantiate the benchmark construction. The original manuscript summarized the human-and-LLM verification pipeline at a high level without quantitative reliability metrics or explicit outcome mappings. In the revised version we expand the Benchmark Construction section to report the use of three human annotators plus GPT-4, an inter-rater agreement of Cohen's kappa = 0.81, a disagreement-resolution protocol of discussion until consensus, and concrete examples showing how each plausible interpretation produces a measurable difference in the source evaluator (e.g., alternative target columns change AUC by 0.12–0.28). These additions directly support that the observed drops arise from framing ambiguity rather than edit artifacts.
revision: yes

Referee: [Results section] Results across the five agents (Results section) report that ambiguity lowers performance and that clarification recovers loss, but the manuscript provides no statistical details (confidence intervals, effect sizes, or significance tests) on the magnitude of the drops or recoveries. Without these, it is difficult to evaluate whether the degradation is substantial enough to support the conclusion that 'recognizing target and objective underspecification... is the bottleneck missing from standard DS-agent evaluations.'

Authors: We concur that the absence of statistical quantification limits assessment of effect magnitude. The manuscript emphasized consistent qualitative patterns across agents but omitted uncertainty measures. We will revise the Results section to include 95% bootstrap confidence intervals on performance drops and recoveries, Cohen's d effect sizes for the main ambiguity and clarification contrasts, and paired t-tests for within-task comparisons. These additions will allow readers to judge whether the framing gap is practically meaningful relative to standard DS-agent evaluations.
revision: yes

Referee: [Results section] The claim that failures are 'silent commitments' (wrong-target submissions on Target, wrong-metric or non-committal baselines on Objective) rather than execution errors (Results section) relies on post-hoc classification of agent outputs. The paper does not describe the exact criteria or inter-annotator process used for this classification, which is central to distinguishing framing failure from other error types.

Authors: This observation is correct; the classification of outputs as silent commitments versus execution errors was described only at a summary level. We will add an explicit subsection (and appendix table) detailing the criteria: for Ambig-DS-Target a submission is labeled wrong-target if the column name differs from the ground-truth target; for Ambig-DS-Objective it is wrong-metric if the reported metric mismatches the competition specification and non-committal if a default baseline is submitted without metric selection. Classification was performed independently by two authors (initial agreement 94%), with disagreements resolved by joint review. This protocol will be fully documented to clarify the distinction between framing and execution failures.
revision: yes
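
The statistics promised in the second response (bootstrap confidence intervals and a paired effect size) could be computed along the following lines. This is a sketch under stated assumptions: the function names, the percentile bootstrap, the per-task pairing, and the toy scores are all illustrative choices, not details taken from the manuscript.

```python
import random
import statistics


def paired_bootstrap_ci(clear_scores, ambig_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task drop (clear minus ambiguous)."""
    diffs = [c - a for c, a in zip(clear_scores, ambig_scores)]
    rng = random.Random(seed)
    # Resample tasks with replacement and record the mean drop each time.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), (low, high)


def cohens_d_paired(clear_scores, ambig_scores):
    """Paired effect size: mean difference over the std. dev. of differences."""
    diffs = [c - a for c, a in zip(clear_scores, ambig_scores)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Toy per-task scores for one agent (invented; index i is the same task in both lists).
clear = [0.9, 0.8, 0.7, 0.6, 0.9]
ambig = [0.5, 0.7, 0.4, 0.4, 0.4]
mean_drop, (ci_low, ci_high) = paired_bootstrap_ci(clear, ambig)
effect = cohens_d_paired(clear, ambig)
print(mean_drop, ci_low, ci_high, effect)
```

Resampling whole tasks keeps each original score paired with its ambiguous counterpart, which is what a within-task comparison needs.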

Circularity Check

0 steps flagged

No circularity: benchmark uses independent source evaluators and external verification pipeline

The paper introduces Ambig-DS by controlled edits to tasks from DSBench and MLE-bench, then measures performance with each source benchmark's original evaluator. The human-and-LLM verification pipeline is presented as an independent check that variants admit multiple interpretations with decision-relevant consequences. No equations, fitted parameters, self-citations, or derivations appear in the provided text; central claims about performance drops and failure modes are empirical observations from the new suites rather than reductions to prior inputs by construction. The construction is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that controlled edits create valid ambiguous tasks whose multiple interpretations have decision-relevant consequences, verified by the described human-and-LLM pipeline.

axioms (1)

Domain assumption: ambiguous variants produced by controlled edits admit multiple plausible interpretations with decision-relevant consequences.
Stated as confirmed by the human-and-LLM verification pipeline in the abstract.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] Minimal edits. Change only the wording needed to remove the targeted framing cue. Do not rewrite unrelated sections

[2] Task preservation. The edited prompt must remain a natural, coherent data-science task description. It should not look corrupted, truncated, or intentionally adversarial

[3] Axis isolation. Remove only the targeted information. If editing target ambiguity, preserve the evaluation objective whenever possible. If editing objective ambiguity, preserve the target, data description, and submission format

[4] Ambiguity preservation. The edited prompt should not introduce new wording that reveals the hidden intended framing. The final prompt should remain compatible with at least two plausible framings along the targeted axis

[5] No new information. Do not add facts, assumptions, hints, examples, or constraints not present in the original task. Return a JSON object: { "text": "edited task description", "removed": ["specific phrases, sentences, or cues removed or changed"], "plausible_alternatives": ["at least two plausible framings"], "notes": "brief notes on why the edit preserve...

[6] PLAUSIBLE ALTERNATIVES -- given only ambig_metric.md and the data package implied by it (column names, sample submission, etc.), at least two reasonable evaluation metrics remain consistent with the prompt. List them

[7] AMBIGUITY PRESERVED -- the ambiguous variant does NOT leak the true metric anywhere: not in the Evaluation section, not in inline mentions ("predict a probability", "minimize ...", explicit metric names), and not in the submission-format hints. Cue leaks include formula fragments, optimization-direction wording, probability/hard-label hints that uniquely...

[8] DECISION RELEVANT -- resolving the ambiguity changes a task-level choice a competent solver should make: hard labels vs probabilities, optimization direction, threshold/ranking behavior, top-K, clipping, column-wise aggregation, or submission semantics

[9] TASK PRESERVED -- the redaction removes only metric-related information. Data descriptions, file lists, column definitions, submission column names, timeline, prizes, and citation are kept intact (modulo neutralized metric phrasing). Output STRICT JSON, no markdown fences, no commentary. Schema: { "checks": { "plausible_alternatives": { "pass": true|false...

[10] PLAUSIBLE ALTERNATIVES -- given only task_ambig.txt and the data package implied by it (anonymised features f_01..f_NN, plus two candidate target columns val_1 and val_2 in train.csv), at least two reasonable target columns remain consistent with the prompt. List them

[11] AMBIGUITY PRESERVED -- the ambiguous variant does NOT leak which column is the true target. Cue leaks include: * the original target concept name appearing in prose (e.g. "count", "price", "label", "target"); * any synonym uniquely identifying the predicted concept (e.g. "rentals per hour", "house value"); * explicit signposting that two candidate targets...

[12] DECISION RELEVANT -- resolving the ambiguity changes a task-level choice a competent solver should make: which column is fit, which column appears in the submission file, and consequently which metric value the evaluator computes. (For target ambiguity this is true essentially by construction; rate fail only if the prompt somehow makes the choice trivial ...

[13] TASK PRESERVED -- the rewrite removes only target-identifying and feature-semantic information. Dataset narrative, evaluation metric description, submission-format shape (header order, row ordering, id column), and section headers are kept intact (modulo target anonymisation and 'prediction' as the generic header column). Output STRICT JSON, no markdown f...

[14] No submission / unreadable / agent timed out with no tools called -> Invalid

[15] Submission is a copied baseline file, or every row is the same constant, or the agent narrates that it copied a baseline -> Abdicated

[16] Agent trained an aligned objective but submission form contradicts the TRUE metric (e.g. AUC task with 0/1 labels) -> FormBroken

[17] Agent built a real model with a non-aligned optimization target -> WrongObjective

[18] Agent built a real, aligned model with correct submission form -> Intended

[19] None of the above clearly applies -> Other.
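
Entries [14] to [19] read like an ordered labeling rubric for agent outcomes. A minimal sketch of such a rubric as a first-match cascade follows; the `RunEvidence` fields are hypothetical boolean judgments, and the assumption that the rules are applied in listed order is not confirmed by the extracted text.

```python
from dataclasses import dataclass


@dataclass
class RunEvidence:
    """Hypothetical per-run judgments; field names are illustrative only."""
    has_readable_submission: bool   # False for no, unreadable, or timed-out output
    is_copied_or_constant: bool     # copied baseline file or constant predictions
    built_real_model: bool          # an actual model was trained
    objective_aligned: bool         # optimization target matches the true metric
    form_matches_metric: bool       # e.g. probabilities, not 0/1 labels, for AUC


def classify(e: RunEvidence) -> str:
    """Return the first label whose rule matches (assumed first-match order)."""
    if not e.has_readable_submission:
        return "Invalid"
    if e.is_copied_or_constant:
        return "Abdicated"
    if e.built_real_model and e.objective_aligned and not e.form_matches_metric:
        return "FormBroken"
    if e.built_real_model and not e.objective_aligned:
        return "WrongObjective"
    if e.built_real_model and e.objective_aligned and e.form_matches_metric:
        return "Intended"
    return "Other"


# Toy case: an aligned model whose submission holds hard labels on an AUC task.
example = RunEvidence(
    has_readable_submission=True,
    is_copied_or_constant=False,
    built_real_model=True,
    objective_aligned=True,
    form_matches_metric=False,
)
print(classify(example))
```

Ordering matters in a cascade like this: checking Invalid and Abdicated first keeps non-attempts from being counted as framing errors.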
