Sanity Checks for Agentic Data Science
Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3
The pith
Lightweight sanity checks can identify when agentic data science conclusions lack stable signal support.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a pair of perturbation-based sanity checks grounded in the Predictability-Computability-Stability framework can serve as a falsifiability constraint for outputs from agentic data science pipelines. By applying controlled perturbations, the checks determine whether an agent reliably separates signal from noise across variations, thereby assessing the trustworthiness of any affirmative conclusion. On synthetic data the checks align with controlled signal-to-noise ratios. When demonstrated on eleven real-world datasets using OpenAI Codex, the checks indicate that affirmative conclusions in six datasets lack sufficient support even though a single run may appear valid.
What carries the argument
Perturbation-based sanity checks that act as a falsifiability test for whether an agentic data science conclusion reflects stable signal rather than noise or incidental features.
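A minimal sketch of what one perturbation-based check could look like, assuming subsampling as the perturbation. The paper's actual check definitions are not given here, so everything below is illustrative: `toy_agent` is a hypothetical stand-in for an ADS run (a real run would call a system such as OpenAI Codex), and `subsample_agreement`, the 0.2 threshold, and the 80% subsample fraction are assumed names and values.

```python
import numpy as np

def toy_agent(x, y, threshold=0.2):
    """Hypothetical stand-in for an ADS run: answers 'yes' when |corr(x, y)| exceeds a threshold."""
    r = np.corrcoef(x, y)[0, 1]
    return abs(r) > threshold

def subsample_agreement(agent, x, y, n_reps=50, frac=0.8, seed=0):
    """Fraction of random subsamples on which the agent returns an affirmative conclusion."""
    rng = np.random.default_rng(seed)
    n = len(x)
    k = int(frac * n)
    votes = []
    for _ in range(n_reps):
        idx = rng.choice(n, size=k, replace=False)  # subsample rows without replacement
        votes.append(bool(agent(x[idx], y[idx])))
    return float(np.mean(votes))

# Illustrative simulated data: one outcome with real signal, one that is pure noise.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y_signal = 2.0 * x + rng.normal(size=500)
y_noise = rng.normal(size=500)

rate_signal = subsample_agreement(toy_agent, x, y_signal)
rate_noise = subsample_agreement(toy_agent, x, y_noise)
print(f"affirmative rate with signal: {rate_signal:.2f}, without signal: {rate_noise:.2f}")
```

An affirmative rate near 1 across subsamples suggests the conclusion does not hinge on which rows happened to be included, while a rate that swings between runs suggests it does.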
If this is right
- Users can apply the checks to screen ADS outputs for trustworthiness before relying on them.
- Single ADS runs frequently produce affirmative conclusions that the checks show lack stable support.
- ADS self-reported confidence does not reliably indicate the empirical stability of its conclusions.
- The checks can categorize outputs as based on stable signal, responsive to noise, or sensitive to input details.
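One hypothetical way to turn two check outcomes into the three categories named in the last point is sketched below. It assumes that one check perturbs the data in a way that should preserve any real signal and the other in a way that should remove it. The function name `categorize`, the two rate arguments, and the cutoffs `hi` and `lo` are assumptions for illustration, not the paper's decision rule.

```python
def categorize(preserve_rate, destroy_rate, hi=0.8, lo=0.2):
    """Map two affirmative rates to a coarse trust label.

    preserve_rate: share of 'yes' answers under perturbations that should keep real signal.
    destroy_rate:  share of 'yes' answers under perturbations that should remove signal.
    The cutoffs hi and lo are illustrative assumptions.
    """
    if destroy_rate > lo:
        # The agent still says 'yes' after the signal was removed.
        return "responding to noise"
    if preserve_rate >= hi:
        # Affirmative answers persist only where signal should persist.
        return "stable signal"
    # Answers flip under perturbations that should not matter.
    return "sensitive to input details"

print(categorize(0.95, 0.05))
print(categorize(0.90, 0.60))
print(categorize(0.50, 0.00))
```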
Where Pith is reading between the lines
- Embedding the checks directly into ADS interfaces could reduce overreliance on unverified single runs.
- The same perturbation approach could extend to other AI-driven quantitative tasks beyond data science.
- Poor confidence calibration points to a need for built-in stability measures in future ADS systems.
Load-bearing premise
The chosen perturbations are sufficient and reasonable to expose a lack of signal without introducing new artifacts, and the agent's responses to perturbed inputs accurately reflect the presence or absence of stable signal in the original data.
What would settle it
Applying the checks to synthetic datasets with independently known signal strengths and finding that the checks do not correctly classify the presence or absence of signal according to ground truth.
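The test described above can be sketched on simulated data, under stated assumptions. Here the signal-to-noise ratio is defined as Var(signal)/Var(noise), the perturbation is Gaussian noise added to the outcome, and `toy_agent` again stands in for an ADS run. The names `make_data` and `noise_injection_rate`, the SNR grid, and the noise scale are hypothetical choices, not the paper's protocol.

```python
import numpy as np

def make_data(snr, n=400, seed=0):
    """Simulate y = sqrt(snr) * x + unit-variance noise, so Var(signal) / Var(noise) = snr."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = np.sqrt(snr) * x + rng.normal(size=n)
    return x, y

def toy_agent(x, y, threshold=0.2):
    """Hypothetical stand-in for an ADS run: 'yes' when |corr(x, y)| exceeds a threshold."""
    return abs(np.corrcoef(x, y)[0, 1]) > threshold

def noise_injection_rate(agent, x, y, n_reps=50, scale=0.5, seed=1):
    """Affirmative rate when Gaussian noise (scale * std of y) is added to the outcome."""
    rng = np.random.default_rng(seed)
    sd = np.std(y)
    votes = [bool(agent(x, y + rng.normal(scale=scale * sd, size=len(y))))
             for _ in range(n_reps)]
    return float(np.mean(votes))

# Ground truth is known by construction: SNR = 0 means no signal at all.
rates = {snr: noise_injection_rate(toy_agent, *make_data(snr))
         for snr in (0.0, 0.25, 1.0, 4.0)}
for snr, rate in rates.items():
    print(f"SNR={snr:<4} affirmative rate under noise injection = {rate:.2f}")
```

If the affirmative rate failed to rise with the known SNR, or stayed high at SNR = 0, that would count against the checks in the sense described above.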
Original abstract
Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework to evaluate trustworthiness of outputs from agentic data science (ADS) pipelines such as OpenAI Codex. The checks apply perturbations (noise injection, subsampling) to test whether an agent reliably distinguishes signal from noise, acting as a falsifiability screen for unsupported affirmative conclusions. Validation uses synthetic data with controlled SNR to confirm the checks track ground-truth signal strength. The checks are then demonstrated on 11 real-world datasets, finding that affirmative conclusions lack support in 6 cases despite single ADS runs appearing supportive, and that ADS self-reported confidence is poorly calibrated to empirical stability.
Significance. If the checks are shown to be unconfounded, the work provides a practical, PCS-grounded tool for detecting overconfident ADS conclusions in a rapidly adopted domain. The synthetic validation with known SNR is a clear strength, offering direct empirical grounding that the checks correlate with signal presence. The 11-dataset demonstration highlights real risks of unsupported claims in current systems, with broader implications for veridical data science. Significance is tempered by the need to confirm that perturbations isolate data properties rather than LLM-specific behaviors.
major comments (1)
- [Real-world experiments and results] The central claim that 6 of 11 datasets have unsupported affirmative conclusions rests on the perturbations (noise injection, subsampling) isolating stable signal in the data rather than LLM output variability. Synthetic validation confirms tracking of ground-truth SNR, but does not establish that the same perturbations remain unconfounded on real datasets with OpenAI Codex, where output changes could arise from prompt sensitivity or tokenization artifacts. This directly affects the trustworthiness characterization and the 6/11 finding.
minor comments (2)
- [Methods] The two sanity checks are introduced at a conceptual level in the abstract and introduction; explicit pseudocode, algorithmic steps, or mathematical formalization (e.g., stability metric definitions) in the methods would aid reproducibility.
- [Failure modes analysis] The failure modes analysis and calibration discussion would benefit from additional quantitative metrics or statistical comparisons to strengthen the claims about poor self-confidence calibration.
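As an illustration of the kind of quantitative metric the second minor comment asks for, the sketch below compares self-reported confidence with empirical stability per dataset. The function name `calibration_summary` and both summary statistics are assumptions rather than anything the manuscript reports, and the input values are placeholders, not results from the paper.

```python
import numpy as np

def calibration_summary(confidence, stability):
    """Compare self-reported confidence with empirical stability, both on a 0-1 scale."""
    confidence = np.asarray(confidence, dtype=float)
    stability = np.asarray(stability, dtype=float)
    if confidence.shape != stability.shape:
        raise ValueError("confidence and stability must have the same shape")
    diff = confidence - stability
    return {
        "mean_abs_gap": float(np.mean(np.abs(diff))),   # size of the mismatch
        "mean_overconfidence": float(np.mean(diff)),    # positive means overconfident on average
    }

# Placeholder inputs for illustration only; these are NOT results from the paper.
summary = calibration_summary([0.9, 0.9, 0.9, 0.9], [1.0, 0.5, 0.9, 0.2])
print(summary)
```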
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications on our experimental design and outlining planned revisions.
- Referee: The central claim that 6 of 11 datasets have unsupported affirmative conclusions rests on the perturbations (noise injection, subsampling) isolating stable signal in the data rather than LLM output variability. Synthetic validation confirms tracking of ground-truth SNR, but does not establish that the same perturbations remain unconfounded on real datasets with OpenAI Codex, where output changes could arise from prompt sensitivity or tokenization artifacts. This directly affects the trustworthiness characterization and the 6/11 finding.
Authors: We agree that demonstrating the perturbations primarily isolate data signal properties (rather than LLM-specific variability) is essential to support the 6/11 finding. Our synthetic validation applies the identical ADS pipeline, including OpenAI Codex, to datasets with controlled SNR; the sanity checks' stability metrics track ground-truth signal strength, indicating the perturbations affect conclusions in a signal-dependent manner. On the real datasets, the same data perturbations and consistent prompt templates are used, and we observe differential stability across datasets (stable in 5 cases, unstable in 6), which would be unlikely if LLM output variability dominated uniformly. That said, we acknowledge the synthetic results do not fully rule out confounds such as prompt sensitivity or tokenization changes on real data. To strengthen this, we will add a new analysis measuring baseline ADS variability by repeating runs on the original unperturbed data for each of the 11 datasets. This will be incorporated into the revised manuscript, along with expanded discussion of potential LLM artifacts in the limitations section.
Revision: yes
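The baseline-variability comparison the authors propose could be sketched as follows, assuming a stochastic agent whose answers vary even on identical input. `noisy_agent` is a hypothetical stand-in in which random jitter plays the role of LLM output variability. `minority_share` and `excess_instability` are illustrative names, and subsampling is used as the example perturbation.

```python
import numpy as np

def noisy_agent(x, y, rng, threshold=0.2, jitter=0.05):
    """Hypothetical stochastic agent: correlation test plus random jitter (stand-in for LLM variability)."""
    r = abs(np.corrcoef(x, y)[0, 1])
    return bool(r + rng.normal(scale=jitter) > threshold)

def minority_share(votes):
    """Share of runs that disagree with the majority answer (0 = fully consistent)."""
    p = float(np.mean(votes))
    return min(p, 1.0 - p)

def excess_instability(x, y, n_reps=100, frac=0.8, seed=0):
    """Instability under subsampling minus instability on repeated unperturbed runs."""
    rng = np.random.default_rng(seed)
    n, k = len(x), int(frac * len(x))
    # Baseline: same data every time, so any disagreement comes from the agent alone.
    baseline = [noisy_agent(x, y, rng) for _ in range(n_reps)]
    # Perturbed: disagreement now reflects both agent variability and the data perturbation.
    perturbed = []
    for _ in range(n_reps):
        idx = rng.choice(n, size=k, replace=False)
        perturbed.append(noisy_agent(x[idx], y[idx], rng))
    base, pert = minority_share(baseline), minority_share(perturbed)
    return base, pert, pert - base

# Illustrative simulated data with strong signal.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)
base, pert, excess = excess_instability(x, y)
print(f"baseline={base:.2f} perturbed={pert:.2f} excess={excess:.2f}")
```

Under this framing, instability counts as evidence about the data only to the extent that it exceeds what repeated runs on unperturbed data already show.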
Circularity Check
PCS framework cited but new checks defined and validated independently on external synthetic ground truth
The paper defines the two sanity checks explicitly via perturbations (noise injection, subsampling) to test whether an ADS agent distinguishes signal from noise. These definitions stand alone and are validated against controlled SNR on synthetic data, which serves as independent external ground truth. The 6/11 real-data finding is obtained by direct application of the same defined checks. A citation to the PCS framework (co-authored by Bin Yu) provides background but is not invoked as a uniqueness theorem or to derive the checks themselves. No equation, prediction, or central claim reduces by construction to a fitted parameter or self-referential quantity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Predictability-Computability-Stability (PCS) framework provides a valid basis for assessing veridical data science outputs.