Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

Francis Ferraro; Manas Gaur; Seyedali Mohammadi

arxiv: 2604.18786 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 04:32 UTC · model grok-4.3

keywords: scientific feasibility · large language models · outcome evidence · experiment descriptions · robustness · context removal · diagnostic reasoning · feasibility assessment

The pith

Outcome evidence is more reliable than experiment descriptions for large language models assessing scientific feasibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models assess scientific feasibility by predicting whether a hypothesis is feasible or infeasible from the information they are given. The study tests them under four conditions: the hypothesis alone, the hypothesis plus experiment descriptions, the hypothesis plus outcomes, and the hypothesis plus both. It then removes parts of the added context to check robustness. Results show that outcomes generally raise accuracy above what the model achieves from internal knowledge, while experiment text often lowers performance when it is incomplete. This matters for using AI to evaluate research claims, because it indicates which kind of evidence supports better decisions and which introduces errors. The patterns appear across several models and two datasets.

Core claim

The paper establishes that in scientific feasibility assessment framed as a diagnostic reasoning task, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond internal knowledge alone, whereas experimental text can be brittle and may degrade performance when the context is incomplete. This holds across multiple LLMs and two datasets under controlled knowledge conditions with progressive context removal.

What carries the argument

Controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and progressive removal of experimental and/or outcome context to probe robustness in feasibility prediction.
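The setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the prompt wording, the random choice of which items to keep, and the `fraction` parameter are all assumptions made here for the example. The paper's actual prompts are only shown in abstracted form in its figures.

```python
import random

# Hypothetical labels for the four knowledge conditions named in the paper.
CONDITIONS = ("hypothesis_only", "with_experiments", "with_outcomes", "both")


def reveal(items, fraction, rng):
    """Keep int(fraction * len(items)) randomly chosen items, in original order.

    Assumption: removal is random per item; the paper may remove context differently.
    """
    k = int(fraction * len(items))
    kept = set(rng.sample(range(len(items)), k))
    return [item for i, item in enumerate(items) if i in kept]


def build_prompt(hypothesis, experiments, outcomes, condition, fraction=1.0, seed=0):
    """Assemble a feasibility prompt for one condition and one reveal level."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    parts = [f"Hypothesis: {hypothesis}"]
    if condition in ("with_experiments", "both"):
        shown = reveal(experiments, fraction, rng)
        if shown:
            parts.append("Experiments:\n" + "\n".join(f"- {e}" for e in shown))
    if condition in ("with_outcomes", "both"):
        shown = reveal(outcomes, fraction, rng)
        if shown:
            parts.append("Outcomes:\n" + "\n".join(f"- {o}" for o in shown))
    # Illustrative instruction text, not the paper's wording.
    parts.append("Answer 'feasible' or 'infeasible' and justify your decision.")
    return "\n\n".join(parts)
```

Sweeping `fraction` from 1.0 down to 0.0 for a fixed condition gives one reading of "progressive removal"; at 0.0 the prompt collapses to the hypothesis-only case, which makes that condition a natural baseline for the sweep.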

Load-bearing premise

The two datasets and the controlled knowledge conditions with progressive context removal accurately simulate real-world scientific feasibility assessment without introducing dataset-specific artifacts or model-specific quirks.

What would settle it

Finding that complete experiment descriptions consistently lead to higher accuracy than outcome evidence, or that removing context does not degrade experiment-based performance, would challenge the claim.

Figures

Figures reproduced from arXiv: 2604.18786 by Francis Ferraro, Manas Gaur, Seyedali Mohammadi.

Figure 3: Abstracted prompt for feasibility prediction

Figure 2: Abstracted prompt for extracting experiments

Figure 4: Reveal-level stability curves derived from Table

Original abstract

Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Outcomes help LLMs more than experiment descriptions for feasibility judgments, but the removal probes likely mix information type with prompt length.

The central finding is that LLMs assess scientific feasibility more reliably when given outcome evidence than when given experiment descriptions. Outcomes lift accuracy above the hypothesis-only baseline, while partial experiment text tends to hurt performance. That contrast is the main new piece here, and it comes from a controlled setup that varies knowledge conditions across hypothesis-only, experiments, outcomes, and both, then strips context progressively to check brittleness. They run this on multiple models and two datasets, which gives the comparison some breadth. The work is straightforward empirical evaluation with no circular definitions or fitted parameters, so the results are at least falsifiable in principle. The abstract and methods appear to lay out the conditions clearly enough that a reader can see what was tested.

The main soft spot is the length confound you noted. Progressive removal shortens the prompt and changes information density, and LLMs are known to be sensitive to both total tokens and position. Without explicit controls like length-matched neutral padding or truncation of other sections, the brittleness attributed to experiment text could partly reflect those surface features instead. That does not kill the result, but it does mean the claim about experiment descriptions being specifically fragile needs tighter isolation in a revision. The datasets and statistical details are not visible in the abstract, so the effect sizes and variance across models remain hard to judge from what is here.

This paper is aimed at researchers building or evaluating LLM tools for hypothesis screening and scientific reasoning support. Someone already running similar diagnostic tasks would find the outcome-versus-experiment split useful to consider, even if they end up disagreeing with how much weight to put on it. It is not reshaping core methods, but the probe is focused and the execution looks honest.

I would send it to peer review. The question is practical, the design is replicable in outline, and the length issue is fixable with targeted controls rather than a fatal flaw.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates LLMs on scientific feasibility assessment framed as a diagnostic reasoning task. Models predict feasible/infeasible for hypotheses under controlled conditions (hypothesis-only, plus experiments, plus outcomes, or both) across multiple models and two datasets. A robustness probe progressively removes portions of experimental and/or outcome context. The central claim is that outcome evidence reliably improves accuracy beyond internal knowledge, whereas experiment descriptions are brittle and can degrade performance when context is incomplete.

Significance. If the results hold after addressing potential confounds, the work usefully distinguishes the reliability of different evidence types for LLM scientific reasoning and clarifies conditions under which experimental text helps or harms feasibility judgments. Strengths include the multi-model, multi-dataset design and explicit robustness probes; these provide a concrete empirical basis for prompt-design recommendations in AI-assisted hypothesis evaluation.

major comments (2)

[Robustness probe] Robustness probe (methods/results sections): Progressive context removal alters total prompt length and token count without apparent controls such as neutral padding, truncation of other sections, or token-matched conditions. LLM performance is known to vary with sequence length and position; thus differences attributed to 'brittle' experimental text versus reliable outcomes may instead reflect these factors. This directly threatens the load-bearing claim that outcomes are generally more reliable and that experimental text specifically degrades when incomplete.
[Evaluation setup] Evaluation setup: Exact dataset names, sizes, selection criteria, full prompt templates, and statistical tests (e.g., significance of accuracy differences across conditions) are not fully detailed. Without these, it is difficult to rule out dataset-specific artifacts or prompting choices that could affect the reported superiority of outcome evidence.
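The length control raised in the first major comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `count_tokens` uses whitespace splitting as a stand-in for a model-specific tokenizer, the `FILLER` sentence is a made-up neutral string, and padding is appended at the end of the prompt, which is one of several possible placements and itself interacts with position sensitivity.

```python
# Hypothetical neutral filler; a real control would need text vetted as uninformative.
FILLER = "Note: this line is intentionally uninformative."


def count_tokens(text):
    """Assumption: whitespace tokens approximate the model's token count."""
    return len(text.split())


def pad_to_length(prompt, target_tokens, filler=FILLER):
    """Append filler words until the prompt reaches target_tokens (never truncates)."""
    deficit = target_tokens - count_tokens(prompt)
    if deficit <= 0:
        return prompt
    words = filler.split()
    padding = [words[i % len(words)] for i in range(deficit)]
    return prompt + "\n\n" + " ".join(padding)


def length_match(prompts):
    """Pad every prompt in a {condition: prompt} dict to the longest one's length."""
    target = max(count_tokens(p) for p in prompts.values())
    return {name: pad_to_length(p, target) for name, p in prompts.items()}
```

With prompts matched this way, an accuracy gap between conditions can no longer be attributed to total length alone, although where the filler sits in the prompt remains a free choice that a careful control would also vary.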

minor comments (2)

[Abstract] Abstract: The phrase 'two datasets' is repeated without naming them; adding the names would improve clarity for readers.
[Methods] Notation: The four knowledge conditions are described clearly in the abstract but would benefit from an explicit table or diagram in the methods to avoid any ambiguity when results are presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in our manuscript. We address each major comment below and outline the revisions we plan to make.

Referee: [Robustness probe] Robustness probe (methods/results sections): Progressive context removal alters total prompt length and token count without apparent controls such as neutral padding, truncation of other sections, or token-matched conditions. LLM performance is known to vary with sequence length and position; thus differences attributed to 'brittle' experimental text versus reliable outcomes may instead reflect these factors. This directly threatens the load-bearing claim that outcomes are generally more reliable and that experimental text specifically degrades when incomplete.

Authors: This is a valid concern, as LLMs can indeed be sensitive to prompt length and token position. Our original robustness probe did not include explicit controls for total sequence length. In the revised manuscript, we will add new experiments that maintain constant prompt lengths across conditions by using neutral padding tokens or by truncating non-essential parts of the hypothesis to match token counts. This will allow us to better attribute performance differences to the nature of the evidence (outcomes vs. experiments) rather than length artifacts. We expect this to confirm our central claim while addressing the potential confound.
Revision: yes

Referee: [Evaluation setup] Evaluation setup: Exact dataset names, sizes, selection criteria, full prompt templates, and statistical tests (e.g., significance of accuracy differences across conditions) are not fully detailed. Without these, it is difficult to rule out dataset-specific artifacts or prompting choices that could affect the reported superiority of outcome evidence.

Authors: We agree that additional details are essential for full reproducibility and to rule out artifacts. The revised version will specify the exact dataset names (including their arXiv or other sources if applicable), sizes, and selection criteria. We will also provide the complete prompt templates for all conditions (hypothesis-only, experiments, outcomes, both) and report statistical significance tests for the accuracy differences, such as bootstrap confidence intervals or paired statistical tests. Furthermore, we will release the full evaluation code and prompts to facilitate verification.
Revision: yes
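One way to report the accuracy differences discussed in this exchange is a paired percentile bootstrap over items. The sketch below is an assumption about what such a test could look like, not the authors' analysis: the function name, the default of 2000 resamples, and the 95% interval are choices made here, and inputs are per-item correctness flags (1 or 0) for the same items under two conditions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same items.

    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(correct_a)
    # Pairing: resample per-item differences so each draw keeps A and B aligned.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    stats = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

If the interval for, say, outcomes minus experiments excludes zero, the difference is unlikely to be resampling noise on that dataset. Resampling differences rather than each condition separately keeps the item-level pairing, which usually tightens the interval when the two conditions succeed and fail on many of the same hypotheses.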

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper conducts an empirical evaluation of LLMs on a diagnostic reasoning task for scientific feasibility, testing controlled knowledge conditions (hypothesis-only, experiments, outcomes, or both) and robustness via progressive context removal across two datasets and multiple models. No equations, parameters, or derivations are present. Claims rest on observed accuracy differences rather than any self-definitional mapping, fitted-input prediction, or load-bearing self-citation chain. The design does not reduce to its inputs by construction; results are falsifiable against external benchmarks and independent of any prior author work invoked as an unverified premise.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen datasets and controlled input conditions serve as valid proxies for scientific feasibility assessment in general.

axioms (2)

[domain assumption] The two datasets represent general scientific claims suitable for feasibility assessment. Evaluation is performed on these datasets under the stated conditions.
[domain assumption] Progressive removal of experiment or outcome context simulates realistic incomplete information scenarios. Used to test robustness of model decisions.

