Contemporary AI lacks the imagination to diverge or negate in science

Honglin Bao; James A. Evans; Shiyun Cao; Sida Li; Siyang Wu; Xiao Liu

Large language models fail to spontaneously propose null hypotheses when generating scientific ideas.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-27 19:05 UTC pith:O24ZW2LV

Load-bearing objection: Large rating dataset and post-trained reward model are the real assets; the null-hypothesis contrast lacks a matched human generation arm.

arxiv 2606.08251 v2 pith:O24ZW2LV submitted 2026-06-06 cs.CY cs.AI

Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans

keywords: large language models, scientific discovery, null hypotheses, hypothesis generation, expert evaluation, AI collaboration, idea novelty

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the largest evaluation to date in which working scientists judge ideas generated by LLMs from the context of their own recent preprints. It finds that models either collapse to similar ideas or explore wider spaces without ever suggesting null hypotheses, a move human scientists make more freely. Scientists consistently favor probable ideas over novel ones, rate LLM outputs more harshly in pluralistic fields, and show only weak agreement with automated evaluators. A reward model trained on the collected ratings narrows the gap to human inter-rater consistency.

Core claim

In 25,139 sets of ratings returned by 6,749 scientists, out of the authors of 121,640 preprints who were invited, no model class proposes null hypotheses on its own. Non-reasoning models produce narrow clusters of similar ideas while reasoning models range more widely, yet both avoid negation. Scientists reward resemblance to their own work and probability of being true over novelty, with social scientists showing greater risk tolerance; automated judges align only weakly with these expert assessments.

What carries the argument

Spontaneous proposal of null hypotheses as a marker of the ability to diverge or negate within a hypothesis space.

Load-bearing premise

The scientists who responded and the preprints they supplied form a representative sample of scientific reasoning without systematic selection bias.

What would settle it

A new model class that, in a blinded replication using fresh preprints, proposes null hypotheses at rates comparable to the human authors would falsify the central claim.

If this is right

LLM outputs and judgments in science require ongoing human grounding to compensate for limited imaginative divergence.
Post-training a reward model on human ratings improves capture of field-specific tastes by up to 27 percent over prior automated evaluators.
Social scientists tolerate riskier ideas more than life scientists, and senior social scientists apply the strictest standards.
Retrieval augmentation and persona prompting produce only marginal gains in alignment with expert judgment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit training for contradiction or falsification may be needed before models can reliably explore negation in hypothesis generation.
The performance gap in pluralistic fields points to a broader limit on AI handling interpretive or theory-evolving domains.
Reward models tuned to human ratings could serve as scalable proxies for expert review in early-stage idea filtering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Large rating dataset and post-trained reward model are the real assets; the null-hypothesis contrast lacks a matched human generation arm.

The paper's clearest value is the scale of the rating exercise: 25k judgments from nearly 7k scientists on ideas pulled from their own preprints. That volume lets them show field differences in risk tolerance and that senior social scientists are stricter raters. The post-trained Qwen reward model is the other concrete output; it beats the SOTA baselines they tested by up to 27% and moves closer to the consistency of independent human reviewers.

The headline claim that no model class spontaneously offers null hypotheses while humans do so more freely is harder to pin down. The design only collects ratings on LLM outputs. There is no parallel condition in which the same scientists generate ideas from identical contexts and prompts, so the difference cannot be cleanly attributed to model architecture or training rather than prompt wording or post-generation filtering.

Response rate sits around 5-6%, which raises the usual selection questions even if the authors acknowledge it. Prompt details and any filtering steps also matter a lot for claims about what models "spontaneously" do, and those are not fully visible from the abstract.
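
The response-rate figure can be checked directly from the counts in the abstract. The following is a minimal arithmetic sketch, assuming one invitation per preprint; that is an assumption, since the abstract reports the number of preprints rather than the number of invited authors, so the true per-author rate may differ. The constant and function names are illustrative.

```python
# Counts as reported in the abstract.
PREPRINTS_INVITED = 121_640  # preprints whose authors were invited
RESPONDENTS = 6_749          # scientists who returned ratings
RATING_SETS = 25_139         # sets of ratings returned


def response_rate(respondents: int, invited: int) -> float:
    """Return the fraction of invited units that responded."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return respondents / invited


# Assumption: one invitation per preprint (not confirmed by the abstract).
rate = response_rate(RESPONDENTS, PREPRINTS_INVITED)
sets_per_respondent = RATING_SETS / RESPONDENTS

print(f"response rate: {rate:.2%}")
print(f"rating sets per respondent: {sets_per_respondent:.2f}")
```

Under that assumption the rate falls inside the 5-6% band quoted above, and each respondent returned between three and four rating sets on average.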

The work is aimed at groups building scientific reward models or automated hypothesis generators. Readers who need large human preference data on research ideas will find the dataset useful even if they disagree with some interpretations.

It deserves peer review. The empirical volume and the reward-model result are substantial enough to justify referee attention, provided the baseline issue is addressed in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from inviting authors of 121,640 recent preprints to rate LLM-generated ideas derived from their own papers, yielding 6,749 respondents and 25,139 rating sets. It identifies three patterns: non-reasoning LLMs produce narrow idea sets while reasoning models explore more broadly but none spontaneously generate null hypotheses (unlike humans); scientists favor ideas resembling their own and prioritize probability over novelty, with field and seniority differences; and automated evaluators (including LLM-as-judge) show weak agreement with experts, though a post-trained Qwen3-14B reward model improves alignment with human ratings by up to 27%.
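
The "narrow idea sets" pattern is reported in Figure 2a as pairwise cosine similarity between hypotheses generated for the same paper. The sketch below shows one way such a statistic could be computed; it is not the authors' code. The embedding vectors are made-up toy values, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumed, not from the paper) of hypotheses for one preprint.
clustered = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]  # near-duplicate ideas
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # ideas pointing apart

print(mean_pairwise_cosine(clustered))
print(mean_pairwise_cosine(spread))
```

A group whose hypotheses collapse toward one another yields a mean close to 1, while a group that ranges more widely yields a lower mean.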

Significance. If the empirical patterns hold after addressing baseline issues, the work supplies the largest scientist-in-the-loop dataset to date on AI ideation limitations in science, particularly the absence of spontaneous negation, and demonstrates a practical path for training reward models that better capture expert preferences across fields. This supplies falsifiable, quantitative evidence against claims of imminent AI-driven discovery acceleration.

major comments (2)

[Abstract] The central claim that 'no model class spontaneously proposes null hypotheses -- a move humans make more freely' is unsupported by any matched human generation arm; the design collects only ratings of LLM outputs and provides no parallel condition in which the same scientists (or matched authors) generate ideas from identical paper contexts and prompts, preventing attribution of the differential to model architecture rather than prompt framing or training objectives.
[Abstract and implied Methods] The three reported patterns rest on unexamined selection and prompting assumptions, with no reported details on idea-generation prompts, response-rate bias controls, inter-rater reliability calculations, or statistical adjustments for field and seniority; these omissions are load-bearing because the patterns (hivemind collapse, field differences in risk tolerance, and evaluator disagreement) cannot be interpreted without them.

minor comments (2)

[Abstract] Abstract: the response rate (6,749/121,640) and exact sample composition by field should be stated explicitly to allow readers to assess representativeness.
The description of the post-trained reward model would benefit from a brief statement of the training objective and loss function used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments. We agree that the abstract claim regarding null hypotheses requires qualification given the study design, and that additional methodological details are needed for full interpretability. We outline revisions below.

Referee: [Abstract] The central claim that 'no model class spontaneously proposes null hypotheses -- a move humans make more freely' is unsupported by any matched human generation arm; the design collects only ratings of LLM outputs and provides no parallel condition in which the same scientists (or matched authors) generate ideas from identical paper contexts and prompts, preventing attribution of the differential to model architecture rather than prompt framing or training objectives.

Authors: We acknowledge the limitation: our data consist solely of ratings on LLM-generated ideas and contain no matched human generation condition using identical prompts and paper contexts. The phrasing 'a move humans make more freely' therefore cannot be directly attributed to this experiment. We will revise the abstract and discussion to state that no LLM condition in the study produced null hypotheses, while noting that the contrast with human scientific practice draws from established literature on hypothesis generation rather than a within-study comparison. This removes the unsupported attribution.

Revision: yes

Referee: [Abstract and implied Methods] The three reported patterns rest on unexamined selection and prompting assumptions, with no reported details on idea-generation prompts, response-rate bias controls, inter-rater reliability calculations, or statistical adjustments for field and seniority; these omissions are load-bearing because the patterns (hivemind collapse, field differences in risk tolerance, and evaluator disagreement) cannot be interpreted without them.

Authors: We will expand the Methods section with the requested details: verbatim idea-generation prompts, response-rate calculations and any non-response bias checks performed, inter-rater reliability metrics (e.g., agreement coefficients across the 25,139 rating sets), and the statistical models (including covariates for field and seniority) used to support the reported patterns. These additions will make the three patterns fully interpretable without altering the core findings.

Revision: yes

Circularity Check

0 steps flagged

Purely empirical rating study with no derivations or self-referential predictions

The paper reports results from inviting the authors of 121,640 preprints to rate LLM-generated ideas drawn from their own papers; 6,749 scientists returned 25,139 sets of ratings on novelty, feasibility, truth probability, and adoption favorability. No equations, fitted parameters, or first-principles derivations appear; the three patterns are direct summaries of human ratings. The post-trained Qwen3-14B reward model is trained on the collected ratings and evaluated against held-out human judgments, which is standard supervised learning rather than a circular prediction. No self-citation chains or uniqueness theorems are invoked to justify core claims. The study is therefore self-contained against external human benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard assumptions of survey methodology and statistical aggregation of ratings; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (2)

Domain assumption: Scientist ratings constitute valid ground truth for novelty, feasibility, and adoption potential.
Invoked when the paper treats the 25,139 ratings as the benchmark against which LLMs and automated judges are evaluated.
Domain assumption: The invited preprint authors form an unbiased sample of working scientists in the four fields.
Required for generalizing the three patterns beyond the 6,749 respondents.

pith-pipeline@v0.9.1-grok · 5877 in / 1298 out tokens · 21435 ms · 2026-06-27T19:05:19.911656+00:00 · methodology

Original abstract

Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.

Figures

Figures reproduced from arXiv: 2606.08251 by Honglin Bao, James A. Evans, Shiyun Cao, Sida Li, Siyang Wu, Xiao Liu.

**Figure 1.** An expert-audit pipeline for AI-generated research ideas. Full-text preprints (n = 121,640) from six non-arXiv platforms feed an extraction stage that recovers (i) the author’s hypotheses, (ii) the surrounding factual context, and (iii) the core scientific puzzle, with paraphrase-based leakage detection between (i), (ii), and (iii). LLMs propose hypotheses from the context-and-puzzle alone; a custom set of…

**Figure 2.** Reasoning broadens the hypothesis space; null reasoning rarely fills it. a, Pairwise cosine similarity of hypotheses generated for the same paper, by group. Non-reasoning LLMs are more similar to each other (the “artificial hivemind”, p<0.001); reasoning models diverge from non-reasoning models, humans, and each other. b, geometrically, we treat each hypothesis as a displacement from a common context-and-p…

**Figure 3.** Scientists discount novelty, prefer ideas resembling their own, and split by field and seniority. Marginal predictions from the Mundlak adoption model are shown, controlling for rated quality. a, within-scientist similarity to the author’s own ideas is the strongest single driver of adoption. b, status, represented by within-field citation/publication percentile, lowers adoption. c, seniority, represented…

**Figure 4.** Automated evaluators do not yet measure scientific quality. a, Pearson correlation between LLM-judge ratings and human ratings, by dimension and judge, with and without injected scientist persona. The retrieval-augmented Deep Research judge is the best. Persona injection slightly helps. Yet no setting exceeds r = 0.35. b, Calibration of LLM judges against human ratings. If LLM ratings track humans, points …
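
Figure 4a summarizes judge-human agreement with the Pearson correlation coefficient. As a reminder of what that number measures, here is a minimal sketch of the statistic computed on paired ratings; the rating values are made-up toy data, and the function name is hypothetical rather than anything taken from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length rating sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


# Toy paired ratings (assumed): one human score and one LLM-judge score per idea.
human_ratings = [1, 2, 3, 4, 5, 2, 4]
judge_ratings = [3, 3, 4, 3, 4, 4, 3]

r = pearson_r(human_ratings, judge_ratings)
print(f"judge-human Pearson r = {r:.2f}")
```

A value near 1 would mean the judge's scores rise and fall with the human's; the caption's ceiling of r = 0.35 indicates that the judges track expert ratings only loosely.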

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Fixed Representations: The Vocabulary and Verifier Gaps in Open-Ended AI
cs.AI · 2026-07 · conditional · novelty 6.0

Open-ended AI is blocked by a vocabulary gap (inventing reusable primitives) and a verifier gap (valuing them when payoff is delayed), unified under cognitive discrepancy reduction and a four-level autonomy ladder.
