How Open Must Language Models be to Enable Reliable Scientific Inference?

Benjamin K. Bergen; Cameron R. Jones; Catherine Arnett; James A. Michaelov; Micah Altman; Pamela D. Rivière; Roger P. Levy; Samuel M. Taylor; Sean Trott; Tyler A. Chang

arxiv: 2603.26539 · v2 · pith:JW6CQALKnew · submitted 2026-03-27 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 09:28 UTC · model grok-4.3

keywords: language models, scientific inference, model openness, closed models, reliable inference, AI in science, transparency, reproducibility

The pith

Restrictions on information about closed language models threaten reliable scientific inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes how the degree of openness in language models affects the trustworthiness of scientific conclusions drawn from research that uses them. It focuses on how limited details about model construction and deployment create risks to valid inference in most scientific applications. The authors conclude that closed models are generally unsuitable for science, though exceptions exist, and they outline mitigation approaches. A sympathetic reader would care because flawed inferences from opaque models could propagate errors across fields that increasingly rely on these tools for analysis and discovery. The work urges explicit justification of model choices and systematic checks for inference threats in any study involving them.

Core claim

The paper claims that restrictions on information about model construction and deployment constitute threats to reliable inference, making current closed models generally ill-suited for scientific purposes with some notable exceptions. It discusses ways these issues can be resolved or mitigated and recommends that researchers using models in research systematically identify potential threats to inference along with the steps taken to address them, while also providing specific justifications for their model selection.

What carries the argument

Analysis of threats to reliable inference arising from restrictions on information about model construction and deployment.

If this is right

Researchers must identify threats to inference and mitigation steps whenever using language models in scientific work.
Papers should include explicit justifications for choosing one model over others.
Mitigation strategies can address some reliability problems even with closed models.
Open models reduce inference threats and may be preferable for many scientific uses.
Exceptions allow certain closed models to support reliable inference under specific conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption of these identification and justification practices could raise standards for reproducibility in AI-supported research.
The argument suggests a testable prediction: studies using open models should show higher rates of successful replication than matched studies using closed models.
The same transparency concerns likely extend to other AI systems used in scientific pipelines beyond language models.
Funding and publishing policies might shift to require openness disclosures as a condition for using models in submitted work.
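
The replication prediction above could be checked with a simple comparison of two proportions. The following is a minimal sketch, not a method from the paper: it assumes a one-sided two-proportion z-test is an acceptable analysis, that the studies have already been matched, and that each study has a binary replicated/not-replicated outcome. The function name `two_proportion_z_test` and the counts in the usage section are hypothetical placeholders, not data.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided z-test of whether group A's success rate exceeds group B's.

    Returns (rate difference, z statistic, one-sided p-value).
    Uses the pooled-proportion standard error under the null of equal rates.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one study")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_a - p_b) / se
    # Upper-tail probability of a standard normal: P(Z >= z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_a - p_b, z, p_value


if __name__ == "__main__":
    # Hypothetical placeholder counts for illustration only (not real data):
    # replicated studies / total studies, open-model vs. closed-model groups.
    diff, z, p = two_proportion_z_test(successes_a=8, n_a=10,
                                       successes_b=2, n_b=10)
    print(f"difference={diff:.2f}, z={z:.2f}, one-sided p={p:.4f}")
```

A normal-approximation test like this is only reasonable with enough studies per group; with small samples an exact test would be the more cautious choice, and matching quality matters more than the test itself.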

Load-bearing premise

Restrictions on information about model construction and deployment are the primary and sufficiently severe threats to reliable inference when these models are used in scientific research.

What would settle it

An empirical demonstration that scientific inferences drawn from a closed model achieve the same reliability as those from an equivalent open model, even when no details of construction or deployment are available to the researchers.

Original abstract

How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Closed models create real inference risks in science, but the paper gives a usable taxonomy and mitigations without demanding total openness.

The main takeaway is that restrictions on model details threaten reproducibility, mechanistic insight, and bias checks when language models are used for research, and that most current closed models are poorly suited as a result. The authors lay this out with a clear distinction between types of openness and their epistemic effects, then show that some risks can be addressed without full source access. They also flag exceptions where closed models can still work if other documentation is supplied. This is the part worth paying attention to: the argument stays practical rather than absolutist.

What the paper does well is organize the threats into concrete categories and pair each with targeted mitigations. The recommendation to document model choices and threat-mitigation steps explicitly is straightforward and actionable. The stress-test note is right that the structure avoids internal contradiction and treats full openness as one option among others. The taxonomy and examples appear to give the claims some grounding beyond abstract principles.

A minor soft spot is that the case still rests more on logical enumeration than on documented instances where closed-model opacity actually produced wrong scientific conclusions. If the full text has only a handful of such cases, the central claim will feel stronger in principle than in demonstrated impact.

The paper is aimed at researchers who already use language models in empirical work and want to think through reproducibility standards. Readers working in cognitive science, linguistics, or any field that treats model outputs as evidence will get the most from it. It is not a new empirical result or formal derivation, but the framing is timely and the recommendations are concrete enough to be worth referee attention. I would send it to peer review.

Referee Report

0 major / 3 minor

Summary. The paper analyzes how restrictions on information about language model construction and deployment threaten reliable scientific inference. It distinguishes types of openness and their epistemic consequences, enumerates specific risks including reproducibility, mechanistic understanding, and bias auditing, argues that current closed models are generally ill-suited for scientific purposes with some notable exceptions, outlines mitigations that do not always require full openness, and recommends systematic threat identification plus explicit justifications for model selection in research.

Significance. If the analysis holds, the work supplies a practical framework for assessing epistemic risks when using language models in science. By linking specific openness dimensions to concrete inference threats and acknowledging workable mitigations short of complete openness, it offers actionable guidance that could improve transparency and credibility in NLP and broader AI-assisted research. The structured taxonomy and emphasis on documented threat-mitigation pairs are strengths that distinguish it from purely normative calls for openness.

minor comments (3)

The abstract states the central argument clearly but does not preview the taxonomy of openness or the specific inference risks that structure the body; adding one sentence would improve reader orientation.
In the section enumerating mitigations, the mapping from each risk (reproducibility, mechanistic understanding, bias auditing) to the proposed partial mitigations could be presented in a table for easier reference and to make the claim that full openness is not always required more transparent.
The discussion of 'notable exceptions' would benefit from one or two concrete published examples (with citations) where closed models were used successfully in scientific work after documented mitigations; this would ground the qualification and reduce the risk of overgeneralization.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and constructive review, which accurately summarizes the paper's analysis of openness dimensions, epistemic risks to scientific inference, and practical mitigations. The recommendation for minor revision is appreciated. As the report lists no specific major comments under the MAJOR COMMENTS section, we have no individual points requiring detailed rebuttal or disagreement. We will perform a minor revision to enhance clarity and address any editorial suggestions.

Circularity Check

0 steps flagged

No significant circularity; argument from general scientific principles

The paper develops a taxonomy of openness levels and enumerates specific inference risks (reproducibility, mechanistic understanding, bias auditing) along with mitigations that do not require full openness. The central claim that information restrictions threaten reliable inference is advanced through logical analysis of epistemic consequences rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce the result to the paper's own inputs. This matches the reader's assessment of a low (1.0) circularity score and the skeptic's finding of an independent analytical framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a discussion of scientific practice and does not introduce technical derivations, so it contains no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5659 in / 1025 out tokens · 35246 ms · 2026-05-21T09:28:28.485592+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Paper passage: "We identify inferential threats associated with using such closed proprietary models, considering in particular the degree to which they limit the reliability of any evaluation, comparison, and interpretability research"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.