pith. sign in

arxiv: 2605.13245 · v1 · pith:24LU7BDEnew · submitted 2026-05-13 · 💻 cs.AI

It's not the Language Model, it's the Tool: Deterministic Mediation for Scientific Workflows

Pith reviewed 2026-05-14 19:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords deterministic mediationscientific workflowsreproducibilitylanguage modelstyped toolsphotoluminescence analysisinstrument procedures
0
0 comments X

The pith

Language models achieve identical scientific analysis results every time by calling fixed deterministic tools rather than generating code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models produce varying outputs when asked to analyze the same scientific data repeatedly because they regenerate different procedures each time. To fix this, the authors introduce typed mediation: the model chooses and parameterizes pre-built deterministic tools that each encode one researcher's exact method for one instrument. The tool itself runs the analysis and always returns the same numerical and methodological result. This pattern was tested on photoluminescence data across multiple platforms and deployed on two real instruments handling proprietary formats. The approach keeps the model in the loop for orchestration while guaranteeing reproducibility that direct generation cannot provide.

Core claim

Typed mediation lets the language model select and parameterize deterministic tools that encode a researcher's exact procedure for a specific instrument; the tool executes the analysis and returns identical outputs on every run, unlike direct generation by commercial foundation models which vary in numerical values or methodology.

What carries the argument

Typed mediation, a pattern in which the model orchestrates pre-defined deterministic tools that each encode one exact researcher procedure for one instrument.

If this is right

  • The same input data and prompt will always produce the same output regardless of how many times the query is repeated.
  • Analysis time for supported instruments drops from weeks to minutes while preserving the researcher's exact methodology.
  • The pattern works with proprietary binary data formats and licensed software that must stay on local infrastructure.
  • Reproducibility becomes a structural property of the workflow rather than a post-hoc verification task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mediation pattern could apply to any workflow where consistency across runs is required, such as repeated measurements on shared lab equipment.
  • By fixing the tool definitions, researchers gain an auditable record of the exact procedure used without needing to inspect generated code each time.
  • Deployment on local infrastructure alongside the instrument becomes necessary for any analysis involving restricted data or software licenses.

Load-bearing premise

A researcher's exact analytical procedure for an instrument can be fully captured in a deterministic tool through structured interviews without losing necessary nuance or requiring ongoing human judgment.

What would settle it

Execute the same photoluminescence analysis prompt four or more times on the typed mediation system and check whether every run produces exactly the same numerical results and analysis steps, while direct model generation on identical prompts shows variation.

Figures

Figures reproduced from arXiv: 2605.13245 by Danae Katrisioti, Emmanuel Stratakis, Marios Adamidis, Yannis Tzitzikas.

Figure 1
Figure 1. Figure 1: Three approaches to scientific data analysis. Data collection is identical in all cases. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Typed mediation architecture. The model orchestrates tool calls through a typed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Typed mediation pipeline applied to photoluminescence spectroscopy. Left: the user [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Reproducibility evaluation across four runs per platform. (a) Intensity exponent [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

Language models can produce convincing scientific analyses, but repeated generations on the same data do not guarantee the same result. A researcher may regenerate an identical query and receive a different fit, a different peak position or a different analysis procedure, without an obvious way to decide which output to trust. We propose typed mediation, a pattern in which the model orchestrates deterministic tools rather than generating analytical code. Each tool encodes one researcher's exact procedure for one instrument, ported through structured interviews. The model selects which tool to call and with what parameters. The tool produces the result. Regeneration does not change it. We evaluate this claim by running the same photoluminescence analysis on four platforms, including three commercial foundation models, four times each with the same prompt. The typed tool produces identical results across all runs. The commercial platforms either vary in numerical output and analytical methodology across runs, or fail to produce valid results on the task. We deploy this pattern on two instruments serving users over approximately six months, with very positive user feedback. Both cases are very challenging: they involve proprietary binary formats and per-seat licensed software, which force the tool to remain on local infrastructure alongside the data and the instrument it operates. We argue that deployment topology is not just a preference, but a structural requirement of scientific tool mediation. The result is a practical pattern for deploying language models in scientific workflows where reproducibility is mandatory, reducing analysis time from weeks to minutes while guaranteeing identical outputs across runs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes 'typed mediation,' a pattern in which language models orchestrate deterministic tools that encode a researcher's exact procedure for a given instrument (captured via structured interviews) rather than generating analytical code. The central claim is that this yields identical outputs on repeated identical prompts, unlike direct use of commercial foundation models. Evidence consists of a controlled four-run photoluminescence analysis across four platforms (typed tool plus three commercial models) showing identical results for the tool and variation or invalid outputs for the others, plus six-month deployments on two instruments with proprietary binary formats and licensed software, reporting positive user feedback. The paper argues that local deployment topology is a structural requirement for such mediation.

Significance. If the determinism claim holds under the reported conditions, the work provides a practical, deployable pattern for integrating LLMs into scientific workflows where reproducibility is mandatory. It shifts the reproducibility burden from stochastic generation to deterministic tool execution, with demonstrated time savings and applicability to challenging local-infrastructure settings. The direct multi-platform comparison and real-world deployment data are concrete strengths; the emphasis on topology as a requirement highlights an under-discussed engineering constraint in scientific AI applications.

major comments (3)
  1. [Evaluation] Evaluation section: the claim that the typed tool produces identical results while commercial platforms vary or fail lacks quantitative metrics for output variation (e.g., numerical differences in peak positions or fit parameters), exact definitions of 'valid results,' and any statistical analysis across the four runs per platform. These details are needed to substantiate the cross-platform comparison.
  2. [Methods / Tool Design] Tool construction description: the assertion that structured interviews fully capture 'one researcher's exact procedure' does not address potential loss of nuance, edge cases requiring ongoing judgment, or validation that the resulting tool matches the original analyst's intent on held-out data. This bears on whether the determinism is scientifically complete.
  3. [Discussion] Discussion of deployment topology: the claim that local infrastructure is a 'structural requirement' (rather than a preference) for proprietary formats and licensed software is asserted but not supported by concrete failure cases of cloud alternatives or quantitative comparison of latency/security constraints.
minor comments (2)
  1. [Abstract] The abstract introduces 'typed mediation' without a concise one-sentence definition; adding this would improve immediate clarity for readers.
  2. [Deployment] The six-month deployment results would benefit from at least summary statistics on usage volume, error rates, or specific user feedback themes to strengthen the 'very positive' claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate the revisions we will make to strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that the typed tool produces identical results while commercial platforms vary or fail lacks quantitative metrics for output variation (e.g., numerical differences in peak positions or fit parameters), exact definitions of 'valid results,' and any statistical analysis across the four runs per platform. These details are needed to substantiate the cross-platform comparison.

    Authors: We agree that quantitative details will strengthen the evaluation. In the revised manuscript we will add a table reporting the exact peak positions and fit parameters obtained in each of the four runs for every platform. For the typed tool these values are identical (standard deviation of zero). For the commercial models we will show the observed numerical spreads. We define 'valid results' as outputs that complete without runtime errors and yield physically plausible parameters (peak positions within the expected spectral range for the sample). With only four runs we present the raw values rather than formal statistical tests, but the contrast between zero variation and the observed spreads is clear. revision: yes

  2. Referee: [Methods / Tool Design] Tool construction description: the assertion that structured interviews fully capture 'one researcher's exact procedure' does not address potential loss of nuance, edge cases requiring ongoing judgment, or validation that the resulting tool matches the original analyst's intent on held-out data. This bears on whether the determinism is scientifically complete.

    Authors: The interviews were iterative and explicitly elicited decision rules and edge-case handling, which were then encoded directly into the tool's deterministic logic. We acknowledge that some analyst nuance may still be difficult to capture exhaustively. Validation occurred through six months of live use by the original researchers and additional users, whose feedback confirmed that tool outputs matched their expectations. We will add a paragraph in the Methods section discussing these limitations and the practical validation provided by deployment. revision: partial

  3. Referee: [Discussion] Discussion of deployment topology: the claim that local infrastructure is a 'structural requirement' (rather than a preference) for proprietary formats and licensed software is asserted but not supported by concrete failure cases of cloud alternatives or quantitative comparison of latency/security constraints.

    Authors: Local deployment is required because the instruments produce proprietary binary formats that need specialized local parsers and because the analysis software is available only under per-seat licenses that prohibit cloud execution. We will expand the Discussion to give concrete examples from the two deployed instruments, including the specific licensed packages and format constraints that make cloud alternatives impossible without license violation. While we did not run explicit cloud experiments, the licensing and format barriers are inherent and preclude quantitative cloud comparisons. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances a practical pattern of typed mediation in which an LLM orchestrates pre-defined deterministic tools rather than generating analysis code. Its central claim—that the tool yields identical outputs on repeated identical prompts—is supported by direct empirical comparison: the same photoluminescence prompt is executed four times on the typed tool and on three commercial platforms, with the tool producing identical numerical and methodological results while the others vary or fail. No equations, fitted parameters, or derivations appear in the manuscript; the determinism follows from the tool's implementation as fixed code rather than from any self-referential definition or prediction that reduces to its own inputs. Self-citations are absent from the load-bearing sections, and the six-month deployment observations are presented as corroborating usage data rather than as a mathematical necessity. The evaluation therefore remains an independent test of the proposed workflow.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that exact researcher procedures can be encoded into deterministic, locally executable tools without loss of fidelity, especially for proprietary instrument formats.

axioms (1)
  • domain assumption A researcher's exact procedure for one instrument can be ported into a deterministic tool via structured interviews
    Invoked when describing how tools are created to guarantee identical outputs.
invented entities (1)
  • typed mediation pattern no independent evidence
    purpose: Orchestrating deterministic tools via language model selection to enforce reproducibility
    New pattern introduced to separate model orchestration from execution.

pith-pipeline@v0.9.0 · 5574 in / 1215 out tokens · 64420 ms · 2026-05-14T19:52:31.635103+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Defeating nondeterminism in LLM inference, September 2025

    Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference, September 2025. Connectionism blog

  2. [2]

    Measuring AI agent autonomy in practice, February 2026

    Miles McCain, Thomas Millar, Saffron Huang, Jake Eaton, Kunal Handa, Michael Stern, Alex Tamkin, Matt Kearney, Esin Durmus, Judy Shen, Jerry Hong, Brian Calvert, Jun Sh- ern Chan, Francesco Mosconi, David Saunders, Tyler Neylon, Gabriel Nicholas, Sarah Pol- lack, Jack Clark, and Deep Ganguli. Measuring AI agent autonomy in practice, February 2026

  3. [3]

    John R. Kitchin. The evolving role of programming and LLMs in the development of self-driving laboratories.APL Machine Learning, 3(2):026111, 2025

  4. [4]

    Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs

    Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. InProceed- ings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2024

  5. [5]

    Leemann, Antonin Sulc, and Marco Venturini

    Thorsten Hellert, Drew Bertwistle, Simon C. Leemann, Antonin Sulc, and Marco Venturini. Agentic artificial intelligence for multistage physics experiments at a large-scale user facility particle accelerator.Physical Review Research, 8:L012017, 2025

  6. [6]

    Prince, Tao Zhou, Henry Chan, and Mathew J

    Aikaterini Vriza, Michael H. Prince, Tao Zhou, Henry Chan, and Mathew J. Cherukara. Operating advanced scientific instruments with AI agents that learn on the job.npj Com- putational Materials, 2026

  7. [7]

    Toward full autonomous laboratory instrumentation control with large language models.Small Structures, 2026

    Yong Xie, Kexin He, and Andres Castellanos-Gomez. Toward full autonomous laboratory instrumentation control with large language models.Small Structures, 2026

  8. [8]

    Cooper, Mengjia Zhu, Xenophon Evangelopoulos, and Andrew I

    Abdoulatif Cissé, Max E. Cooper, Mengjia Zhu, Xenophon Evangelopoulos, and Andrew I. Cooper. Can we automate scientific reasoning in closed-loop experiments using large lan- guage models?Digital Discovery, 5:1132–1160, 2026

  9. [9]

    Same prompt, different outcomes: Evaluating the repro- ducibility of data analysis by LLMs

    Jiaxin Cui and Rohan Alexander. Same prompt, different outcomes: Evaluating the repro- ducibility of data analysis by LLMs. 2026. Preprint

  10. [10]

    ChatSpatial: Schema-enforced agentic orches- tration for reproducible and cross-platform spatial transcriptomics.bioRxiv, 2026

    Chen Yang, Xianyang Zhang, and Jun Chen. ChatSpatial: Schema-enforced agentic orches- tration for reproducible and cross-platform spatial transcriptomics.bioRxiv, 2026. Preprint

  11. [11]

    Scaling reproducibility: An AI-assisted workflow for large- scale replication and reanalysis

    Yiqing Xu and Leo Yang Yang. Scaling reproducibility: An AI-assisted workflow for large- scale replication and reanalysis. 2026. Preprint

  12. [12]

    Experiences with model context protocol servers for science and high performance computing

    Haochen Pan, Ryan Chard, Reid Mello, Christopher Grams, Tanjin He, Alexander Brace, Owen Price Skelly, Will Engler, Hayden Holbrook, Song Young Oh, Maxime Gonthier, Michael Papka, Ben Blaiszik, Kyle Chard, and Ian Foster. Experiences with model context protocol servers for science and high performance computing. 2025. Argonne National Laboratory / Globus....

  13. [13]

    Talk freely, execute strictly: Schema-gated agentic AI for flexible and reproducible scientific workflows, March 2026

    Joel Strickland, Arjun Vijeta, Chris Moores, Oliwia Bodek, Bogdan Nenchev, Thomas Whitehead, Charles Phillips, Karl Tassenberg, Gareth Conduit, and Ben Pellegrini. Talk freely, execute strictly: Schema-gated agentic AI for flexible and reproducible scientific workflows, March 2026. Preprint

  14. [14]

    Towards verifiably safe tool use for LLM agents

    Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Chris- tian Kästner. Towards verifiably safe tool use for LLM agents. InProceedings of the 2026 15 IEEE/ACM 48th International Conference on Software Engineering: New Ideas and Emerg- ing Results (ICSE-NIER ’26), pages 1–5, Rio de Janeiro, Brazil, April 2026. ACM

  15. [15]

    Owl-AuraID 1.0: An intelligent sys- tem for autonomous scientific instrumentation and scientific data analysis

    Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, and Wanli Ouyang. Owl-AuraID 1.0: An intelligent sys- tem for autonomous scientific instrumentation and scientific data analysis. 2026. Preprint

  16. [16]

    Introducing the model context protocol, November 2024

    Anthropic. Introducing the model context protocol, November 2024

  17. [17]

    Integrating domain-specialized language models with AI measurement tools for deterministic atomic-resolution experimentation, February 2026

    Zhuo Diao, Kouma Matsumoto, Linfeng Hou, Masahiro Ohara, Hayato Yamashita, and Masayuki Abe. Integrating domain-specialized language models with AI measurement tools for deterministic atomic-resolution experimentation, February 2026. Preprint

  18. [18]

    Mohammed Mehedi Hasan, Hao Li, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E. Hassan. Model context protocol (MCP) tool descriptions are smelly! towards improving AI agent efficiency with augmented MCP tool descriptions. 2026. Preprint

  19. [19]

    From docs to descriptions: Smell-aware evaluation of MCP server descriptions

    Peiran Wang, Ying Li, Yuqiang Sun, Chengwei Liu, Yang Liu, and Yuan Tian. From docs to descriptions: Smell-aware evaluation of MCP server descriptions. 2026. Preprint. A An Example Skill Document for SEM Periodicity Analysis Below is a lightly edited excerpt from the skill document for the SEM periodicity tool (Section 4). The model reads this document be...

  20. [20]

    Read magnification from the SEM info bar (e.g.x40000)

  21. [21]

    Identify the uploaded file from its metadata

  22. [22]

    Decide analysis type: periodicity, particle sizing, or both

  23. [23]

    Call the tool with exact parameters

  24. [24]

    analyze this

    Present results with embedded figures. Analysis Type Selection User requestparticle_analysis Periodicity, spacing, LIPSS, FFT omit (defaultfalse) Particle/grain size, distributiontrue General “analyze this”true(runs both) Tool Invocation Periodicity only: run("sem_fft", { "file_id": "<UUID>", "mag_label": "x40000" }) With particle sizing: run("sem_fft", {...