Experiments in Agentic AI for Science

Geoffrey Fox; Judy Fox

arxiv: 2605.26305 · v2 · pith:274RC4Q5new · submitted 2026-05-25 · 💻 cs.AI · cs.SY · eess.SY · hep-ph

Pith reviewed 2026-06-29 21:07 UTC · model grok-4.3

classification 💻 cs.AI · cs.SY · eess.SY · hep-ph

keywords agentic AI · scientific workflows · time-series curation · lecture analysis · data deduplication · knowledge graphs · hybrid architecture · high-energy physics

The pith

Agentic AI with hybrid local-remote architecture automates time-series data curation and complex lecture analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that agentic AI systems built with a hybrid setup of local control and remote model calls can carry out demanding scientific tasks that exceed the reliable context and reasoning capacity of current large language models. It demonstrates this through two working systems that handle dataset curation with deduplication and turn dense physics lectures into structured reports. A reader would care because these methods rely on concrete engineering steps rather than waiting for larger models. The work also sketches how the same approach could extend to organizing scientific knowledge in graph form and to specific domains like high-energy physics.

Core claim

The central claim is that a hybrid architecture in which local Python orchestrators invoke remote language-model backends, combined with granular attribute extraction, remote data inspection, and distributed concurrency controls, allows agentic systems to overcome the context and reasoning limits of standalone models and thereby support rigorous scientific workflows such as large-scale time-series curation and conversion of mathematically complex lectures into reports.

What carries the argument

The hybrid local body and remote brain architecture in which local orchestrators coordinate calls to language-model backends while applying granular attribute extraction, remote inspection, and concurrency controls.
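The pattern named here, a local orchestrator that throttles calls to a remote model, can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the authors' code: `call_remote_model` is a hypothetical stub standing in for whatever backend client is used, and `MAX_IN_FLIGHT`, `run_task`, `orchestrate`, and `run_all` are names invented for this example.

```python
import asyncio

MAX_IN_FLIGHT = 4  # hypothetical cap on concurrent remote calls


async def call_remote_model(prompt: str) -> str:
    """Hypothetical stand-in for a remote LLM backend call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"


async def run_task(prompt: str, limiter: asyncio.Semaphore, retries: int = 2) -> str:
    """Run one prompt under the concurrency cap, retrying on failure."""
    for attempt in range(retries + 1):
        try:
            async with limiter:  # at most MAX_IN_FLIGHT calls run at once
                return await call_remote_model(prompt)
        except Exception:
            if attempt == retries:
                raise  # surface the error instead of hiding it
            await asyncio.sleep(0.1 * (attempt + 1))  # simple linear backoff


async def orchestrate(prompts):
    """Fan out all prompts and collect results in input order."""
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(run_task(p, limiter) for p in prompts))


def run_all(prompts):
    """Synchronous entry point for use from a plain script."""
    return asyncio.run(orchestrate(prompts))
```

In a notebook that already has an event loop running, `asyncio.run` raises a `RuntimeError`, so one would `await orchestrate(prompts)` directly instead. The semaphore is the only concurrency control shown; rate limits, timeouts, and logging of failed calls would be natural additions.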

If this is right

Large-scale curation, extraction, and deduplication of time-series datasets can be performed autonomously.
Visually dense and mathematically complex physics lectures can be converted into structured scientific reports without manual intervention.
The same methods generalize to construction of deep knowledge graphs from scientific material.
The approach applies to high-energy physics data organization and analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The engineering practices could be tested on presentation material from fields other than physics to check transferability.
Integration of the curation system with existing scientific databases might improve deduplication accuracy beyond what the paper demonstrates.
Focusing on system-level controls rather than model size alone suggests a route to more reliable AI agents in other data-heavy research areas.

Load-bearing premise

The hybrid local-remote setup can coordinate the language-model backends at the required scale without introducing unhandled errors or context losses.

What would settle it

Executing the curation system on a dataset large enough to trigger context overflow and then measuring whether duplicates are missed or data is lost would show whether the engineering steps actually solve the claimed limitations.
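A test of this kind needs a way to count missed duplicates. The sketch below shows one possible scoring scheme, assuming a hand-labeled set of true duplicate pairs is available. The name-normalization rule is a deliberately simple stand-in and is not the deduplication method used in the paper; the record names and all function names are hypothetical.

```python
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and keep only alphanumerics so trivial variants collide."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def predicted_duplicate_pairs(records):
    """Return pairs of record ids whose normalized names match exactly."""
    groups = {}
    for rec_id, name in records.items():
        groups.setdefault(normalize(name), []).append(rec_id)
    pairs = set()
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def precision_recall(predicted, truth):
    """Precision and recall of predicted duplicate pairs against labeled pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical catalog entries: ids 1, 2, and 4 describe the same dataset.
records = {
    1: "River Flow-A",
    2: "riverflow a",
    3: "Solar Output",
    4: "River Flow A (daily)",
}
truth = {(1, 2), (1, 4), (2, 4)}

predicted = predicted_duplicate_pairs(records)
precision, recall = precision_recall(predicted, truth)
missed = truth - predicted  # duplicates the rule failed to catch
```

Here the simple rule finds only the pair (1, 2), so precision is perfect while two of the three true pairs are missed. Running the same scoring on catalogs of increasing size would show whether recall degrades as the workload grows.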

Original abstract

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering, such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls, we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Two agent frameworks for time-series curation and lecture analysis are described in detail but rest on architecture claims without any metrics, error rates, or baseline comparisons.

The paper walks through two systems built on a hybrid local Python orchestrator with remote LLM backends. DeepTS/DeepCollector handles large-scale time-series data extraction, deduplication, and curation. DeepScribe turns dense physics lectures into structured reports. It spells out engineering moves such as granular attribute extraction, remote data inspection, and concurrency controls to manage context windows.
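One plausible reading of granular attribute extraction is that each model call asks for a single attribute of a single record, so that no prompt grows with the size of the table. The sketch below illustrates that reading only. It is an assumption about the general idea, not a description of how Cellular RAG is implemented, and the attribute schema, the truncation limit, and the function names are all hypothetical.

```python
ATTRIBUTES = ["name", "domain", "sampling_interval"]  # hypothetical schema


def build_cell_prompts(record_id, source_text, attributes=ATTRIBUTES, max_chars=500):
    """Build one small prompt per (record, attribute) cell."""
    snippet = source_text[:max_chars]  # crude bound on the context sent per call
    return {
        (record_id, attr): (
            f"From the text below, report only the '{attr}' of the dataset.\n\n{snippet}"
        )
        for attr in attributes
    }


def fill_table(prompts, ask):
    """Send each cell prompt through `ask` and assemble rows keyed by record id."""
    table = {}
    for (record_id, attr), prompt in prompts.items():
        table.setdefault(record_id, {})[attr] = ask(prompt)
    return table


# Usage with a stub in place of a real model call.
prompts = build_cell_prompts("ds1", "x" * 2000)
table = fill_table(prompts, ask=lambda prompt: "unknown")
```

The trade-off is more calls in exchange for a bounded prompt size per call, which is why this style of extraction pairs naturally with concurrency controls.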

Those details are the main concrete part. The description of the Colab setup and the mention of Cellular RAG give a reader a sense of how the authors wired the pieces together. The brief nod to generalizing toward knowledge graphs and high-energy physics applications shows they are thinking about next steps.

The problem is that the central claim, that these choices overcome LLM context and reasoning limits to rigorously support scientific workflows, has no supporting numbers. No error rates, no deduplication precision, no report completeness scores, no head-to-head results against plain LLM calls or other agent patterns. The abstract and the outline treat the architecture itself as the demonstration.

This leaves the work as a systems description rather than an evaluated result. Without code, data, or quantitative checks, it is hard to tell whether the added machinery reduces errors or simply shifts them. The circularity is standard for framework papers that define success by their own design choices.

People already building similar LLM agents for data pipelines might pick up a few implementation ideas. Anyone looking for evidence that agentic setups measurably improve scientific tasks will not find it here. The paper does not reach the threshold for a serious referee process in its present form; it would need at least an evaluation section with reproducible measurements before that makes sense.

Referee Report

2 major / 2 minor

Summary. The paper claims to detail two novel agentic AI frameworks for scientific workflows. DeepTS/DeepCollector automates large-scale curation, extraction, and deduplication of time-series datasets using a hybrid Local Body/Remote Brain architecture (Google Colab with Python orchestrators calling LLM backends). DeepScribe autonomously converts visually dense, mathematically complex physics lectures into structured reports. The authors assert that techniques such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls overcome the context and reasoning limitations of current LLMs to rigorously support scientific tasks. The manuscript also outlines a generalization of DeepTS to deep knowledge graphs and discusses applications to high-energy physics (DeepQCD).

Significance. If the central claims were supported by quantitative evidence, the work would be significant for demonstrating practical systems engineering approaches to reliable agentic AI in data curation and scientific document analysis. The hybrid architecture and Cellular RAG concept could offer reusable patterns for addressing LLM limitations in long-context scientific applications, with potential impact on automation in physics and related fields.

major comments (2)

[Abstract] The claim that the systems 'overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows' is load-bearing for the paper's contribution but is unsupported by any reported metrics (e.g., deduplication precision, report completeness, error rates, context utilization, or baseline comparisons), leaving the assertion unverified.
[Abstract, DeepScribe and DeepTS descriptions] No quantitative evaluation or failure analysis is provided for the reliability of the Colab/Python-orchestrator hybrid architecture in coordinating LLM backends at the scale needed for large-scale curation and lecture analysis, which is required to substantiate that no new unhandled errors or context losses are introduced.

minor comments (2)

The manuscript would benefit from explicit references to prior work on agentic AI frameworks and RAG variants to better situate the Cellular RAG contribution.
Notation for 'Cellular RAG' and the Local Body/Remote Brain architecture should be defined more formally on first use to improve clarity for readers unfamiliar with the specific implementation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. We address each of the major comments below and outline the revisions we will make to the manuscript.

Referee: [Abstract] The claim that the systems 'overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows' is load-bearing for the paper's contribution but is unsupported by any reported metrics (e.g., deduplication precision, report completeness, error rates, context utilization, or baseline comparisons), leaving the assertion unverified.

Authors: We agree that the abstract's claim would benefit from supporting quantitative evidence to strengthen the paper's contribution. The current manuscript focuses on describing the novel agentic frameworks and their architectural innovations. In the revised version, we will tone down the claim in the abstract to emphasize the design and implementation of the systems, and add a dedicated evaluation section reporting metrics such as deduplication precision for DeepTS and completeness scores for DeepScribe reports, along with baseline comparisons where feasible.

revision: yes

Referee: [Abstract, DeepScribe and DeepTS descriptions] No quantitative evaluation or failure analysis is provided for the reliability of the Colab/Python-orchestrator hybrid architecture in coordinating LLM backends at the scale needed for large-scale curation and lecture analysis, which is required to substantiate that no new unhandled errors or context losses are introduced.

Authors: The referee correctly identifies the absence of quantitative reliability assessments in the manuscript. While the paper details the engineering approaches like Cellular RAG and concurrency controls to mitigate known LLM limitations, we did not include systematic failure analysis or error rate measurements. We will revise the manuscript to include such an analysis, drawing from our experimental runs, including any observed context losses or orchestration errors, to better substantiate the claims.

revision: yes
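The failure analysis requested here could start from something as simple as a tally of per-call outcomes. The sketch below is a minimal illustration of that bookkeeping; the outcome labels (`ok`, `timeout`, `context_overflow`) and the function name are hypothetical, and the sample list is made up for the example rather than taken from any run of the systems.

```python
from collections import Counter


def summarize_runs(outcomes):
    """Return the overall error rate and a per-category count of outcome labels."""
    counts = Counter(outcomes)
    total = len(outcomes)
    failures = total - counts.get("ok", 0)  # anything not labeled "ok" is a failure
    error_rate = failures / total if total else 0.0
    return error_rate, counts


# Hypothetical outcome log, one label per orchestrated call.
sample = ["ok", "ok", "timeout", "ok", "context_overflow"]
error_rate, counts = summarize_runs(sample)
```

Reporting the per-category counts alongside the overall rate would separate orchestration failures such as timeouts from model-side failures such as context overflow.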

Circularity Check

0 steps flagged

No circularity: descriptive systems paper with no derivation chain or self-referential reductions.

The provided text consists of an abstract and high-level description of two agentic AI frameworks (DeepTS/DeepCollector and DeepScribe) using a hybrid Local Body/Remote Brain architecture. No equations, fitted parameters, predictions, or uniqueness theorems appear. Claims of overcoming LLM limitations are presented as outcomes of the described engineering practices (granular attribute extraction, concurrency controls), but these are not shown to reduce to inputs by construction, self-citation, or renaming. This matches the expected pattern for an engineering/experiments paper that is self-contained in its implementation narrative without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The central claims rest on unstated assumptions about LLM orchestration reliability and the effectiveness of the named engineering techniques.
