Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science
Pith reviewed 2026-05-20 11:18 UTC · model grok-4.3
The pith
SEED encodes experimental conditions as typed actor-flow graphs to describe, evaluate, and generate AI-human workflow designs under governance constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental conditions for AI-enabled studies can be represented as typed actor-flow graphs in the SEED framework. This representation supports describing the structure of interactions among actors, evaluating the structural novelty of a candidate design against a library of prior encodings, and generating new candidate designs subject to explicit feasibility and governance constraints. In a diagnostic test contrasting graph-blind and SEED-guided generation for a medical-triage task, the SEED-guided outputs displayed clearer documentation of actor-flow modifications, stated assumptions, and governance validations.
What carries the argument
SEED (Structural Encoding for Experimental Discovery) as typed actor-flow graphs that encode actors, directed flows, and constraint annotations to enable description, novelty evaluation, and constrained generation of experimental designs.
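To make the encoding concrete, here is a minimal sketch of how a typed actor-flow graph might be held in code. It is an illustration under stated assumptions, not the paper's implementation: the class names, the two actor types, the three flow types, and the triage example are all hypothetical, chosen only to mirror the description of actors, directed flows, and constraint annotations.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActorType(Enum):
    # Assumed actor types; the paper's own type vocabulary may differ.
    HUMAN = "human"
    AI = "ai"


class FlowType(Enum):
    # Assumed flow types, named after the mechanisms the review mentions.
    INFORM = "inform"      # one-way information flow
    DELEGATE = "delegate"  # hand-off of a task or decision
    FEEDBACK = "feedback"  # return flow used for correction


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    kind: FlowType
    constraints: tuple[str, ...] = ()  # free-text constraint annotations


@dataclass
class ActorFlowGraph:
    actors: dict[str, ActorType] = field(default_factory=dict)
    flows: list[Flow] = field(default_factory=list)

    def add_actor(self, name: str, kind: ActorType) -> None:
        self.actors[name] = kind

    def add_flow(self, source: str, target: str, kind: FlowType,
                 constraints: tuple[str, ...] = ()) -> None:
        # Reject flows whose endpoints were never declared as actors.
        for endpoint in (source, target):
            if endpoint not in self.actors:
                raise KeyError(f"unknown actor: {endpoint}")
        self.flows.append(Flow(source, target, kind, constraints))


# Hypothetical triage condition: the model drafts, the nurse decides.
triage = ActorFlowGraph()
triage.add_actor("patient", ActorType.HUMAN)
triage.add_actor("triage_model", ActorType.AI)
triage.add_actor("nurse", ActorType.HUMAN)
triage.add_flow("patient", "triage_model", FlowType.INFORM)
triage.add_flow("triage_model", "nurse", FlowType.DELEGATE,
                ("nurse may override",))
triage.add_flow("nurse", "triage_model", FlowType.FEEDBACK)
```

Keeping flows immutable and validating endpoints at insertion time is one way to keep an encoded condition auditable; other designs would work equally well.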
If this is right
- Experimental conditions become comparable and reusable across different studies through shared graph encodings.
- Generation of new designs can systematically incorporate explicit checks for governance and feasibility.
- Structural novelty becomes a measurable property relative to an encoded set of prior designs.
- Accountability improves because assumptions and control points are surfaced in the graph representation.
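The second implication, explicit governance checks during generation, can be sketched as a predicate over an encoded design. The rule below (an AI actor may not send a flow directly to the patient) is a hypothetical example invented for illustration; the paper's actual constraints are not given on this page, and the actor names and candidate designs are likewise assumptions.

```python
# Hypothetical actor typing for a triage condition.
ACTORS = {"patient": "human", "nurse": "human", "triage_model": "ai"}


def governance_violations(actors, flows, protected="patient"):
    """Return flows in which an AI actor reaches the protected actor directly.

    `flows` is a list of (source, target) pairs. The rule itself is an
    illustrative assumption, not a constraint taken from the paper.
    """
    return [
        (src, dst)
        for src, dst in flows
        if actors[src] == "ai" and dst == protected
    ]


# Two hypothetical candidate designs for the same task.
candidate_a = [("patient", "triage_model"), ("triage_model", "patient")]
candidate_b = [
    ("patient", "triage_model"),
    ("triage_model", "nurse"),
    ("nurse", "patient"),
]

violations_a = governance_violations(ACTORS, candidate_a)
violations_b = governance_violations(ACTORS, candidate_b)
```

Under this assumed rule, `candidate_a` is flagged because the model answers the patient directly, while `candidate_b` passes because a nurse sits between the model and the patient.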
Where Pith is reading between the lines
- The graph approach could be applied to generate and audit experiments in domains such as education or organizational decision-making beyond the medical example.
- Libraries of reusable graph templates might emerge for common experiment patterns, reducing the cost of designing new tests.
- Tensions around replication and validity identified in the commentary could be addressed by versioning the graph encodings themselves.
Load-bearing premise
Representing experimental conditions as typed actor-flow graphs captures the key mechanisms of delegation, feedback, and control in human-AI arrangements without significant loss of relevant detail.
What would settle it
An independent replication of the medical-triage design task in which blinded evaluators find no difference in clarity of actor-flow changes, assumptions, or governance checks between SEED-guided and unstructured candidate designs.
Original abstract
AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SEED (Structural Encoding for Experimental Discovery), a framework representing experimental conditions as typed actor-flow graphs to enable description of interaction structures, evaluation of structural novelty relative to prior designs, and generation of candidate designs under feasibility and governance constraints. It reports a lightweight empirical feasibility test contrasting graph-blind and SEED-guided generation in a single medical-triage design task, claiming that SEED outputs exhibit clearer actor-flow changes, assumptions, and governance checks.
Significance. If the representation and generation functions prove robust, SEED could advance traceability, comparability, and auditability of complex human-AI workflow experiments, addressing a timely need as AI agents increasingly participate in organizational and knowledge-production settings. The framing of experiments as design problems with explicit governance checks is a constructive contribution to AI-enabled science methodology.
major comments (2)
- [empirical feasibility test] Empirical feasibility test section: The diagnostic contrast relies on an informal qualitative judgment that SEED-guided designs show 'clearer actor-flow changes, assumptions, and governance checks' without reporting quantitative metrics, pre-specified scoring criteria, blinding procedures, inter-rater reliability, or statistical comparison to the graph-blind baseline. This leaves the central feasibility claim dependent on unverified author assessment rather than reproducible evidence.
- [SEED framework] Framework definition: The claim that typed actor-flow graphs adequately capture mechanisms such as delegation, feedback, and control is asserted without a systematic analysis of representational fidelity or loss of relevant detail; the single-task contrast does not test whether the graph encoding preserves or distorts these dynamics across varied experimental settings.
minor comments (1)
- [conclusion] The abstract and closing commentary reference governance tensions (novelty, replication, validity, diversity, accountability) but the main text would benefit from explicit mapping of how specific SEED operations (description, novelty evaluation, constrained generation) mitigate each tension.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the scope and evidentiary standards appropriate for introducing a design grammar. We address each major comment below and indicate where revisions will be incorporated to improve transparency and precision.
Point-by-point responses
- Referee: [empirical feasibility test] Empirical feasibility test section: The diagnostic contrast relies on an informal qualitative judgment that SEED-guided designs show 'clearer actor-flow changes, assumptions, and governance checks' without reporting quantitative metrics, pre-specified scoring criteria, blinding procedures, inter-rater reliability, or statistical comparison to the graph-blind baseline. This leaves the central feasibility claim dependent on unverified author assessment rather than reproducible evidence.
  Authors: We agree that the presentation relies on qualitative author judgment without formal metrics or procedures. The test was designed as a lightweight diagnostic contrast to illustrate feasibility rather than as a controlled empirical study. In revision we will (1) articulate explicit qualitative criteria used to assess clarity of actor-flow changes, assumptions, and governance checks, (2) include the actual generated designs as supplementary material so readers can inspect them directly, and (3) add an explicit limitations paragraph acknowledging the absence of blinding, inter-rater reliability, and statistical testing. These changes will make the evidence more transparent while preserving the illustrative intent of the section.
  Revision: yes
- Referee: [SEED framework] Framework definition: The claim that typed actor-flow graphs adequately capture mechanisms such as delegation, feedback, and control is asserted without a systematic analysis of representational fidelity or loss of relevant detail; the single-task contrast does not test whether the graph encoding preserves or distorts these dynamics across varied experimental settings.
  Authors: The manuscript presents SEED as an initial structural grammar and employs the medical-triage task as a single illustrative case. We accept that a broader systematic analysis of representational fidelity would strengthen the framework claims. In the revised manuscript we will add a dedicated subsection discussing how delegation, feedback, and control are encoded, together with acknowledged limitations such as the loss of fine-grained temporal sequencing or implicit contextual cues. We will also state explicitly that the single-task contrast is not offered as exhaustive validation and will outline directions for multi-domain testing in future work.
  Revision: partial
Circularity Check
No significant circularity; framework introduced as independent encoding without reduction to inputs or self-referential definitions.
Full rationale
The paper presents SEED as a novel structural encoding framework that represents experimental conditions as typed actor-flow graphs to support description, novelty evaluation, and constrained generation. The feasibility claim rests on a qualitative contrast between graph-blind and SEED-guided outputs in a single medical-triage task, described as showing clearer actor-flow changes and governance checks. No equations, fitted parameters, or derivations are provided that would make any result equivalent to its inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the reported chain. The derivation remains self-contained as an independent design grammar and diagnostic test.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Experimental conditions for human-AI and agentic workflows can be adequately represented as typed actor-flow graphs.
invented entities (1)
- Typed actor-flow graphs: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
SEED represents experimental conditions as typed actor-flow graphs G=(V,E) with actor types Δ/⃝ and flow types →/⇒/↔n
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
Novelty score D(G_new, G_ref) = w_s · δ_struct + w_p · δ_param using graph edit distance
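The weighted form of the novelty score quoted above can be sketched as follows. This is a simplified stand-in, not the paper's method: instead of general graph edit distance, the structural term counts node and edge insertions and deletions with actors matched by name, and the parameter term is a plain sum of absolute differences. The dictionary layout, the parameter name `confidence_threshold`, the weights, and the choice of the minimum distance over a library are all assumptions made for illustration.

```python
def struct_distance(g_new, g_ref):
    """Label-matched edit count: elements present in exactly one graph.

    A retyped actor counts twice (one deletion plus one insertion).
    """
    nodes = set(g_new["actors"].items()) ^ set(g_ref["actors"].items())
    edges = set(g_new["flows"]) ^ set(g_ref["flows"])
    return len(nodes) + len(edges)


def param_distance(g_new, g_ref):
    """Sum of absolute differences over numeric parameters (missing = 0)."""
    p_new, p_ref = g_new["params"], g_ref["params"]
    keys = set(p_new) | set(p_ref)
    return sum(abs(p_new.get(k, 0.0) - p_ref.get(k, 0.0)) for k in keys)


def novelty(g_new, g_ref, w_s=1.0, w_p=1.0):
    """D(G_new, G_ref) = w_s * delta_struct + w_p * delta_param."""
    return w_s * struct_distance(g_new, g_ref) + w_p * param_distance(g_new, g_ref)


def library_novelty(g_new, library, w_s=1.0, w_p=1.0):
    """Distance to the nearest prior design (an assumed aggregation rule)."""
    return min(novelty(g_new, g_ref, w_s, w_p) for g_ref in library)


# Hypothetical reference design and a candidate that adds a nurse.
g_ref = {
    "actors": {"patient": "human", "triage_model": "ai"},
    "flows": {("patient", "triage_model", "inform")},
    "params": {"confidence_threshold": 0.75},
}
g_new = {
    "actors": {"patient": "human", "triage_model": "ai", "nurse": "human"},
    "flows": {
        ("patient", "triage_model", "inform"),
        ("triage_model", "nurse", "delegate"),
    },
    "params": {"confidence_threshold": 0.5},
}

score = novelty(g_new, g_ref, w_s=1.0, w_p=2.0)
```

In this example the candidate differs by one actor and one flow (structural term 2) and by 0.25 in the threshold, so with `w_s=1.0` and `w_p=2.0` the score is 2.5. How the weights should be set is not stated on this page.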
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.