Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

Chun Feng; Tianshu Sun; Weizhang Zhu; Yingjie Zhang

arxiv: 2605.17746 · v1 · pith:LT3FSHOTnew · submitted 2026-05-18 · 💻 cs.AI · cs.HC

Pith reviewed 2026-05-20 11:18 UTC · model grok-4.3

classification 💻 cs.AI cs.HC

keywords: SEED framework · actor-flow graphs · AI agents · experimental design · workflow representation · governance in AI · human-AI interaction · design grammar

The pith

SEED encodes experimental conditions as typed actor-flow graphs to describe, evaluate, and generate AI-human workflow designs under governance constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SEED as a way to represent experimental setups involving AI agents and humans as structured graphs of actors and information flows. This addresses the difficulty of specifying such experiments in plain prose, which hinders comparison, reuse, and checks on how decisions are delegated or controlled. SEED enables three operations: mapping out the interaction structure, measuring how new designs differ from previous ones, and creating candidate setups that satisfy feasibility and oversight rules. A small test on medical triage workflows found that designs produced with SEED made the actor changes, assumptions, and governance elements more explicit than designs made without the structure. Readers should care because AI systems are increasingly embedded in knowledge work, and better tools for testing those arrangements could improve accountability without requiring entirely new methods.

Core claim

Experimental conditions for AI-enabled studies can be represented as typed actor-flow graphs in the SEED framework. This representation supports describing the structure of interactions among actors, evaluating the structural novelty of a candidate design against a library of prior encodings, and generating new candidate designs subject to explicit feasibility and governance constraints. In a diagnostic test contrasting graph-blind and SEED-guided generation for a medical-triage task, the SEED-guided outputs displayed clearer documentation of actor-flow modifications, stated assumptions, and governance validations.

What carries the argument

SEED (Structural Encoding for Experimental Discovery) as typed actor-flow graphs that encode actors, directed flows, and constraint annotations to enable description, novelty evaluation, and constrained generation of experimental designs.
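
A minimal sketch of what such an encoding could look like in Python follows. The class names (`Actor`, `Flow`, `ActorFlowGraph`), the actor kinds (`human`, `ai_agent`), the flow kinds (`information`, `delegation`, `feedback`), and the triage actors are all illustrative assumptions; they are not the paper's vocabulary or schema.

```python
from dataclasses import dataclass, field

# Hypothetical type vocabularies; the paper's own actor and flow types may differ.
ACTOR_KINDS = {"human", "ai_agent"}
FLOW_KINDS = {"information", "delegation", "feedback"}


@dataclass(frozen=True)
class Actor:
    name: str
    kind: str  # one of ACTOR_KINDS


@dataclass(frozen=True)
class Flow:
    source: str  # name of the sending actor
    target: str  # name of the receiving actor
    kind: str    # one of FLOW_KINDS


@dataclass
class ActorFlowGraph:
    actors: dict = field(default_factory=dict)       # name -> Actor
    flows: list = field(default_factory=list)        # directed, typed edges
    constraints: list = field(default_factory=list)  # free-text annotations

    def add_actor(self, name, kind):
        if kind not in ACTOR_KINDS:
            raise ValueError(f"unknown actor kind: {kind}")
        self.actors[name] = Actor(name, kind)

    def add_flow(self, source, target, kind):
        if kind not in FLOW_KINDS:
            raise ValueError(f"unknown flow kind: {kind}")
        if source not in self.actors or target not in self.actors:
            raise KeyError("both endpoints must be declared actors")
        self.flows.append(Flow(source, target, kind))


# Illustrative condition: a triage model advises a nurse, who reports back.
condition = ActorFlowGraph()
condition.add_actor("nurse", "human")
condition.add_actor("triage_model", "ai_agent")
condition.add_flow("triage_model", "nurse", "information")
condition.add_flow("nurse", "triage_model", "feedback")
condition.constraints.append("nurse makes the final triage decision")
```

Rejecting undeclared actors and unknown types at insertion time is one way to keep encodings comparable across studies, since every stored graph then uses the same closed vocabulary.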

If this is right

Experimental conditions become comparable and reusable across different studies through shared graph encodings.
Generation of new designs can systematically incorporate explicit checks for governance and feasibility.
Structural novelty becomes a measurable property relative to an encoded set of prior designs.
Accountability improves because assumptions and control points are surfaced in the graph representation.
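
The governance and accountability points above can be made concrete with a small check over an encoded design. The sketch below assumes a hypothetical rule (every path from an AI actor to the decision node must pass through a human) and a plain dictionary-and-edge-list encoding; neither is taken from the paper.

```python
def unreviewed_paths(actors, flows, decision):
    """Return AI actors that can reach `decision` without passing a human.

    actors:   dict mapping actor name -> "human" or "ai_agent" (assumed labels)
    flows:    list of (source, target) directed edges
    decision: name of the node where the final decision is recorded
    """
    successors = {}
    for source, target in flows:
        successors.setdefault(source, []).append(target)

    violations = []
    for start, kind in actors.items():
        if kind != "ai_agent":
            continue
        # Depth-first search that refuses to step through human actors.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, []):
                if nxt == decision:
                    violations.append(start)
                    stack.clear()  # stop searching from this start node
                    break
                if nxt in seen or actors.get(nxt) == "human":
                    continue
                seen.add(nxt)
                stack.append(nxt)
    return violations


actors = {"triage_model": "ai_agent", "nurse": "human", "scheduler": "ai_agent"}
safe = [("triage_model", "nurse"), ("nurse", "decision")]
risky = safe + [("triage_model", "scheduler"), ("scheduler", "decision")]

print(unreviewed_paths(actors, safe, "decision"))   # []
print(unreviewed_paths(actors, risky, "decision"))  # ['triage_model', 'scheduler']
```

A check of this kind only surfaces control points that the graph actually encodes, so its value depends on the representational fidelity questioned under the load-bearing premise.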

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The graph approach could be applied to generate and audit experiments in domains such as education or organizational decision-making beyond the medical example.
Libraries of reusable graph templates might emerge for common experiment patterns, reducing the cost of designing new tests.
Tensions around replication and validity identified in the commentary could be addressed by versioning the graph encodings themselves.
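
As a rough illustration of the versioning idea in the last point, an encoding can be given a content-derived identifier so that any change to actors or flows yields a new version. The function name `encoding_version`, the dictionary-and-tuple encoding, and the 12-character truncation are assumptions made for this sketch.

```python
import hashlib
import json


def encoding_version(actors, flows):
    """Return a short content hash that changes whenever the encoding changes.

    actors: dict mapping actor name -> actor type
    flows:  iterable of (source, target, flow_type) tuples
    """
    canonical = {
        "actors": sorted(actors.items()),
        "flows": sorted(flows),
    }
    # Sorting the entries and fixing the separators makes the serialization
    # independent of the order in which actors and flows were recorded.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


actors = {"nurse": "human", "triage_model": "ai_agent"}
flows_v1 = [("triage_model", "nurse", "information")]
flows_v2 = flows_v1 + [("nurse", "triage_model", "feedback")]

print(encoding_version(actors, flows_v1) == encoding_version(actors, flows_v2))  # False
```

A replication could then cite the identifier of the exact encoding it reproduces, and a deliberate variation would carry a different one.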

Load-bearing premise

Representing experimental conditions as typed actor-flow graphs captures the key mechanisms of delegation, feedback, and control in human-AI arrangements without significant loss of relevant detail.

What would settle it

An independent replication of the medical-triage design task in which blinded evaluators find no difference in clarity of actor-flow changes, assumptions, or governance checks between SEED-guided and unstructured candidate designs.

Original abstract

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SEED gives a graph grammar for representing AI experiment designs that could help with comparison and generation, but the single informal test leaves the practical payoff unclear.

The paper's core move is to treat experimental conditions involving AI agents as typed actor-flow graphs. This supports three operations: describing the structure, scoring novelty against prior encoded designs, and generating new candidates under feasibility and governance rules. That framing directly tackles the problem of comparing and auditing human-AI workflows that currently live in loose prose descriptions. The medical-triage example illustrates how the graphs can surface actor changes, assumptions, and checks more explicitly than a text version does. Those pieces are useful as a design aid for anyone running experiments on delegation or multi-agent setups.

The main limitation is the feasibility test itself. It covers only one task, offers no quantitative scores, no blinding, and no comparison against a stronger baseline, so the claim that SEED produces clearer designs rests on the authors' judgment rather than reproducible evidence. The assumption that actor-flow graphs capture the important mechanisms without losing detail also sits untested beyond that single case.

Readers working on experimental methods for AI in organizations or knowledge work would find the representation idea worth considering. It is not yet a finished method, but the problem is timely and the approach is distinct from existing workflow notations. I would send it to peer review so the authors can get concrete suggestions on how to strengthen the evaluation and check the representation's coverage.

Referee Report

2 major / 1 minor

Summary. The paper introduces SEED (Structural Encoding for Experimental Discovery), a framework representing experimental conditions as typed actor-flow graphs to enable description of interaction structures, evaluation of structural novelty relative to prior designs, and generation of candidate designs under feasibility and governance constraints. It reports a lightweight empirical feasibility test contrasting graph-blind and SEED-guided generation in a single medical-triage design task, claiming that SEED outputs exhibit clearer actor-flow changes, assumptions, and governance checks.

Significance. If the representation and generation functions prove robust, SEED could advance traceability, comparability, and auditability of complex human-AI workflow experiments, addressing a timely need as AI agents increasingly participate in organizational and knowledge-production settings. The framing of experiments as design problems with explicit governance checks is a constructive contribution to AI-enabled science methodology.

major comments (2)

[empirical feasibility test] Empirical feasibility test section: The diagnostic contrast relies on an informal qualitative judgment that SEED-guided designs show 'clearer actor-flow changes, assumptions, and governance checks' without reporting quantitative metrics, pre-specified scoring criteria, blinding procedures, inter-rater reliability, or statistical comparison to the graph-blind baseline. This leaves the central feasibility claim dependent on unverified author assessment rather than reproducible evidence.
[SEED framework] Framework definition: The claim that typed actor-flow graphs adequately capture mechanisms such as delegation, feedback, and control is asserted without a systematic analysis of representational fidelity or loss of relevant detail; the single-task contrast does not test whether the graph encoding preserves or distorts these dynamics across varied experimental settings.

minor comments (1)

[conclusion] The abstract and closing commentary reference governance tensions (novelty, replication, validity, diversity, accountability) but the main text would benefit from explicit mapping of how specific SEED operations (description, novelty evaluation, constrained generation) mitigate each tension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and evidentiary standards appropriate for introducing a design grammar. We address each major comment below and indicate where revisions will be incorporated to improve transparency and precision.

Referee: [empirical feasibility test] Empirical feasibility test section: The diagnostic contrast relies on an informal qualitative judgment that SEED-guided designs show 'clearer actor-flow changes, assumptions, and governance checks' without reporting quantitative metrics, pre-specified scoring criteria, blinding procedures, inter-rater reliability, or statistical comparison to the graph-blind baseline. This leaves the central feasibility claim dependent on unverified author assessment rather than reproducible evidence.

Authors: We agree that the presentation relies on qualitative author judgment without formal metrics or procedures. The test was designed as a lightweight diagnostic contrast to illustrate feasibility rather than as a controlled empirical study. In revision we will (1) articulate explicit qualitative criteria used to assess clarity of actor-flow changes, assumptions, and governance checks, (2) include the actual generated designs as supplementary material so readers can inspect them directly, and (3) add an explicit limitations paragraph acknowledging the absence of blinding, inter-rater reliability, and statistical testing. These changes will make the evidence more transparent while preserving the illustrative intent of the section.
Revision: yes

Referee: [SEED framework] Framework definition: The claim that typed actor-flow graphs adequately capture mechanisms such as delegation, feedback, and control is asserted without a systematic analysis of representational fidelity or loss of relevant detail; the single-task contrast does not test whether the graph encoding preserves or distorts these dynamics across varied experimental settings.

Authors: The manuscript presents SEED as an initial structural grammar and employs the medical-triage task as a single illustrative case. We accept that a broader systematic analysis of representational fidelity would strengthen the framework claims. In the revised manuscript we will add a dedicated subsection discussing how delegation, feedback, and control are encoded, together with acknowledged limitations such as the loss of fine-grained temporal sequencing or implicit contextual cues. We will also state explicitly that the single-task contrast is not offered as exhaustive validation and will outline directions for multi-domain testing in future work.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework introduced as independent encoding without reduction to inputs or self-referential definitions.

The paper presents SEED as a novel structural encoding framework that represents experimental conditions as typed actor-flow graphs to support description, novelty evaluation, and constrained generation. The feasibility claim rests on a qualitative contrast between graph-blind and SEED-guided outputs in a single medical-triage task, described as showing clearer actor-flow changes and governance checks. No equations, fitted parameters, or derivations are provided that would make any result equivalent to its inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked in the reported chain. The derivation remains self-contained as an independent design grammar and diagnostic test.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the modeling choice that actor-flow graphs can represent experimental conditions; no free parameters or invented entities with independent evidence are described.

axioms (1)

Domain assumption: Experimental conditions for human-AI and agentic workflows can be adequately represented as typed actor-flow graphs.
This is the foundational representation choice invoked to enable the three design functions.

invented entities (1)

Typed actor-flow graphs (no independent evidence)
purpose: To encode experimental conditions structurally for description, comparison, and generation.
New representational construct introduced by the framework.

pith-pipeline@v0.9.0 · 5789 in / 1132 out tokens · 34914 ms · 2026-05-20T11:18:43.792886+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

SEED represents experimental conditions as typed actor-flow graphs G=(V,E) with actor types Δ/○ and flow types →/⇒/↔n

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Novelty score D(G_new, G_ref) = w_s · δ_struct + w_p · δ_param using graph edit distance
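
The weighted form of this score can be sketched in a few lines of Python. The version below is a simplification under stated assumptions: actors are matched by name, so the structural term counts typed nodes and edges that appear in only one graph, which stands in for a full graph edit distance that would also search over node correspondences. The parameter term, the normalization to [0, 1], the dictionary layout, and the default weights of 0.5 are likewise assumptions, not values from the paper.

```python
def structural_distance(g_new, g_ref):
    """Share of typed nodes and edges found in only one of the two graphs."""
    items_new = set(g_new["actors"].items()) | set(g_new["flows"])
    items_ref = set(g_ref["actors"].items()) | set(g_ref["flows"])
    union = items_new | items_ref
    if not union:
        return 0.0
    return len(items_new ^ items_ref) / len(union)


def parameter_distance(p_new, p_ref):
    """Share of parameter keys whose values differ between the two designs."""
    keys = set(p_new) | set(p_ref)
    if not keys:
        return 0.0
    differing = sum(1 for k in keys if p_new.get(k) != p_ref.get(k))
    return differing / len(keys)


def novelty(g_new, g_ref, w_s=0.5, w_p=0.5):
    """D(G_new, G_ref) = w_s * delta_struct + w_p * delta_param."""
    d_struct = structural_distance(g_new, g_ref)
    d_param = parameter_distance(g_new.get("params", {}), g_ref.get("params", {}))
    return w_s * d_struct + w_p * d_param


# Hypothetical encodings: the new design adds a feedback flow and a second round.
g_ref = {
    "actors": {"nurse": "human", "triage_model": "ai_agent"},
    "flows": {("triage_model", "nurse", "information")},
    "params": {"rounds": 1},
}
g_new = {
    "actors": {"nurse": "human", "triage_model": "ai_agent"},
    "flows": {
        ("triage_model", "nurse", "information"),
        ("nurse", "triage_model", "feedback"),
    },
    "params": {"rounds": 2},
}

print(novelty(g_new, g_ref))  # 0.625
```

In the example, one of four typed elements differs (structural term 0.25) and the single parameter differs (parameter term 1.0), giving 0.5 · 0.25 + 0.5 · 1.0 = 0.625.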

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [3]

Hemant K Bhargava, Susan Brown, Anindya Ghose, Alok Gupta, Dorothy Leidner, and DJ Wu. 2025. Exploring Generative AI’s Impact on Research: Perspectives from Senior Scholars in Management Information Systems.ACM Transactions on Management Information Systems16, 2, Article 19 (2025), 9 pages. doi:10.1145/3721846

work page doi:10.1145/3721846 2025

[2] [4]

E Brynjolfsson, D Li, and LR Raymond. 2025. Generative AI at work.The Quarterly Journal of Economics140, 2 (2025), 889–942. doi:10.1093/qje/qjae044

work page doi:10.1093/qje/qjae044 2025

[3] [5]

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2024. Large Language Models as Tool Makers. InInternational Conference on Learning Representations (ICLR). OpenReview.net, Vienna, Austria, 23 pages. https://openreview.net/forum?id=qV83K9d5WB

work page 2024

[4] [9]

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self-Correct Reasoning Yet. InInternational Conference on Learning Representations (ICLR). OpenReview.net, Vienna, Austria, 17 pages. https://openreview.net/forum?id=IkmD3fKBPQ

work page 2024

[5] [10]

Anna Kawakami, Venkatesh Sivaraman, Hao-Fei Cheng, Logan Stapleton, Yanghuidi Cheng, Diana Qing, Adam Perer, Zhiwei Steven Wu, Haiyi Zhu, and Kenneth Holstein. 2022. Improving Human-AI Partnerships in Child Welfare: Understanding Worker Practices, Challenges, and Desires for Algorithmic Decision Support. InProceedings of the 2022 CHI Conference on Human F...

work page doi:10.1145/3491102.3517439 2022

[6] [11]

Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. InProceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Association for Computing Machinery, New York, NY, USA, 1168–1176. doi:10.1145/2487575.2488217

work page doi:10.1145/2487575.2488217 2013

[7] [12]

Pan Li and Alexander Tuzhilin. 2020. Ddtcdr: Deep dual transfer cross domain recommendation. InProceedings of the 13th international conference on web search and data mining. 331–339. doi:10.1145/3336191.3371793

work page doi:10.1145/3336191.3371793 2020

[8] [13]

Jessy Lin, Nicholas Tomlin, Jacob Andreas, and Jason Eisner. 2024. Decision-Oriented Dialogue for Human-AI Collaboration. Transactions of the Association for Computational Linguistics 12 (2024), 892–911. doi:10.1162/tacl_a_00679

[9] [14]

Jens Ludwig and Sendhil Mullainathan. 2024. Machine learning as a tool for hypothesis generation. The Quarterly Journal of Economics 139, 2 (2024), 751–827. doi:10.1093/qje/qjad055

[10] [16]

Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349, 6251 (2015), aac4716.

[11] [17]

Phanish Puranam. 2021. Human–AI Collaborative Decision-Making as an Organization Design Problem. Journal of Organization Design 10 (2021), 75–80. doi:10.1007/s41469-021-00095-2

[12] [18]

Charvi Rastogi, Yunfeng Zhang, Dennis Wei, Kush R. Varshney, Amit Dhurandhar, and Richard Tomsett. 2022. Deciding Fast and Slow: The Role of Cognitive Biases in AI-assisted Decision-Making. Proceedings of the ACM on Human-Computer Interaction 6, CSCW1 (2022), 1–22. doi:10.1145/3512930

[13] [21]

Anjana Susarla, Ram Gopal, Jason Bennett Thatcher, and Suprateek Sarker. 2023. The Janus effect of generative AI: Charting the path for responsible conduct of scholarly activities in information systems. Information Systems Research 34, 2 (2023), 399–408. doi:10.1287/isre.2023.ed.v34.n2

[14] [22]

Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. 2025. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature 646, 8085 (2025), 716–723. doi:10.1038/s41586-025-09442-9

[15] [23]

Michael Vössing, Niklas Kühl, Matteo Lind, and Gerhard Satzger. 2022. Designing Transparency for Effective Human-AI Collaboration. Information Systems Frontiers 24 (2022), 877–895. doi:10.1007/s10796-022-10284-3

[16] [24]

Lingli Wang, Ni Huang, Yumei He, De Liu, Xunhua Guo, Yan Sun, and Guoqing Chen. 2025. Artificial Intelligence (AI) Assistant in Online Shopping: A Randomized Field Experiment on a Livestream Selling Platform. Information Systems Research 36, 4 (2025), 2358–2374. doi:10.1287/isre.2023.0103

[17] [25]

Heng Xu and Nan Zhang. 2022. From Contextualizing to Context Theorizing: Assessing Context Effects in Privacy Research. Management Science 68, 10 (2022), 7383–7401. doi:10.1287/mnsc.2021.4249

[18] [26]

Yuqian Xu, Hongyan Dai, and Wanfeng Yan. 2024. Identity Disclosure and Anthropomorphism in Voice Chatbot Design: A Field Experiment. Management Science 72, 1 (2024), 223–241. doi:10.1287/mnsc.2022.03833

[19] [29]

Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI). Association for Computing Machine... doi:10.1145/3290605.3300233

[20] [30]

Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. 2019. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7, 1 (2019), 2–11. doi:10.1609/hcomp.v7i1.5285

[21] [31]

Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, Lennart Heim, and Markus Anderljung. 2024. Visibility into AI Agents. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT). Association for Computing Machinery, New York, NY,... doi:10.1145/3630106.3658948

[22] [32]

Zenan Chen and Jason Chan. 2024. Large Language Model in Creative Work: The Role of Collaboration Modality and User Expertise. Management Science 70, 12 (2024), 9101–9117. doi:10.1287/mnsc.2023.03014

[23] [33]

A Fügener, J Grahl, A Gupta, and W Ketter. 2022. Cognitive Challenges in Human–Artificial Intelligence Collaboration: Investigating the Path Toward Productive Delegation. Information Systems Research 33, 2 (2022), 678–696. doi:10.1287/isre.2021.1079

[24] [34]

Ethan Goh, Robert J. Gallo, Eric Strong, Yingjie Weng, Hannah Kerman, Jason A. Freed, Joséphine A. Cool, Zahir Kanjee, Kathleen P. Lane, Andrew S. Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew P. J. Olson, Jason Hom, Jonathan H. Chen, and Adam Rodman. 2025. GPT-4 Assistance for Improvement of Physician Performance on Patient Car... doi:10.1038/s41591-024-03456-y

[25] [35]

Eeshaan Jain, Indradyumna Roy, Saswat Meher, Soumen Chakrabarti, and Abir De. 2024. Graph Edit Distance with General Costs Using Neural Set Divergence. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 37. Curran Associates, Inc., Red Hook, NY, USA, 40 pages. doi:10.52202/079017-2335

Manuscript submitted to ACM

[26] [36]

Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI). Association for Computing Machinery, New York, NY, USA, 14 pages. doi:10.1145/3290605.3300641

[27] [37]

Raphael Koster, Jan Balaguer, Andrea Tacchetti, Ari Weinstein, Tina Zhu, Oliver Hauser, Duncan Williams, Lucy Campbell-Gillingham, Phoebe Thacker, Matthew Botvinick, et al. 2022. Human-centred mechanism design with Democratic AI. Nature Human Behaviour 6, 10 (2022), 1398–1407. doi:10.1038/s41562-022-01383-x

[28] [38]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. Curran Associates, Inc....

[29] [39]

Hussein Mozannar and David Sontag. 2020. Consistent Estimators for Learning to Defer to an Expert. In Proceedings of the 37th International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 119). PMLR, Virtual, 7076–7087. https://proceedings.mlr.press/v119/mozannar20b.html

[30] [40]

National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. The National Academies Press, Washington, DC.

[31] [41]

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2024. Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 37. Curran Associates, Inc., Red Hook, NY, USA, 22 pages. doi:10.52202/079017-4020

[32] [42]

Rishabh Ranjan, Siddharth Grover, Sourav Medya, Venkatesan Chakaravarthy, Yogish Sabharwal, and Sayan Ranu. 2022. GREED: A Neural Framework for Learning Graph Distance Functions. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., Red Hook, NY, USA, 13 pages. https://proceedings.neurips.cc/paper_files/paper/2022/hash/8d...

[33] [43]

Elena Revilla, María Jesús Saenz, Matthias Seifert, and Ye Ma. 2023. Human–artificial intelligence collaboration in prediction: A field experiment in the retail industry. Journal of Management Information Systems 40, 4 (2023), 1071–1098. doi:10.1080/07421222.2023.2267317

[34] [44]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. Curran Associates, Inc., Red Hook, NY, USA, 13 pages. https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html

[36] [46]

Weiyan Shi, Xuewei Wang, Yoo Jung Oh, Jingwen Zhang, Saurav Sahay, and Zhou Yu. 2020. Effects of Persuasive Dialogues: Testing Bot Identities and Inquiry Strategies. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3313831.3376843

[37] [47]

Marta Stelmaszak, Mareike Möhlmann, and Carsten Sørensen. 2025. When Algorithms Delegate to Humans: Exploring Human-Algorithm Interaction at Uber. MIS Quarterly 49, 1 (2025), 305–330. doi:10.25300/MISQ/2024/17911

[38] [48]

Richard H Thaler and Cass R Sunstein. 2021. Nudge: The final edition. Penguin. doi:10.1017/err.2021.61

[39] [49]

Cathy Yang, Kevin Bauer, Xitong Li, and Oliver Hinz. 2025. My Advisor, Her AI, and Me: Evidence from a Field Experiment on Human–AI Collaboration and Investment Decisions. Management Science 72, 1 (2025), 242–264. doi:10.1287/mnsc.2022.03918

[40] [50]

Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/3290605.3300509

[41] [51]

Sangseok You, Cathy Liu Yang, and Xitong Li. 2022. Algorithmic versus Human Advice: Does Presenting Prediction Performance Matter for Algorithm Appreciation? Journal of Management Information Systems 39, 2 (2022), 336–365. doi:10.1080/07421222.2022.2063553

[42] [52]

Yunfeng Zhang, Q. Vera Liao, and Rachel K. E. Bellamy. 2020. Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-assisted Decision Making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT*). Association for Computing Machinery, New York, NY, USA, 11 pages. doi:10.1145/3351095.3372852