pith · machine review for the scientific record

arxiv: 2605.11533 · v2 · submitted 2026-05-12 · 💻 cs.CL · cs.CV

Recognition: unknown

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:12 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords: multimodal dataset · action card generation · clinical check-up reports · patient-oriented actions · LLM benchmark · structured generation · medical AI evaluation

The pith

A dataset of 2,000 multimodal check-up reports benchmarks structured action card generation for patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Checkup2Action as a new benchmark dataset containing 2,000 real-world multimodal check-up reports. It defines structured Action Cards that break down each report into individual issues with priority, department, timing, explanations, and questions. Testing various large language models on this task reveals consistent trade-offs between thorough issue coverage, action accuracy, brevity, and safety. This setup allows systematic measurement of how well AI can help patients understand and act on their check-up results without overstepping into diagnosis.

Core claim

Checkup2Action introduces a dataset of 2,000 de-identified multimodal clinical check-up reports paired with structured Action Cards. Each card addresses one issue by specifying its priority level, recommended medical department, follow-up time window, a patient-accessible explanation, and relevant questions for clinicians, without including diagnoses or treatments. The benchmark evaluates models on metrics including issue coverage, priority consistency, recommendation accuracy, complexity, usefulness, readability, and safety. Experiments highlight trade-offs in performance across general and medical LLMs.

What carries the argument

The Action Card format: a structured per-issue output listing priority, department, time window, patient explanation, and clinician questions, all derived from the multimodal elements of the report.
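To make the output format concrete, here is a minimal sketch of what a single Action Card could look like as a typed record. The field names, priority levels, and the validation heuristic are editorial assumptions for illustration; the paper's actual schema and safety rules may differ.

```python
# Hypothetical sketch of one Action Card as a typed record; field names and
# allowed values are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from enum import Enum


class Priority(Enum):
    URGENT = "urgent"
    SOON = "soon"
    ROUTINE = "routine"


@dataclass
class ActionCard:
    issue: str            # one clinically relevant finding, e.g. "elevated LDL cholesterol"
    priority: Priority    # how soon the patient should act
    department: str       # recommended specialty, e.g. "cardiology"
    time_window: str      # follow-up window, e.g. "within 4 weeks"
    explanation: str      # patient-facing, plain-language explanation
    questions: list[str] = field(default_factory=list)  # questions to bring to the clinician

    def validate(self) -> None:
        """Crude guard: cards should suggest actions, not diagnoses or prescriptions."""
        banned = ("diagnos", "prescrib", "you have", "take this drug")
        text = (self.explanation + " " + " ".join(self.questions)).lower()
        if any(term in text for term in banned):
            raise ValueError("Card text looks diagnostic or prescriptive; rewrite it.")
```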

If this is right

  • Models can be compared directly on safety compliance and issue coverage using a shared protocol.
  • The dataset supports development of constrained generation methods suited to clinical evidence.
  • Evaluation metrics provide standardized ways to assess patient-oriented medical summarization.
  • The benchmark exposes specific limitations in current LLMs when handling heterogeneous inputs such as tables, numbers, and images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could support apps that convert reports into personalized follow-up reminders for patients.
  • Integration with electronic health records might increase actual follow-up completion rates.
  • The per-issue card structure may extend to other document types such as discharge summaries or test result packets.

Load-bearing premise

The manually created action cards used as ground truth are clinically accurate and safe.

What would settle it

A study in which multiple independent clinicians generate action cards for the same reports and measure agreement with the dataset annotations or model outputs.

Figures

Figures reproduced from arXiv: 2605.11533 by Amir Atapour-Abarghouei, Jialin Yu, Kevin Qinghong Lin, Philip Torr, Shuang Chen, Sike Xiang, Yijia Sun.

Figure 1: Real-world motivation for Checkup2Action. Patients often receive multimodal …
Figure 2: Overview of the C2A dataset. The dataset is built from real-world multimodal clin…
Figure 3: Overview of the Checkup2Action pipeline. A multimodal check-up report is first …
Figure 4: Subjective evaluation on the Checkup2Action benchmark across problem rele…
Figure 5: Comparison between an unconstrained and a safety-constrained Checkup2Action …
Figure 6: Structured metric comparison between safety-constrained and unconstrained.
Figure 7: Structured metric comparison between GPT-5.1, Gemini-3-pro-preview and MedGemma-27B.
Original abstract

Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, and imaging-related evidence. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Checkup2Action, a multimodal dataset of 2,000 de-identified real-world clinical check-up reports paired with manually created structured Action Cards. Each card specifies one clinically relevant issue along with its priority, recommended department, follow-up time window, patient-facing explanation, and clinician questions. The work formulates checkup-to-action generation as a constrained structured generation task, defines an evaluation protocol covering issue coverage, priority consistency, department/time accuracy, action complexity, usefulness, readability, and safety compliance, and benchmarks general-purpose and medical LLMs to reveal trade-offs among coverage, correctness, conciseness, and safety alignment.
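For concreteness, issue coverage and precision can plausibly be read as set overlap between the issues in the reference cards and the issues in the generated cards. The sketch below assumes a simple normalized-string match; the paper's actual matching rule (exact, fuzzy, or clinician-judged) is not specified in this review, so treat this as an editorial illustration.

```python
# Sketch of issue coverage and precision as set overlap between reference and
# generated issue labels, assuming normalized-string matching (an assumption,
# not the paper's protocol).
def normalize(issue: str) -> str:
    return " ".join(issue.lower().split())


def coverage_and_precision(reference: list[str], generated: list[str]) -> tuple[float, float]:
    ref = {normalize(i) for i in reference}
    gen = {normalize(i) for i in generated}
    matched = ref & gen
    coverage = len(matched) / len(ref) if ref else 1.0   # share of reference issues recovered
    precision = len(matched) / len(gen) if gen else 1.0  # share of generated issues that are real
    return coverage, precision


# Example: two of three reference issues recovered, one spurious issue generated.
cov, prec = coverage_and_precision(
    ["elevated LDL cholesterol", "borderline fasting glucose", "thyroid nodule on ultrasound"],
    ["Elevated LDL cholesterol", "borderline fasting glucose", "low vitamin D"],
)
print(f"coverage={cov:.2f}, precision={prec:.2f}")  # coverage=0.67, precision=0.67
```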

Significance. If the reference Action Cards prove clinically reliable, the dataset and benchmark would fill a clear gap in patient-oriented clinical NLP by providing the first large-scale multimodal resource for translating heterogeneous check-up reports (layouts, tables, biomarkers, imaging) into safe, prioritized, non-diagnostic actions. The explicit release of the dataset together with a reproducible evaluation protocol is a concrete strength that enables community follow-up work.

major comments (1)
  1. [§3] §3 (Dataset Construction): The 2,000 reference Action Cards are described as 'manually created' yet the manuscript supplies no annotation protocol details—number or qualifications of clinicians, blinding, adjudication process, or inter-annotator agreement statistics. Because every reported metric (issue coverage, priority consistency, safety compliance) is computed against these cards, the absence of agreement or validation data makes it impossible to assess whether systematic annotator bias or error affects the LLM comparisons.
minor comments (2)
  1. [Abstract] The abstract and §4 could more explicitly quantify the key experimental findings (e.g., which model achieved the highest safety compliance score) rather than only stating that 'clear trade-offs' exist.
  2. [§5] Figure captions and axis labels in the result plots should be enlarged for readability; several current labels are difficult to read at standard print size.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for identifying an important gap in the transparency of our dataset construction. We address the major comment below and will incorporate the requested details in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The 2,000 reference Action Cards are described as 'manually created' yet the manuscript supplies no annotation protocol details—number or qualifications of clinicians, blinding, adjudication process, or inter-annotator agreement statistics. Because every reported metric (issue coverage, priority consistency, safety compliance) is computed against these cards, the absence of agreement or validation data makes it impossible to assess whether systematic annotator bias or error affects the LLM comparisons.

    Authors: We agree that the current description of the annotation process is insufficient for readers to evaluate the reliability of the reference Action Cards. In the revised manuscript we will expand §3 with a dedicated subsection on annotation protocol. This will specify: the number and clinical qualifications of the annotators (board-certified physicians across relevant specialties), the use of blinding, the step-by-step annotation guidelines, the adjudication procedure for resolving disagreements, and inter-annotator agreement statistics (Cohen’s kappa for priority, department, and time-window fields). These details reflect the actual process used to create the 2,000 cards and will enable direct assessment of potential bias or noise in the reference set. revision: yes
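As an editorial illustration of the promised agreement statistics (not the authors' actual analysis), per-field Cohen's kappa between two annotators could be computed as below. The labels are invented, and the use of scikit-learn is an assumption.

```python
# Minimal sketch of per-field inter-annotator agreement with Cohen's kappa.
# The annotator labels are invented for illustration; the real annotation data
# and field vocabularies are not part of this review.
from sklearn.metrics import cohen_kappa_score

# Two annotators' priority labels for the same ten issues.
annotator_a = ["urgent", "routine", "soon", "routine", "urgent",
               "soon", "routine", "routine", "soon", "urgent"]
annotator_b = ["urgent", "routine", "soon", "soon", "urgent",
               "soon", "routine", "routine", "routine", "urgent"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"priority-field kappa: {kappa:.2f}")  # chance-corrected agreement in [-1, 1]
```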

Circularity Check

0 steps flagged

No circularity: dataset release and empirical benchmark with no derivations or self-referential predictions

Full rationale

The paper introduces a new multimodal dataset of 2,000 check-up reports paired with manually created Action Cards and evaluates LLMs on a structured generation task using custom metrics. No equations, parameter fitting, or predictive derivations appear in the provided text. Claims rest on the dataset construction and experimental results rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The work is self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the contribution is a curated dataset and evaluation framework rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5578 in / 1015 out tokens · 39911 ms · 2026-05-14T21:12:54.601060+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 5 internal anchors

  1. G. C. Araujo, C. B. Ribeiro, M. C. M. Costa, M. L. P. Evangelista, M. F. Lima, M. C. De Paula, V. L. Ferreira, and F. A. G. D. R. Araujo. Evidence-based periodic health examinations for adults: A practical guide. Cureus, 17(3):e79963, 2025. doi: 10.7759/cureus.79963.

  2. Lydie Bednarczyk, Daniel Reichenpfader, Christophe Gaudet-Blavignac, Amon Kenna Ette, Jamil Zaghir, Yuanyuan Zheng, Adel Bensahla, Mina Bjelogrlic, and Christian Lovis. Scientific evidence for clinical text summarization using large language models: Scoping review. Journal of Medical Internet Research, 27:e68998, 2025.

  3. doi: 10.2196/68998. URL: https://doi.org/10.2196/68998 (DOI record for reference [2] above).

  4. Christian Bluethgen, Dave Van Veen, Daniel Truhn, Jakob Nikolas Kather, Michael Moor, Malgorzata Polacin, Akshay Chaudhari, Thomas Frauenfelder, Curtis P. Langlotz, Michael Krauthammer, and Farhad Nooralahzadeh. Agentic systems in radiology: Design, applications, evaluation, and challenges, 2025. URL: https://arxiv.org/abs/2510.09404.

  5. Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web, 2023. URL: https://arxiv.org/abs/2306.06070.

  6. European Society of Radiology (ESR). ESR paper on structured reporting in radiology – update 2023. Insights into Imaging, 14(1):199, 2023. doi: 10.1186/s13244-023-01560-0.

  7. US Preventive Services Task Force. Screening for hypertension in adults: US Preventive Services Task Force reaffirmation recommendation statement. JAMA, 325(16):1650–1656, 2021. doi: 10.1001/jama.2021.4987.

  8. Cesar Abraham Gomez-Cabello, Srinivasagam Prabha, Syed Ali Haider, Ariana Genovese, Bernardo G. Collaco, Nadia G. Wood, Sanjay Bagaria, and Antonio Jorge Forte. Comparative evaluation of advanced chunking for retrieval-augmented generation in large language models for clinical decision support. Bioengineering, 12(11), 2025. doi: 10.339…

  9. Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020. URL: https://arxiv.org/abs/2009.13081.

  10. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro…

  11. Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019.

  12. Alex Z. Kadhim, Zachary Green, Iman Nazari, Jonathan Baker, Michael George, Ashley Heinson, Bhumita Vadgama, Matt Stammers, Christopher M. Kipps, R. Mark Beattie, James J. Ashton, and Sarah Ennis. Application of generative artificial intelligence to utilize unstructured clinical data for acceleration of inflammatory bowel disease research. Med, 7(1):…

  13. Gerardo Lazaro. When positive is negative: Health literacy barriers to patient access to clinical laboratory test results. The Journal of Applied Laboratory Medicine, 8(6):1133–1147, 2023. doi: 10.1093/jalm/jfad045.

  14. Meng Lu, Brandon Ho, Dennis Ren, and Xuan Wang. TriageAgent: Towards better multi-agents collaborations for large language model-based clinical triage. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5747–5764, Miami, Florida, USA, November 2024. Association for…

  15. Mia Liza A. Lustria, Obianuju Aliche, Michael O. Killian, and Zhe He. Enhancing patient engagement and understanding: Is providing direct access to laboratory results through patient portals adequate? JAMIA Open, 8(2):ooaf009, 2025. doi: 10.1093/jamiaopen/ooaf009.

  16. Lars Masanneck, Linea Schmidt, Antonia Seifert, Tristan Kölsche, Niklas Huntemann, Robin Jansen, Mohammed Mehsin, Michael Bernhard, Sven G. Meuth, Lennert Böhm, and Marc Pawlitzki. Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: Comparative study. Journal of Medical Internet Research, 26:e53297, 2024.

  17. O. Petrovskaya, A. Karpman, J. Schilling, S. Singh, L. Wegren, V. Caine, E. Kusi-Appiah, and W. Geen. Patient and health care provider perspectives on patient access to test results via web portals: Scoping review. Journal of Medical Internet Research, 25:e43765, 2023. doi: 10.2196/43765.

  18. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL: https://arxiv.org/abs/2302.04761.

  19. Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan … MedGemma technical report.

  20. Bryan D. Steitz, Robert W. Turer, Chen-Tan Lin, Scott MacDonald, Liz Salmi, Adam Wright, Christoph U. Lehmann, Karen Langford, Samuel A. McDonald, Thomas J. Reese, Paul Sternberg, Qingxia Chen, S. Trent Rosenbloom, and Catherine M. DesRoches. Perspectives of patients about immediate access to test results through an online patient portal. JAMA Network Open…

  21. Bryan D. Steitz, Robert W. Turer, Liz Salmi, Uday Suresh, Scott MacDonald, Catherine M. DesRoches, Adam Wright, Jeremy Louissaint, and S. Trent Rosenbloom. Repeated access to patient portal while awaiting test results and patient-initiated messaging. JAMA Network Open, 8(4):e254019–e254019, 2025. doi: 10.1001/jamanetworkopen.2025.4…

  22. N. W. Sterling, F. Brann, S. O. Frisch, and J. D. Schrager. Patient-readable radiology report summaries generated via large language model: Safety and quality. Journal of Patient Experience, 11, 2024. doi: 10.1177/23743735241259477.

  23. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022. URL: https://arxiv.org/abs/2009.01325.

  24. Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang. A framework for human evaluation of large language models in healthcare derived from literature review, 2…

  25. N. E. Timbrell. The role and limitations of the reference interval within clinical chemistry and its reliability for disease detection. British Journal of Biomedical Science, 81:12339, 2024. doi: 10.3389/bjbs.2024.12339.

  26. F. A. M. van der Mee, F. Schaper, J. Jansen, J. A. P. Bons, S. J. R. Meex, and J. W. L. Cals. Enhancing patient understanding of laboratory test results: Systematic review of presentation formats and their impact on perception, decision, action, and memory. Journal of Medical Internet Research, 26:e53993, 2024. doi: 10.2196/53993.

  27. Dandan Wang and Shiqing Zhang. Large language models in medical and healthcare fields: applications, advances, and challenges. Artificial Intelligence Review, 57(11):299, 2024. doi: 10.1007/s10462-024-10921-0.

  28. Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft…

  29. John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL: https://arxiv.org/abs/2405.15793.

  30. Ziyu Yang, Santhosh Cherian, and Slobodan Vucetic. Data augmentation for radiology report simplification. In Andreas Vlachos and Isabelle Augenstein, editors, Findings of the Association for Computational Linguistics: EACL 2023, pages 1922–1932, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-eacl…

  31. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023. URL: https://arxiv.org/abs/2210.03629.

  32. Jonah Zaretsky, Jeong Min Kim, Samuel Baskharoun, Yunan Zhao, Jonathan Austrian, Yindalon Aphinyanaphongs, Ravi Gupta, Saul B. Blecker, and Jonah Feldman. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA Network Open, 7(3):e240357–e240357, 2024. doi: 10.1001/j…

  33. Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2024. URL: https://arxiv.org/abs/2307.13854.