pith. machine review for the scientific record.

arxiv: 2312.13771 · v2 · submitted 2023-12-21 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

AppAgent: Multimodal Agents as Smartphone Users

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 10:10 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: agent, applications, tasks, smartphone, across, agents, apps, complex

The pith

AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The agent sees the phone screen as images and decides on simple actions like taps or swipes to complete tasks. It builds knowledge either by trying apps on its own or by learning from human examples, then uses that knowledge to handle complex jobs in apps like social media, email, maps, shopping, and photo editing. Testing covered 50 tasks across 10 different applications.
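To make the interaction model concrete, the sketch below spells out what a simplified action space and observe-act loop could look like. The names (Action, device, decide) are illustrative placeholders rather than the authors' implementation; the paper specifies only that actions are screen-level taps, swipes, and text input chosen from screenshots.

    from dataclasses import dataclass
    from typing import Literal, Optional

    @dataclass
    class Action:
        kind: Literal["tap", "swipe", "text", "back", "stop"]
        element_id: Optional[int] = None   # numbered UI element on the labeled screenshot
        direction: Optional[str] = None    # for swipe: "up", "down", "left", "right"
        text: Optional[str] = None         # for typing into a focused field

    def run_task(task: str, device, decide, max_steps: int = 30) -> bool:
        """Drive an app through screen-level actions only, with no back-end access."""
        for _ in range(max_steps):
            screenshot = device.capture_screen()   # image of the current UI
            action = decide(task, screenshot)      # multimodal LLM picks the next Action
            if action.kind == "stop":
                return True                        # the model judges the task complete
            device.execute(action)                 # tap/swipe/type on the screen
        return False                               # step budget exhausted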

Core claim

Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.

Load-bearing premise

The agent can reliably learn to navigate and execute tasks in new apps through autonomous exploration or human demonstrations, producing a knowledge base that generalizes across applications.
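A minimal sketch of how such a knowledge base might be assembled and consulted, assuming one short natural-language note per UI element the agent has touched; summarize_effect stands in for a hypothetical LLM call that describes what an interaction did, and none of the names come from the paper.

    class KnowledgeBase:
        """Per-element usage notes gathered during exploration or human demonstrations."""

        def __init__(self):
            self.docs: dict[tuple[str, int], str] = {}   # (app, element_id) -> note

        def record(self, app: str, element_id: int, before, after, summarize_effect):
            # After each explored or demonstrated action: ask the LLM what changed
            # between the before/after screenshots and store the summary.
            self.docs[(app, element_id)] = summarize_effect(before, after)

        def lookup(self, app: str, visible_elements: list[int]) -> str:
            # At execution time, surface prior notes for the elements on screen so
            # they can be injected into the prompt.
            return "\n".join(
                f"Element {i}: {self.docs[(app, i)]}"
                for i in visible_elements
                if (app, i) in self.docs
            )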

read the original abstract

Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AppAgent, a novel LLM-based multimodal agent framework for operating smartphone applications. It uses a simplified action space mimicking human interactions such as tapping and swiping, bypassing the need for system back-end access. The agent learns to navigate new apps through autonomous exploration or human demonstrations, generating a knowledge base for executing complex tasks. Extensive testing is reported over 50 tasks in 10 different applications, affirming the agent's proficiency.

Significance. If the results hold, the work would be significant in advancing multimodal agents for real-world mobile interfaces, broadening applicability without backend dependencies. The learning method for building reusable knowledge bases could enable more generalizable agents. Credit is due for the empirical construction approach and focus on practical smartphone use cases.

major comments (2)
  1. The abstract reports testing on 50 tasks in 10 apps but provides no quantitative results, error analysis, or comparison baselines; the central claim of proficiency rests on high-level description only.
  2. The generalization of the knowledge base to unseen apps is asserted but not isolated in experiments. No ablation holds the knowledge base fixed while introducing held-out apps with dissimilar UI structures; success on the original 10 apps cannot distinguish per-app memorization from cross-app generalization.
minor comments (1)
  1. Consider adding more details on the exact composition of the 10 applications and the specific tasks to allow better assessment of diversity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address each major comment below and outline the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: The abstract reports testing on 50 tasks in 10 apps but provides no quantitative results, error analysis, or comparison baselines; the central claim of proficiency rests on high-level description only.

    Authors: We agree that the abstract, being a high-level summary, omits specific quantitative metrics. The full manuscript contains detailed experimental results, including success rates across the 50 tasks, error breakdowns, and baseline comparisons in the evaluation section. We will revise the abstract to include key quantitative highlights, such as overall task success rates and a brief mention of the baselines employed, to better support the proficiency claim. revision: yes

  2. Referee: The generalization of the knowledge base to unseen apps is asserted but not isolated in experiments. No ablation holds the knowledge base fixed while introducing held-out apps with dissimilar UI structures; success on the original 10 apps cannot distinguish per-app memorization from cross-app generalization.

    Authors: We appreciate this observation on isolating generalization effects. Our current evaluation focuses on tasks within the 10 apps after building app-specific knowledge bases via exploration or demonstrations. To address the concern, we will add a new ablation study in the revised manuscript: we will fix the knowledge base from a subset of the original apps and evaluate performance on held-out apps with dissimilar UI structures. This will help distinguish cross-app generalization from per-app memorization. revision: yes
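A minimal sketch of the ablation described in this response, assuming a hypothetical evaluate(task, app, kb) harness that returns True when the agent completes a task with the frozen knowledge base; it is an outline of the protocol, not the authors' evaluation code.

    def ablation_generalization(kb, seen_apps, heldout_apps, tasks_by_app, evaluate):
        """Success with a frozen knowledge base on seen vs. held-out apps."""
        def success_rate(apps):
            outcomes = [evaluate(task, app, kb)          # True if the task succeeds
                        for app in apps
                        for task in tasks_by_app[app]]
            return sum(outcomes) / max(len(outcomes), 1)

        return {
            "seen": success_rate(seen_apps),             # upper bound from per-app memorization
            "held_out": success_rate(heldout_apps),      # evidence of cross-app generalization
        }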

Circularity Check

0 steps flagged

No circularity: empirical knowledge-base construction from exploration/demos

full rationale

The paper describes an LLM-based agent that builds a reusable knowledge base via autonomous exploration or human demonstrations on smartphone apps, then uses it for task execution without backend access. No equations, fitted parameters, or first-principles derivations are present that reduce any claimed result to its inputs by construction. Evaluation on 50 tasks across 10 apps is framed as direct experimental validation of the constructed system rather than a self-referential prediction. Any self-citations (if present) are not load-bearing for the core empirical claims, which rest on new interaction data rather than prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that current multimodal LLMs can interpret smartphone screen images sufficiently well to generate effective actions and that exploration or demonstration data produces transferable knowledge.

axioms (1)
  • domain assumption: Multimodal LLMs can map screenshots to appropriate tap/swipe actions for app navigation
    Invoked as the basis for the simplified action space and agent operation
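Read concretely, the axiom presupposes a single multimodal call of roughly the following shape: a labeled screenshot plus a task description in, one constrained action string out. client.chat below is a stand-in for any vision-capable LLM API, not a specific library, and the prompt wording is illustrative.

    def decide_action(task: str, screenshot_png: bytes, element_docs: str, client) -> str:
        # The assumption: a multimodal LLM, shown the current screen and any stored
        # element notes, can emit one valid action from the simplified space.
        prompt = (
            f"Task: {task}\n"
            f"Notes on visible elements:\n{element_docs}\n"
            "Answer with exactly one action: tap(<id>), swipe(<id>, <direction>), "
            "text(<string>), back(), or stop()."
        )
        return client.chat(prompt=prompt, image=screenshot_png)   # e.g. "tap(3)"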

pith-pipeline@v0.9.0 · 5476 in / 1039 out tokens · 38886 ms · 2026-05-17T10:10:48.536796+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem:

    Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

    cs.CR 2026-01 unverdicted novelty 8.0

    GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

  2. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  3. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  4. Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...

  5. ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild

    cs.AI 2025-12 conditional novelty 7.0

    ProAgent uses on-demand tiered perception and context-aware LLM reasoning to deliver proactive assistance on AR glasses, achieving up to 27.7% higher prediction accuracy and 20.5% lower false detections than baselines.

  6. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  7. AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

    cs.CV 2026-04 unverdicted novelty 6.0

    AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.

  8. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  9. AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

    cs.AI 2025-12 conditional novelty 6.0

    AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

  10. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    cs.AI 2025-03 accept novelty 6.0

    UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.

  11. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  12. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    cs.CL 2024-01 conditional novelty 6.0

    Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...

  13. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    cs.HC 2024-01 unverdicted novelty 6.0

    SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

  14. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  15. Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    cs.CV 2026-04 unverdicted novelty 5.0

    Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.

  16. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  17. X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

    cs.CV 2026-05 unverdicted novelty 3.0

    X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.

  18. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

286 extracted references · 286 canonical work pages · cited by 17 Pith papers · 19 internal anchors
