Recognition: 2 theorem links · Lean Theorem
AppAgent: Multimodal Agents as Smartphone Users
Pith reviewed 2026-05-17 10:10 UTC · model grok-4.3
The pith
AppAgent lets large language models operate diverse smartphone apps through visual interactions, learning how to use each app from autonomous exploration or human demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.
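To make the "simplified action space" concrete, here is a minimal Python sketch of what such an interface could look like. The action names, parameters, and the constrained-output parser are illustrative assumptions rather than the paper's exact API; the point is that every action is a user-level gesture on a labeled screenshot element, so no back-end hooks are needed.

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical simplified action space: every action is a human-like gesture
# on a numbered UI element taken from an annotated screenshot, so the agent
# never needs system back-end access.

@dataclass
class Tap:
    element: int                    # index of a labeled element on screen

@dataclass
class LongPress:
    element: int

@dataclass
class Swipe:
    element: int
    direction: Literal["up", "down", "left", "right"]
    distance: Literal["short", "medium", "long"]

@dataclass
class InputText:
    text: str                       # typed into the currently focused field

@dataclass
class Back:
    pass                            # system back, still a user-level gesture

Action = Union[Tap, LongPress, Swipe, InputText, Back]

def parse_action(llm_output: str) -> Action:
    """Parse a constrained function-call string emitted by the model,
    e.g. 'tap(3)' or 'swipe(5, "up", "medium")'. Naive comma split;
    fine for a sketch, not for free-form text containing commas."""
    name, _, raw = llm_output.strip().partition("(")
    parts = [p.strip().strip('"') for p in raw.rstrip(")").split(",") if p.strip()]
    if name == "tap":
        return Tap(int(parts[0]))
    if name == "long_press":
        return LongPress(int(parts[0]))
    if name == "swipe":
        return Swipe(int(parts[0]), parts[1], parts[2])
    if name == "text":
        return InputText(parts[0])
    return Back()
```

Constraining the model to emit only short calls such as tap(3) or swipe(5, "up", "medium") is what keeps the action space both executable on-device and app-agnostic.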
Load-bearing premise
The agent can reliably learn to navigate and execute tasks in new apps through autonomous exploration or human demonstrations, producing a knowledge base that generalizes across applications.
read the original abstract
Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
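As a rough illustration of the kind of knowledge base the abstract describes, the sketch below keeps short natural-language notes about UI elements, keyed by app, and surfaces the notes for whatever elements are visible at task time. The file layout, class name, and method signatures are assumptions made for the example, not the paper's implementation.

```python
import json
from pathlib import Path

# Hypothetical per-app knowledge base: free-text notes about what each UI
# element does, accumulated during autonomous exploration or human demos.

class AppKnowledgeBase:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, app: str) -> Path:
        return self.root / f"{app}.json"

    def record(self, app: str, element_id: str, note: str) -> None:
        """Append an observation such as 'tapping this opens the compose screen'."""
        path = self._path(app)
        notes = json.loads(path.read_text()) if path.exists() else {}
        notes.setdefault(element_id, []).append(note)
        path.write_text(json.dumps(notes, indent=2))

    def lookup(self, app: str, element_ids: list[str]) -> str:
        """Return notes for the elements on the current screen, to be pasted
        into the model's prompt at task-execution time."""
        path = self._path(app)
        if not path.exists():
            return ""
        notes = json.loads(path.read_text())
        lines = []
        for eid in element_ids:
            for note in notes.get(eid, []):
                lines.append(f"element {eid}: {note}")
        return "\n".join(lines)
```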
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AppAgent, a novel LLM-based multimodal agent framework for operating smartphone applications. It uses a simplified action space mimicking human interactions such as tapping and swiping, bypassing the need for system back-end access. The agent learns to navigate new apps through autonomous exploration or human demonstrations, generating a knowledge base for executing complex tasks. Extensive testing is reported over 50 tasks in 10 different applications, affirming the agent's proficiency.
Significance. If the results hold, the work would be significant in advancing multimodal agents for real-world mobile interfaces, broadening applicability without backend dependencies. The learning method for building reusable knowledge bases could enable more generalizable agents. Credit is due for the empirical construction approach and focus on practical smartphone use cases.
major comments (2)
- The abstract reports testing on 50 tasks in 10 apps but provides no quantitative results, error analysis, or comparison baselines; the central claim of proficiency rests on high-level description only.
- The generalization of the knowledge base to unseen apps is asserted but not isolated in experiments. No ablation holds the knowledge base fixed while introducing held-out apps with dissimilar UI structures; success on the original 10 does not isolate whether proficiency stems from per-app memorization or cross-app generalization.
minor comments (1)
- Consider adding more details on the exact composition of the 10 applications and the specific tasks to allow better assessment of diversity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address each major comment below and outline the revisions we plan to incorporate.
read point-by-point responses
-
Referee: The abstract reports testing on 50 tasks in 10 apps but provides no quantitative results, error analysis, or comparison baselines; the central claim of proficiency rests on high-level description only.
Authors: We agree that the abstract, being a high-level summary, omits specific quantitative metrics. The full manuscript contains detailed experimental results, including success rates across the 50 tasks, error breakdowns, and baseline comparisons in the evaluation section. We will revise the abstract to include key quantitative highlights, such as overall task success rates and a brief mention of the baselines employed, to better support the proficiency claim. revision: yes
-
Referee: The generalization of the knowledge base to unseen apps is asserted but not isolated in experiments. No ablation holds the knowledge base fixed while introducing held-out apps with dissimilar UI structures; success on the original 10 does not isolate whether proficiency stems from per-app memorization or cross-app generalization.
Authors: We appreciate this observation on isolating generalization effects. Our current evaluation focuses on tasks within the 10 apps after building app-specific knowledge bases via exploration or demonstrations. To address the concern, we will add a new ablation study in the revised manuscript: we will fix the knowledge base from a subset of the original apps and evaluate performance on held-out apps with dissimilar UI structures. This will help distinguish cross-app generalization from per-app memorization. revision: yes
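For readers who want the promised ablation pinned down, a minimal sketch of such a protocol follows. It assumes an agent object with explore and run_task methods, per-app task lists, and a binary success signal; all of these names are hypothetical stand-ins, since the abstract does not specify the evaluation harness.

```python
# Hypothetical ablation: freeze the knowledge base built on a subset of apps,
# then measure success on held-out apps with dissimilar UIs. `agent.explore`,
# `agent.run_task`, and the task lists are assumed interfaces, not the paper's.

def ablation_generalization(agent, knowledge_base, apps, tasks, holdout):
    """apps: all app names; holdout: apps excluded from KB construction.
    tasks: dict mapping app name -> list of task descriptions.
    run_task is assumed to return 1 on success and 0 on failure."""
    seen = [a for a in apps if a not in holdout]

    # Phase 1: build the knowledge base only on the 'seen' apps
    # (autonomous exploration or human demonstrations).
    for app in seen:
        agent.explore(app, knowledge_base)

    # Phase 2: evaluate on held-out apps with the knowledge base frozen.
    results = {}
    for app in holdout:
        outcomes = [agent.run_task(app, task, knowledge_base, update_kb=False)
                    for task in tasks[app]]
        results[app] = sum(outcomes) / max(len(outcomes), 1)
    return results
```

Comparing these held-out success rates against the original per-app numbers is what would separate cross-app generalization from per-app memorization.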
Circularity Check
No circularity: empirical knowledge-base construction from exploration/demos
full rationale
The paper describes an LLM-based agent that builds a reusable knowledge base via autonomous exploration or human demonstrations on smartphone apps, then uses it for task execution without backend access. No equations, fitted parameters, or first-principles derivations are present that reduce any claimed result to its inputs by construction. Evaluation on 50 tasks across 10 apps is framed as direct experimental validation of the constructed system rather than a self-referential prediction. Any self-citations (if present) are not load-bearing for the core empirical claims, which rest on new interaction data rather than prior author results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal LLMs can map screenshots to appropriate tap/swipe actions for app navigation (sketched below)
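Read as code, the axiom amounts to trusting a single function: given a screenshot with numbered elements and a task description, the multimodal model returns one sensible gesture. A hedged sketch of that step, where multimodal_llm is a hypothetical callable standing in for whatever vision-language model is used:

```python
# Hypothetical single decision step: annotated screenshot + task -> one gesture.
# `multimodal_llm` is a stand-in callable; the prompt format is an assumption
# made for illustration, not the paper's prompt.

def decide_next_action(multimodal_llm, screenshot_png: bytes,
                       labeled_elements: list[str], task: str,
                       kb_notes: str = "") -> str:
    """Ask the model to choose exactly one gesture from the simplified
    action space, optionally conditioned on knowledge-base notes."""
    prompt = (
        f"Task: {task}\n"
        f"Numbered UI elements: {', '.join(labeled_elements)}\n"
        f"Known element behaviour:\n{kb_notes}\n"
        "Reply with exactly one of: tap(i), long_press(i), "
        'swipe(i, "direction", "distance"), text("..."), back()'
    )
    return multimodal_llm(image=screenshot_png, text=prompt)
```

If this single mapping is unreliable, exploration, the knowledge base, and task execution all degrade with it.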
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.RealityFromDistinction.reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...
-
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild
ProAgent uses on-demand tiered perception and context-aware LLM reasoning to deliver proactive assistance on AR glasses, achieving up to 27.7% higher prediction accuracy and 20.5% lower false detections than baselines.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
-
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
-
Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Reference graph
Works this paper leans on
-
[5]
Meta FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. 2022. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067--1074
work page 2022
- [6]
-
[7]
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. http://arxiv.org/abs/2307.12856 A real-world webagent with planning, long context understanding, and program synthesis
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [8]
-
[9]
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2023. http://arxiv.org/abs/2308.00352 Metagpt: Meta programming for a multi-agent collaborative framework
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [11]
-
[12]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning
work page 2023
-
[13]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning
work page 2023
-
[14]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2023c. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
OpenAI. 2021. Chatgpt. https://openai.com/research/chatgpt
work page 2021
-
[16]
OpenAI. 2023. http://arxiv.org/abs/2303.08774 Gpt-4 technical report
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1--22
work page 2023
-
[20]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. In Advances in Neural Information Processing Systems
work page 2023
- [22]
-
[23]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. http://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. 2023. http://arxiv.org/abs/2311.05997 Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models
-
[27]
Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. http://arxiv.org/abs/2310.10634 Openagents: An open platform for language agents in the wild
- [30]
-
[33]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In ICLR
work page 2023
-
[35]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. http://arxiv.org/abs/2306.05685 Judging llm-as-a-judge with mt-bench and chatbot arena
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Multimodal Web Navigation with Instruction-Finetuned Foundation Models. 2023.
work page 2023
-
[37]
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models. 2023.
work page 2023
-
[38]
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. 2023.
work page 2023
-
[39]
OpenAgents: An Open Platform for Language Agents in the Wild. 2023.
work page 2023
-
[40]
Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions. 2023.
work page 2023
-
[41]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. 2023.
work page 2023
-
[42]
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data. 2023.
work page 2023
-
[43]
ChartLlama: A Multimodal LLM for Chart Understanding and Generation. 2023.
work page 2023
-
[44]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199.
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions.
-
[46]
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning. arXiv preprint arXiv:2309.07915.
-
[47]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition. 2023.
work page 2023
-
[49]
Improved Baselines with Visual Instruction Tuning.
-
[50]
Visual Instruction Tuning.
-
[51]
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning. 2023.
work page 2023
-
[52]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
work page internal anchor Pith review Pith/arXiv arXiv
-
[53]
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421.
work page internal anchor Pith review Pith/arXiv arXiv
- [54]
-
[55]
LLaMA: Open and Efficient Foundation Language Models. 2023.
work page 2023
-
[56]
Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
work page 2023
-
[57]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. GitHub repository.
work page 2023
-
[58]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023.
work page 2023
- [59]
-
[60]
Training Language Models to Follow Instructions with Human Feedback. 2022.
work page 2022
-
[61]
GLM-130B: An Open Bilingual Pre-trained Model. arXiv preprint arXiv:2210.02414.
work page internal anchor Pith review Pith/arXiv arXiv
-
[62]
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv preprint arXiv:2301.13688.
- [63]
-
[64]
Publications Manual. 1983.
work page 1983
-
[65]
Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243
-
[66]
Galen Andrew and Jianfeng Gao. 2007. Scalable training of ...
work page 2007
-
[67]
Dan Gusfield. 1997.
work page 1997
-
[68]
Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository.
work page 2015
-
[69]
Rie Kubota Ando and Tong Zhang. 2005. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.
work page 2005
-
[70]
James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex ...
work page 1965
-
[71]
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). 2023.
work page 2023
-
[72]
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. 2023.
work page 2023
-
[73]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv preprint arXiv:2307.15818.
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817.
work page internal anchor Pith review Pith/arXiv arXiv
-
[75]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents.
-
[76]
Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
work page internal anchor Pith review Pith/arXiv arXiv
-
[77]
Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.
-
[78]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace.
-
[79]
ChatDev: Communicative Agents for Software Development. arXiv preprint arXiv:2307.07924.
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science. 2022.
work page 2022
-
[81]
Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf. arXiv preprint arXiv:2309.04658.
-
[82]
The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint arXiv:2309.07864.
work page internal anchor Pith review Pith/arXiv arXiv
-
[83]
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning. arXiv preprint arXiv:2312.05230.
-
[84]
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. arXiv preprint arXiv:2311.07562.
-
[85]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv preprint arXiv:2204.01691.
work page internal anchor Pith review Pith/arXiv arXiv
-
[86]
A Generalist Agent. arXiv preprint arXiv:2205.06175.
work page internal anchor Pith review Pith/arXiv arXiv
-
[87]
3D-GPT: Procedural 3D Modeling with Large Language Models. arXiv preprint arXiv:2310.12945.
-
[88]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
-
[89]
Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022
work page 2022
-
[90]
A Systematic Survey of Text Worlds as Embodied Natural Language Environments
Jansen, Peter. A Systematic Survey of Text Worlds as Embodied Natural Language Environments. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.1
-
[91]
A Minimal Computational Improviser Based on Oral Thought
Montfort, Nick and Bartlett Fernandez, Sebastian. A Minimal Computational Improviser Based on Oral Thought. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.2
-
[92]
Volum, Ryan and Rao, Sudha and Xu, Michael and DesGarennes, Gabriel and Brockett, Chris and Van Durme, Benjamin and Deng, Olivia and Malhotra, Akanksha and Dolan, Bill. Craft an Iron Sword: Dynamically Generating Interactive Game Characters by Prompting Large Language Models Tuned on Code. Proceedings of the 3rd Wordplay: When Language Meets Games Worksho...
-
[93]
A Sequence Modelling Approach to Question Answering in Text-Based Games
Furman, Gregory and Toledo, Edan and Shock, Jonathan and Buys, Jan. A Sequence Modelling Approach to Question Answering in Text-Based Games. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.4
-
[94]
Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents
Teodorescu, Laetitia and Yuan, Xingdi and Côté, Marc-Alexandre and Oudeyer, Pierre-Yves. Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.5
discussion (0)