Recognition: 2 theorem links · Lean Theorem
AppAgent: Multimodal Agents as Smartphone Users
Pith reviewed 2026-05-17 10:10 UTC · model grok-4.3
The pith
AppAgent lets large language models operate diverse smartphone apps through visual interactions, learning how to use each app from autonomous exploration or human demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.
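To make the "simplified action space" concrete, here is a minimal Python sketch of what such an interface could look like. The action names, parameters, and the constrained-output parser are illustrative assumptions rather than the paper's exact API; the point is that every action is a user-level gesture on a labeled screenshot element, so no back-end hooks are needed.

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical simplified action space: every action is a human-like gesture
# on a numbered UI element taken from an annotated screenshot, so the agent
# never needs system back-end access.

@dataclass
class Tap:
    element: int                    # index of a labeled element on screen

@dataclass
class LongPress:
    element: int

@dataclass
class Swipe:
    element: int
    direction: Literal["up", "down", "left", "right"]
    distance: Literal["short", "medium", "long"]

@dataclass
class InputText:
    text: str                       # typed into the currently focused field

@dataclass
class Back:
    pass                            # system back, still a user-level gesture

Action = Union[Tap, LongPress, Swipe, InputText, Back]

def parse_action(llm_output: str) -> Action:
    """Parse a constrained function-call string emitted by the model,
    e.g. 'tap(3)' or 'swipe(5, "up", "medium")'. Naive comma split;
    fine for a sketch, not for free-form text containing commas."""
    name, _, raw = llm_output.strip().partition("(")
    parts = [p.strip().strip('"') for p in raw.rstrip(")").split(",") if p.strip()]
    if name == "tap":
        return Tap(int(parts[0]))
    if name == "long_press":
        return LongPress(int(parts[0]))
    if name == "swipe":
        return Swipe(int(parts[0]), parts[1], parts[2])
    if name == "text":
        return InputText(parts[0])
    return Back()
```

Constraining the model to emit only short calls such as tap(3) or swipe(5, "up", "medium") is what keeps the action space both executable on-device and app-agnostic.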
Load-bearing premise
The agent can reliably learn to navigate and execute tasks in new apps through autonomous exploration or human demonstrations, producing a knowledge base that generalizes across applications.
read the original abstract
Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
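As a rough illustration of the kind of knowledge base the abstract describes, the sketch below keeps short natural-language notes about UI elements, keyed by app, and surfaces the notes for whatever elements are visible at task time. The file layout, class name, and method signatures are assumptions made for the example, not the paper's implementation.

```python
import json
from pathlib import Path

# Hypothetical per-app knowledge base: free-text notes about what each UI
# element does, accumulated during autonomous exploration or human demos.

class AppKnowledgeBase:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, app: str) -> Path:
        return self.root / f"{app}.json"

    def record(self, app: str, element_id: str, note: str) -> None:
        """Append an observation such as 'tapping this opens the compose screen'."""
        path = self._path(app)
        notes = json.loads(path.read_text()) if path.exists() else {}
        notes.setdefault(element_id, []).append(note)
        path.write_text(json.dumps(notes, indent=2))

    def lookup(self, app: str, element_ids: list[str]) -> str:
        """Return notes for the elements on the current screen, to be pasted
        into the model's prompt at task-execution time."""
        path = self._path(app)
        if not path.exists():
            return ""
        notes = json.loads(path.read_text())
        lines = []
        for eid in element_ids:
            for note in notes.get(eid, []):
                lines.append(f"element {eid}: {note}")
        return "\n".join(lines)
```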
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AppAgent, a novel LLM-based multimodal agent framework for operating smartphone applications. It uses a simplified action space mimicking human interactions such as tapping and swiping, bypassing the need for system back-end access. The agent learns to navigate new apps through autonomous exploration or human demonstrations, generating a knowledge base for executing complex tasks. Extensive testing is reported over 50 tasks in 10 different applications, affirming the agent's proficiency.
Significance. If the results hold, the work would be significant in advancing multimodal agents for real-world mobile interfaces, broadening applicability without backend dependencies. The learning method for building reusable knowledge bases could enable more generalizable agents. Credit is due for the empirical construction approach and focus on practical smartphone use cases.
major comments (2)
- The abstract reports testing on 50 tasks in 10 apps but provides no quantitative results, error analysis, or comparison baselines; the central claim of proficiency rests on high-level description only.
- The generalization of the knowledge base to unseen apps is asserted but not isolated in experiments. No ablation holds the knowledge base fixed while introducing held-out apps with dissimilar UI structures; success on the original 10 does not isolate whether proficiency stems from per-app memorization or cross-app generalization.
minor comments (1)
- Consider adding more details on the exact composition of the 10 applications and the specific tasks to allow better assessment of diversity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address each major comment below and outline the revisions we plan to incorporate.
read point-by-point responses
-
Referee: The abstract reports testing on 50 tasks in 10 apps but provides no quantitative results, error analysis, or comparison baselines; the central claim of proficiency rests on high-level description only.
Authors: We agree that the abstract, being a high-level summary, omits specific quantitative metrics. The full manuscript contains detailed experimental results, including success rates across the 50 tasks, error breakdowns, and baseline comparisons in the evaluation section. We will revise the abstract to include key quantitative highlights, such as overall task success rates and a brief mention of the baselines employed, to better support the proficiency claim. revision: yes
-
Referee: The generalization of the knowledge base to unseen apps is asserted but not isolated in experiments. No ablation holds the knowledge base fixed while introducing held-out apps with dissimilar UI structures; success on the original 10 does not isolate whether proficiency stems from per-app memorization or cross-app generalization.
Authors: We appreciate this observation on isolating generalization effects. Our current evaluation focuses on tasks within the 10 apps after building app-specific knowledge bases via exploration or demonstrations. To address the concern, we will add a new ablation study in the revised manuscript: we will fix the knowledge base from a subset of the original apps and evaluate performance on held-out apps with dissimilar UI structures. This will help distinguish cross-app generalization from per-app memorization. revision: yes
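For readers who want the promised ablation pinned down, a minimal sketch of such a protocol follows. It assumes an agent object with explore and run_task methods, per-app task lists, and a binary success signal; all of these names are hypothetical stand-ins, since the abstract does not specify the evaluation harness.

```python
# Hypothetical ablation: freeze the knowledge base built on a subset of apps,
# then measure success on held-out apps with dissimilar UIs. `agent.explore`,
# `agent.run_task`, and the task lists are assumed interfaces, not the paper's.

def ablation_generalization(agent, knowledge_base, apps, tasks, holdout):
    """apps: all app names; holdout: apps excluded from KB construction.
    tasks: dict mapping app name -> list of task descriptions.
    run_task is assumed to return 1 on success and 0 on failure."""
    seen = [a for a in apps if a not in holdout]

    # Phase 1: build the knowledge base only on the 'seen' apps
    # (autonomous exploration or human demonstrations).
    for app in seen:
        agent.explore(app, knowledge_base)

    # Phase 2: evaluate on held-out apps with the knowledge base frozen.
    results = {}
    for app in holdout:
        outcomes = [agent.run_task(app, task, knowledge_base, update_kb=False)
                    for task in tasks[app]]
        results[app] = sum(outcomes) / max(len(outcomes), 1)
    return results
```

Comparing these held-out success rates against the original per-app numbers is what would separate cross-app generalization from per-app memorization.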
Circularity Check
No circularity: empirical knowledge-base construction from exploration/demos
full rationale
The paper describes an LLM-based agent that builds a reusable knowledge base via autonomous exploration or human demonstrations on smartphone apps, then uses it for task execution without backend access. No equations, fitted parameters, or first-principles derivations are present that reduce any claimed result to its inputs by construction. Evaluation on 50 tasks across 10 apps is framed as direct experimental validation of the constructed system rather than a self-referential prediction. Any self-citations (if present) are not load-bearing for the core empirical claims, which rest on new interaction data rather than prior author results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal LLMs can map screenshots to appropriate tap/swipe actions for app navigation (sketched below)
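Read as code, the axiom amounts to trusting a single function: given a screenshot with numbered elements and a task description, the multimodal model returns one sensible gesture. A hedged sketch of that step, where multimodal_llm is a hypothetical callable standing in for whatever vision-language model is used:

```python
# Hypothetical single decision step: annotated screenshot + task -> one gesture.
# `multimodal_llm` is a stand-in callable; the prompt format is an assumption
# made for illustration, not the paper's prompt.

def decide_next_action(multimodal_llm, screenshot_png: bytes,
                       labeled_elements: list[str], task: str,
                       kb_notes: str = "") -> str:
    """Ask the model to choose exactly one gesture from the simplified
    action space, optionally conditioned on knowledge-base notes."""
    prompt = (
        f"Task: {task}\n"
        f"Numbered UI elements: {', '.join(labeled_elements)}\n"
        f"Known element behaviour:\n{kb_notes}\n"
        "Reply with exactly one of: tap(i), long_press(i), "
        'swipe(i, "direction", "distance"), text("..."), back()'
    )
    return multimodal_llm(image=screenshot_png, text=prompt)
```

If this single mapping is unreliable, exploration, the knowledge base, and task execution all degrade with it.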
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.RealityFromDistinction.reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...
-
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild
ProAgent uses on-demand tiered perception and context-aware LLM reasoning to deliver proactive assistance on AR glasses, achieving up to 27.7% higher prediction accuracy and 20.5% lower false detections than baselines.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
-
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
-
Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Reference graph
Works this paper leans on
-
[5]
Meta FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. 2022. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067--1074
work page 2022
- [6]
-
[7]
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. http://arxiv.org/abs/2307.12856 A real-world webagent with planning, long context understanding, and program synthesis
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [8]
-
[9]
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2023. http://arxiv.org/abs/2308.00352 Metagpt: Meta programming for a multi-agent collaborative framework
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [11]
-
[12]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning
work page 2023
-
[13]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning
work page 2023
-
[14]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2023c. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
OpenAI. 2021. Chatgpt. https://openai.com/research/chatgpt
work page 2021
-
[16]
OpenAI. 2023. http://arxiv.org/abs/2303.08774 Gpt-4 technical report
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1--22
work page 2023
-
[20]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. In Advances in Neural Information Processing Systems
work page 2023
- [22]
-
[23]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. http://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. 2023. http://arxiv.org/abs/2311.05997 Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models
-
[27]
Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. http://arxiv.org/abs/2310.10634 Openagents: An open platform for language agents in the wild
- [30]
-
[33]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In ICLR
work page 2023
-
[35]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. http://arxiv.org/abs/2306.05685 Judging llm-as-a-judge with mt-bench and chatbot arena
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Multimodal Web Navigation with Instruction-Finetuned Foundation Models. 2023.
work page 2023
-
[37]
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models. 2023.
work page 2023
-
[38]
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. 2023.
work page 2023
-
[39]
OpenAgents: An Open Platform for Language Agents in the Wild. 2023.
work page 2023
-
[40]
Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions. 2023.
work page 2023
-
[41]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. 2023.
work page 2023
-
[42]
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data. 2023.
work page 2023
-
[43]
ChartLlama: A Multimodal LLM for Chart Understanding and Generation. 2023.
work page 2023
-
[44]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199.
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions.
-
[46]
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning. arXiv preprint arXiv:2309.07915.
-
[47]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition. 2023.
work page 2023
-
[49]
Improved Baselines with Visual Instruction Tuning.
-
[50]
Visual Instruction Tuning.
-
[51]
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning. 2023.
work page 2023
-
[52]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
work page internal anchor Pith review Pith/arXiv arXiv
-
[53]
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421.
work page internal anchor Pith review Pith/arXiv arXiv
- [54]
-
[55]
LLaMA: Open and Efficient Foundation Language Models. 2023.
work page 2023
-
[56]
Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
work page 2023
-
[57]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. GitHub repository.
work page 2023
-
[58]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023.
work page 2023
- [59]
-
[60]
Training Language Models to Follow Instructions with Human Feedback. 2022.
work page 2022
-
[61]
GLM-130B: An Open Bilingual Pre-trained Model. arXiv preprint arXiv:2210.02414.
work page internal anchor Pith review Pith/arXiv arXiv
-
[62]
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv preprint arXiv:2301.13688.
- [63]
-
[64]
Publications Manual. 1983.
work page 1983
-
[65]
Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243
-
[66]
Galen Andrew and Jianfeng Gao. 2007. Scalable training of ...
work page 2007
-
[67]
Dan Gusfield. 1997.
work page 1997
-
[68]
Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository.
work page 2015
-
[69]
Rie Kubota Ando and Tong Zhang. 2005. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.
work page 2005
-
[70]
James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex ...
work page 1965
-
[71]
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). 2023.
work page 2023
-
[72]
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. 2023.
work page 2023
-
[73]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv preprint arXiv:2307.15818.
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817.
work page internal anchor Pith review Pith/arXiv arXiv
-
[75]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents.
-
[76]
Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
work page internal anchor Pith review Pith/arXiv arXiv
-
[77]
Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.
-
[78]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace.
-
[79]
ChatDev: Communicative Agents for Software Development. arXiv preprint arXiv:2307.07924.
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science. 2022.
work page 2022
-
[81]
Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf. arXiv preprint arXiv:2309.04658.
-
[82]
The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint arXiv:2309.07864.
work page internal anchor Pith review Pith/arXiv arXiv
-
[83]
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning. arXiv preprint arXiv:2312.05230.
-
[84]
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. arXiv preprint arXiv:2311.07562.
-
[85]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv preprint arXiv:2204.01691.
work page internal anchor Pith review Pith/arXiv arXiv
-
[86]
A Generalist Agent. arXiv preprint arXiv:2205.06175.
work page internal anchor Pith review Pith/arXiv arXiv
-
[87]
3D-GPT: Procedural 3D Modeling with Large Language Models. arXiv preprint arXiv:2310.12945.
-
[88]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
-
[89]
Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022
work page 2022
-
[90]
A Systematic Survey of Text Worlds as Embodied Natural Language Environments
Jansen, Peter. A Systematic Survey of Text Worlds as Embodied Natural Language Environments. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.1
-
[91]
A Minimal Computational Improviser Based on Oral Thought
Montfort, Nick and Bartlett Fernandez, Sebastian. A Minimal Computational Improviser Based on Oral Thought. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.2
-
[92]
Volum, Ryan and Rao, Sudha and Xu, Michael and DesGarennes, Gabriel and Brockett, Chris and Van Durme, Benjamin and Deng, Olivia and Malhotra, Akanksha and Dolan, Bill. Craft an Iron Sword: Dynamically Generating Interactive Game Characters by Prompting Large Language Models Tuned on Code. Proceedings of the 3rd Wordplay: When Language Meets Games Worksho...
-
[93]
A Sequence Modelling Approach to Question Answering in Text-Based Games
Furman, Gregory and Toledo, Edan and Shock, Jonathan and Buys, Jan. A Sequence Modelling Approach to Question Answering in Text-Based Games. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.4
-
[94]
Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents
Teodorescu, Laetitia and Yuan, Xingdi and Côté, Marc-Alexandre and Oudeyer, Pierre-Yves. Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.5
discussion (0)