An Interactive Agent Foundation Model

Ade Famoti; Arnold Milstein; Ashley Llorens; Bidipta Sarkar; Demetri Terzopoulos; Ehsan Adeli; Hoi Vo; Jianfeng Gao; Katsu Ikeuchi; Kevin Schulman

arxiv: 2402.05929 · v2 · pith:KAYLHN2Dnew · submitted 2024-02-08 · 💻 cs.AI · cs.LG · cs.RO

Zane Durante, Bidipta Sarkar, Ran Gong, Rohan Taori, Yusuke Noda, Paul Tang, Ehsan Adeli, Shrinidhi Kowshika Lakshmikanth, Kevin Schulman, Arnold Milstein, Demetri Terzopoulos, Ade Famoti, Noboru Kuno, Ashley Llorens, Hoi Vo, Katsu Ikeuchi, Li Fei-Fei, Jianfeng Gao, Naoki Wake, Qiuyuan Huang

classification 💻 cs.AI cs.LG cs.RO

keywords agent model systems training across approach data datasets

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VLAs are Confined yet Capable of Generalizing to Novel Instructions
cs.RO 2025-05 unverdicted novelty 7.0

Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
cs.RO 2025-02 accept novelty 6.0

OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

Prune, Update and Trim: Robust Structured Pruning for Large Language Models
cs.LG 2026-05 unverdicted novelty 5.0

Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme ...

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
cs.AI 2024-08 unverdicted novelty 4.0

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.