Agent AI: Surveying the Horizons of Multimodal Interaction

Bidipta Sarkar; Demetri Terzopoulos; Hoi Vo; Jae Sung Park; Jianfeng Gao; Katsushi Ikeuchi; Li Fei-Fei; Naoki Wake; Qiuyuan Huang; Ran Gong

arxiv: 2401.03568 · v2 · pith:OUVZHD3Lnew · submitted 2024-01-07 · 💻 cs.AI · cs.HC · cs.LG

Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao

Pith reviewed 2026-05-18 14:20 UTC · model grok-4.3

classification 💻 cs.AI · cs.HC · cs.LG

keywords Agent AI · Multimodal Interaction · Embodied Agents · Foundation Models · Hallucinations · Grounded Environments · Virtual Reality

The pith

Developing agentic AI systems in grounded environments mitigates hallucinations in large foundation models by producing environmentally accurate outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Agent AI as interactive systems that perceive visual stimuli, language inputs, and other environmentally grounded data to produce meaningful embodied actions. It surveys how existing foundation models serve as building blocks for such agents in physical and virtual settings. The central argument is that this embodiment improves next-action prediction through external knowledge, multi-sensory inputs, and human feedback. A sympathetic reader would care because it offers a path to more context-aware multimodal systems that avoid generating outputs mismatched with their surroundings. The work envisions users easily creating simulated scenes for interaction with embodied agents.

Core claim

By developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions, with agents that can perceive user actions, human behavior, environmental objects, audio expressions, and scene sentiment to inform responses.

What carries the argument

Agent AI as the class of interactive systems that perceive visual stimuli, language inputs, and environmentally-grounded data to produce meaningful embodied actions, carrying the argument by improving next-embodied action prediction via external knowledge, multi-sensory inputs, and human feedback.

If this is right

Agents gain the ability to interpret user actions, human behavior, and scene sentiment to direct context-appropriate responses.
Multimodal systems become more sophisticated by processing visual, language, and other grounded data together.
The approach subsumes embodied and agentic aspects of multimodal interactions beyond isolated language or vision tasks.
Users can create arbitrary virtual reality or simulated scenes and interact directly with agents embodied inside them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This grounding strategy might extend to reducing other model failures such as inconsistent reasoning across repeated queries in the same scene.
It suggests a route to safer deployment in robotics or simulation training where environmental mismatch carries real costs.
Future work could test whether feedback loops from embodied actions improve model performance faster than additional pretraining data alone.

Load-bearing premise

Embedding today's AI models as agents inside physical or simulated worlds will ground their outputs enough to reduce made-up or mismatched information without redesigning the models or their training.

What would settle it

Deploy an Agent AI system in a controlled environment and check whether it still describes absent objects or generates actions that contradict visible scene elements; persistent mismatches would falsify the grounding claim.
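One way to make this check concrete is sketched below. It is only an illustration, not a procedure from the paper: the function names, the idea of representing a scene as a flat list of object names, and the representation of actions as (verb, target) pairs are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def _norm(names: Iterable[str]) -> Set[str]:
    """Lowercase and strip names so 'Mug ' and 'mug' compare equal."""
    return {n.strip().lower() for n in names}


def absent_object_rate(mentioned: Iterable[str], scene_objects: Iterable[str]) -> float:
    """Fraction of distinct mentioned objects that are NOT in the scene.

    Hypothetical metric: 0.0 means every mentioned object is present,
    1.0 means every mentioned object is absent.
    """
    mentioned_set = _norm(mentioned)
    if not mentioned_set:
        return 0.0  # nothing mentioned, so nothing can be hallucinated
    return len(mentioned_set - _norm(scene_objects)) / len(mentioned_set)


def contradicting_actions(
    actions: Iterable[Tuple[str, str]], scene_objects: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return the (verb, target) actions whose target is not in the scene."""
    scene = _norm(scene_objects)
    return [(verb, target) for verb, target in actions
            if target.strip().lower() not in scene]
```

Under these assumptions, a rate that stays well above zero across many controlled scenes, or a non-empty list of contradicting actions, would count as the persistent mismatch described above.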

Original abstract

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a survey that defines Agent AI as embodied multimodal systems and argues grounding can reduce hallucinations in foundation models, but the support stays conceptual and example-driven.

This paper defines Agent AI as interactive systems that take in visual, language, and other grounded inputs then output embodied actions. It pulls together work on multimodal models and robotics to sketch a future where agents operate in physical or virtual scenes, using feedback and multi-sensory data for better next-action prediction. The authors also claim this setup can cut hallucinations by keeping outputs tied to real environmental context rather than letting models drift into invented details.

Referee Report

2 major / 2 minor

Summary. The manuscript is a survey that defines 'Agent AI' as a class of interactive multimodal systems capable of perceiving visual stimuli, language inputs, and other environmentally-grounded data to produce meaningful embodied actions. It reviews systems that leverage existing foundation models for embodied agents in physical and virtual environments, emphasizing next-embodied-action prediction with multi-sensory inputs, external knowledge, and human feedback. The central argument is that grounding foundation models as agents in such environments can mitigate hallucinations and environmentally incorrect outputs, while envisioning future applications in VR and simulated scenes where users interact with embodied agents.

Significance. If the conceptual framework holds, the survey could organize research directions in multimodal embodied AI by subsuming embodied and agentic aspects under a single 'Agent AI' umbrella and highlighting embodiment as a route to address foundation-model limitations without major architectural overhauls. Its forward-looking vision for virtual environments adds relevance for interactive applications. As a high-level synthesis without new empirical results or formal derivations, its primary contribution is definitional and directional rather than evidentiary.

major comments (2)

Abstract: The claim that 'by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs' is presented as a forward-looking benefit but rests on illustrative reasoning rather than a mechanistic account or citations to empirical demonstrations of reduced hallucination rates in embodied versus non-embodied settings. This load-bearing argument for the value of Agent AI would be strengthened by explicit discussion of feedback loops or prediction mechanisms that enforce environmental consistency.
Definition and exploration sections: The distinction between Agent AI and prior embodied AI or multimodal agent work is not sharply delineated; the definition incorporates 'next-embodied action prediction' with external knowledge and feedback, yet it remains unclear whether this introduces novel technical requirements or largely overlaps with existing reinforcement-learning or vision-language-action models reviewed later in the manuscript.

minor comments (2)

The manuscript would benefit from a summary table or taxonomy classifying the reviewed multimodal agent systems by sensory modalities, grounding mechanisms, and hallucination-mitigation strategies to improve readability and synthesis value.
Some citations to foundational work on embodied AI (e.g., specific vision-language-action models or simulation platforms) appear illustrative; ensuring comprehensive coverage of recent benchmarks on hallucination in grounded settings would strengthen the literature review.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey manuscript. The comments help clarify the presentation of our central arguments and the positioning of Agent AI relative to existing literature. We address each major comment below and indicate the revisions we will make.

Referee: Abstract: The claim that 'by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs' is presented as a forward-looking benefit but rests on illustrative reasoning rather than a mechanistic account or citations to empirical demonstrations of reduced hallucination rates in embodied versus non-embodied settings. This load-bearing argument for the value of Agent AI would be strengthened by explicit discussion of feedback loops or prediction mechanisms that enforce environmental consistency.

Authors: We agree that the abstract claim would benefit from a more explicit mechanistic discussion. In the revised manuscript we will expand the abstract slightly and add a new paragraph in the introduction that outlines concrete mechanisms: closed-loop environmental feedback (where predicted next actions are validated against observed state changes), multi-sensory consistency checks, and human-in-the-loop corrections. We will also cite relevant empirical studies from the vision-language-action and embodied RL literature that report measurable reductions in environmentally inconsistent outputs when grounding is applied. These additions will be supported by references already present in the survey plus two or three additional citations. revision: yes
Referee: Definition and exploration sections: The distinction between Agent AI and prior embodied AI or multimodal agent work is not sharply delineated; the definition incorporates 'next-embodied action prediction' with external knowledge and feedback, yet it remains unclear whether this introduces novel technical requirements or largely overlaps with existing reinforcement-learning or vision-language-action models reviewed later in the manuscript.

Authors: We acknowledge that the current definition section could draw sharper boundaries. In the revision we will insert an explicit comparison subsection (or table) that contrasts Agent AI with (i) classical embodied AI focused on physical control without foundation-model backbones, (ii) multimodal agents that operate primarily in digital interfaces without embodiment, and (iii) standard RL/VLA pipelines. We will clarify that while technical components such as policy learning overlap, the Agent AI framing uniquely emphasizes the integration of next-embodied-action prediction with external knowledge retrieval, multi-sensory fusion, and human feedback as a single coherent research program. This does not claim entirely new primitives but rather a unifying lens for future work; we will state this limitation explicitly. revision: yes
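The closed-loop environmental feedback named in the first response (predicted next actions validated against observed state changes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' mechanism: `run_closed_loop`, the dictionary state representation, and the toy door environment are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]


def run_closed_loop(
    policy: Callable[[State], Optional[str]],
    predict: Callable[[State, str], State],
    step: Callable[[State, str], State],
    state: State,
    max_steps: int = 10,
) -> Tuple[State, List[Tuple[int, str]]]:
    """Act in an environment and record every step where the agent's
    predicted next state disagrees with the state actually observed."""
    mismatches: List[Tuple[int, str]] = []
    for t in range(max_steps):
        action = policy(state)
        if action is None:  # policy signals that it is done
            break
        expected = predict(state, action)  # what the agent believes will happen
        observed = step(state, action)     # what the environment actually does
        if expected != observed:
            mismatches.append((t, action))
        state = observed                   # always continue from the observation
    return state, mismatches


# Toy environment (assumption for illustration): a door that may be locked.
def toy_policy(state: State) -> Optional[str]:
    return None if state["door"] == "open" else "open"


def toy_predict(state: State, action: str) -> State:
    # The agent naively believes "open" always succeeds.
    return {**state, "door": "open"}


def toy_step(state: State, action: str) -> State:
    # A locked door stays locked; any other door opens.
    return dict(state) if state["door"] == "locked" else {**state, "door": "open"}
```

In this toy setup, starting from a locked door produces a mismatch on every step, which is the kind of signal a grounded agent could use to revise its belief instead of continuing to assert that the door is open.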

Circularity Check

0 steps flagged

No significant circularity; survey paper with no derivations or fitted predictions

The manuscript is a high-level survey defining 'Agent AI' as interactive systems that perceive multimodal inputs and produce embodied actions. It reviews existing work and argues that grounding foundation models in physical/virtual environments can mitigate hallucinations via next-embodied-action prediction and feedback. No equations, parameters, predictions, or self-citation chains appear. Central claims are forward-looking arguments, not reductions of outputs to inputs by construction. The paper is self-contained as a review and does not rely on internal derivations that collapse to fitted quantities or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The survey relies on background assumptions from foundation model literature and embodied AI without introducing new fitted parameters or ungrounded entities beyond the definitional framing of Agent AI.

axioms (1)

domain assumption: Existing foundation models can serve as effective building blocks for embodied agents when placed in grounded environments.
Invoked in the abstract when stating that systems leverage foundation models for agent creation.

invented entities (1)

Agent AI (no independent evidence)
purpose: A class of interactive multimodal systems for embodied action prediction.
New term defined to subsume embodied and agentic aspects of multimodal interactions.

pith-pipeline@v0.9.0 · 5851 in / 1220 out tokens · 41795 ms · 2026-05-18T14:20:16.027518+00:00 · methodology

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents
cs.CL 2026-06 conditional novelty 7.0

Introduces APB benchmark with 4209 cases across 22 domains to diagnose planning in 12 MLLMs and shows it improves downstream execution when used for refinement.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
cs.CL 2026-05 unverdicted novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
cs.AR 2026-04 conditional novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
Towards Considerate Human-Robot Coexistence: A Dual-Space Framework of Robot Design and Human Perception in Healthcare
cs.RO 2026-04 unverdicted novelty 7.0

A dual-space framework models healthcare human-robot coexistence as a co-evolving loop between robot design and four human perception dimensions, positioning humans as active interpreters and mediators.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
cs.RO 2026-02 unverdicted novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G
cs.AI 2025-12 unverdicted novelty 7.0

SANet uses semantic-aware AI agents for cross-layer 6G optimization, achieving up to 14.61% performance gains with 44.37% of the FLOPs of prior methods via model partitioning and decentralized multi-objective algorithms.
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
cs.AI 2025-05 unverdicted novelty 7.0

UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
Do Recommendation Algorithms Work When Users Are LLM Agents? A Case Study on Moltbook
cs.IR 2026-06 unverdicted novelty 6.0

On the Moltbook platform populated by LLM agents, popularity-based and item-side collaborative filtering methods outperform user-representation techniques for predicting next forum engagement.
Self-Evolving Agentic Image Restoration via Deliberate Planning and Intuitive Execution
cs.CV 2026-06 unverdicted novelty 6.0

SEAR introduces a dual-process agentic framework for image restoration that combines pruning-aware MCTS planning with self-evolving episodic memory to address greedy search and episodic amnesia limitations.
RAPID: A Reproducible Multi-Agent Pipeline for Interpretable Disaster Damage Assessment from Satellite and Street-View Imagery
cs.CV 2026-06 unverdicted novelty 6.0

RAPID is a multi-agent pipeline for zero-shot interpretable damage assessment and reporting from cross-view satellite and street-view imagery across multiple disaster types.
AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents
cs.CL 2026-06 unverdicted novelty 6.0

AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.
CHAL: Council of Hierarchical Agentic Language
cs.AI 2026-05 unverdicted novelty 6.0

CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue
cs.CL 2026-04 conditional novelty 6.0

Context-Agent represents dialogue history as a dynamic tree to handle non-linear topic shifts and introduces the NTM benchmark for evaluating long-horizon non-linear dialogues.
MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
cs.ET 2026-03 unverdicted novelty 6.0

MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.
Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
cs.AI 2025-10 unverdicted novelty 6.0

Introduces host agent and task lifecycle models plus 30 temporal logic properties to enable formal verification of liveness, safety, completeness, and fairness in agentic AI systems.
When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks
cs.HC 2025-10 conditional novelty 6.0

A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versu...
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery
cs.AI 2026-05 unverdicted novelty 5.0

EvoSci combines evolutionary multi-agent collaboration with knowledge graphs to produce scientific ideas that score higher on LLM peer-review metrics than baselines.
Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling
cs.HC 2026-05 unverdicted novelty 5.0

CTEM framework links behavioral history to evolving emotional states with user feedback updates, instantiated as Auri agent and tested in a 21-day study showing gains in naturalness, coherence, and emotional harmony.
Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents
physics.soc-ph 2026-05 unverdicted novelty 5.0

LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
LanG -- A Governance-Aware Agentic AI Platform for Unified Security Operations
cs.CR 2026-04 unverdicted novelty 5.0

LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open...
Semantic-Aware Logical Reasoning via a Semiotic Framework
cs.AI 2025-09 conditional novelty 5.0

LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
How Far Are We from Generating Missing Modalities with Foundation Models?
cs.MM 2025-06 unverdicted novelty 5.0

Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that r...
Towards an Agent-First Web: Redesigning the Web for AI Agents
cs.AI 2026-06 unverdicted novelty 4.0

Proposes ten design principles for an agent-first web with changes to access (agent identification and dual content), economics (intent-based tiers and tokens), and content (ATML and provenance chains) to address bloc...
DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization
cs.DC 2026-03 unverdicted novelty 4.0

DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than exist...
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Large Language Model-Brained GUI Agents: A Survey
cs.AI 2024-11 unverdicted novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
Agent System Operations: Categorization, Challenges, and Future Directions
cs.MA 2026-06 unverdicted novelty 3.0

This survey categorizes anomalies in agent systems into intra-agent and inter-agent types and introduces the AgentOps framework with four operational stages.
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
cs.CL 2025-03 accept novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
A Survey on the Memory Mechanism of Large Language Model based Agents
cs.AI 2024-04 accept novelty 3.0

A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
LLM-Powered AI Agent Systems and Their Applications in Industry
cs.AI 2025-05 unverdicted novelty 2.0

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
cs.AI 2025-03 unverdicted novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 34 Pith papers · 44 internal anchors

[1]

2021 IEEE/SICE International Symposium on System Integration (SII) , year=

A Learning-from-Observation Framework: One-Shot Robot Teaching for Grasp-Manipulation-Release Household Operations , author=. 2021 IEEE/SICE International Symposium on System Integration (SII) , year=

work page 2021
[2]

The International Journal of Robotics Research , volume =

Katsushi Ikeuchi and Naoki Wake and Kazuhiro Sasabuchi and Jun Takamatsu , title =. The International Journal of Robotics Research , volume =. 0 , doi =

work page
[3]

arXiv preprint arXiv:2304.09966 , year=

Applying Learning-from-observation to household service robots: three common-sense formulation , author=. arXiv preprint arXiv:2304.09966 , year=

work page arXiv
[4]

arXiv preprint arXiv:2310.15319 , year=

Hallucination Detection for Grounded Instruction Generation , author=. arXiv preprint arXiv:2310.15319 , year=

work page arXiv
[5]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Hierarchical object-to-zone graph for object navigation , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[6]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Poni: Potential functions for objectgoal navigation with interaction-free learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[7]

Science Robotics , volume=

Navigating to objects in the real world , author=. Science Robotics , volume=. 2023 , publisher=

work page 2023
[8]

Objectnav revisited: On evaluation of embodied agents navigating to objects

Objectnav revisited: On evaluation of embodied agents navigating to objects , author=. arXiv preprint arXiv:2006.13171 , year=

work page arXiv 2006
[9]

Advances in Neural Information Processing Systems , volume=

Object goal navigation using goal-oriented semantic exploration , author=. Advances in Neural Information Processing Systems , volume=

work page
[10]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[11]

2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Visual language maps for robot navigation , author=. 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2023 , organization=

work page 2023
[12]

cat-shaped mug

Can an embodied agent find your" cat-shaped mug"? llm-based zero-shot object navigation , author=. arXiv preprint arXiv:2303.03480 , year=

work page arXiv
[13]

arXiv preprint arXiv:2309.10309 , year=

Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill , author=. arXiv preprint arXiv:2309.10309 , year=

work page arXiv
[14]

arXiv preprint arXiv:2306.10322 , year=

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation , author=. arXiv preprint arXiv:2306.10322 , year=

work page arXiv
[15]

Clip-nav: Using clip for zero-shot vision-and-language navigation.arXiv preprint arXiv:2211.16649, 2022

Clip-nav: Using clip for zero-shot vision-and-language navigation , author=. arXiv preprint arXiv:2211.16649 , year=

work page arXiv
[16]

Robot Operating System (ROS) The Complete Reference (Volume 1) , pages=

ROS navigation: Concepts and tutorial , author=. Robot Operating System (ROS) The Complete Reference (Volume 1) , pages=. 2016 , publisher=

work page 2016
[17]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[18]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Manipulathor: A framework for visual object manipulation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[19]

Visual Instruction Tuning , author=

work page
[20]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[21]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Advances in neural information processing systems , volume=

Deep fragment embeddings for bidirectional image sentence mapping , author=. Advances in neural information processing systems , volume=

work page
[23]

Explicit Knowledge-based Reasoning for Visual Question Answering

Explicit knowledge-based reasoning for visual question answering , author=. arXiv preprint arXiv:1511.02570 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

2023 , eprint=

AgentTuning: Enabling Generalized Agent Abilities for LLMs , author=. 2023 , eprint=

work page 2023
[25]

Advances in Neural Information Processing Systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

work page
[26]

Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 , pages=

Modeling context in referring expressions , author=. Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 , pages=. 2016 , organization=

work page 2016
[27]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[28]

Advances in neural information processing systems , volume=

Exploring models and data for image question answering , author=. Advances in neural information processing systems , volume=

work page
[29]

Proceedings of the IEEE international conference on computer vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[30] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023.
[31] Explore, Establish, Exploit: Red Teaming Language Models from Scratch. arXiv preprint arXiv:2306.09442.
[32] Bousmalis, G... RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation. arXiv preprint arXiv:2306.11706.
[33] AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv preprint arXiv:1712.05474.
[34] Interactive Robot Learning from Verbal Correction. arXiv preprint arXiv:2310.17555.
[35] Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections. arXiv preprint arXiv:2311.10678.
[36] Fast Model Identification via Physics Engines for Data-Efficient Policy Search. arXiv preprint arXiv:1710.08893.
[37] TuneNet: One-shot residual tuning for system identification and sim-to-real robot task transfer. Conference on Robot Learning, 2020.
[38] Neurosymbolic AI: The 3rd Wave. 2020.
[39] RetinaGAN: An object-aware approach to sim-to-real transfer. 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
[40] RL-CycleGAN: Reinforcement learning aware simulation-to-real. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[41] Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision.
[42] Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[43] Saito, Daichi; Sasabuchi, Kazuhiro; Wake, Naoki; Takamatsu, Jun; Koike, Hideki; Ikeuchi, Katsushi. Task-grasping from a demonstrated human strategy.
[44] Task-sequencing simulator: Integrated machine learning to execution simulation for robot manipulation. arXiv preprint arXiv:2301.01382.
[45] AirSim: High-fidelity visual and physical simulation for autonomous vehicles. Field and Service Robotics: Results of the 11th International Conference, 2018.
[46] Sim4CV: A photo-realistic simulator for computer vision applications. International Journal of Computer Vision, 2018.
[47] UnrealROX: An extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation. Virtual Reality, 2020.
[48] UniSim: A Neural Closed-Loop Sensor Simulator. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[49] ASSIST: Interactive Scene Nodes for Scalable and Realistic Indoor Simulation. arXiv preprint arXiv:2311.06211.
[50] Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters.
[51] Habitat 2.0: Training Home Assistants to Rearrange their Habitat. Advances in Neural Information Processing Systems (NeurIPS).
[52] A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432.
[53] Habitat: A platform for embodied AI research. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[54] Scalable deep reinforcement learning for vision-based robotic manipulation. Conference on Robot Learning, 2018.
[55] The role of physics-based simulators in robotics. Annual Review of Control, Robotics, and Autonomous Systems, 2021.
[56] CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 2022.
[57] BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. Conference on Robot Learning, 2022.
[58] iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272 (2021).
[59] RoboTHOR: An open simulation-to-real embodied AI platform. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[60] SEAN 2.0: Formalizing and generating social situations for robot navigation. IEEE Robotics and Automation Letters, 2022.
[61] ... Turner, Oleksandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara Rai, and Roozbeh Mottaghi. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724.
[62] Constraint-aware Policy for Compliant Manipulation. arXiv preprint arXiv:2311.11007.
[63] GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration. arXiv preprint arXiv:2311.12015.
[64] Detecting Twenty-thousand Classes using Image-level Supervision. ECCV.
[65] Neural Approaches to Conversational Information Retrieval. arXiv preprint arXiv:2201.05176.
[66] The next decade in AI: Four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177.
[67] Rebooting AI: Building artificial intelligence we can trust. 2019.
[68] Robust conversational AI with grounded text generation. arXiv preprint arXiv:2009.03457.
[69] How Do In-Context Examples Affect Compositional Generalization? arXiv preprint arXiv:2305.04835.
[70] A Survey on In-context Learning. arXiv preprint arXiv:2301.00234.
[71] Wu, Xiongwei; Sahoo, Doyen; Hoi, Steven CH. Recent advances in deep learning for object detection. Neurocomputing.
[72] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. 2023.
[73] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. 2023.
[74] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. 2023.
[75] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. 2023.
[76] Localized Symbolic Knowledge Distillation for Visual Commonsense Models. Thirty-seventh Conference on Neural Information Processing Systems.
[77] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. 2020.
[78] VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation. 2021.
[79] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling. 2022.
[80] MERLOT: Multimodal Neural Script Knowledge Models. 2021.
