Agent AI: Surveying the Horizons of Multimodal Interaction
Pith reviewed 2026-05-18 14:20 UTC · model grok-4.3
The pith
Developing agentic AI systems in grounded environments can mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions, with agents that can perceive user actions, human behavior, environmental objects, audio expressions, and scene sentiment to inform responses.
What carries the argument
Agent AI is defined as the class of interactive systems that perceive visual stimuli, language inputs, and environmentally grounded data to produce meaningful embodied actions. It carries the argument by improving next-embodied-action prediction through external knowledge, multi-sensory inputs, and human feedback.
If this is right
- Agents gain the ability to interpret user actions, human behavior, and scene sentiment to direct context-appropriate responses.
- Multimodal systems become more sophisticated by processing visual, language, and other grounded data together.
- The approach subsumes embodied and agentic aspects of multimodal interactions beyond isolated language or vision tasks.
- Users can create arbitrary virtual reality or simulated scenes and interact directly with agents embodied inside them.
Where Pith is reading between the lines
- This grounding strategy might extend to reducing other model failures such as inconsistent reasoning across repeated queries in the same scene.
- It suggests a route to safer deployment in robotics or simulation training where environmental mismatch carries real costs.
- Future work could test whether feedback loops from embodied actions improve model performance faster than additional pretraining data alone.
Load-bearing premise
Embedding today's AI models as agents inside physical or simulated worlds will ground their outputs enough to reduce made-up or mismatched information without redesigning the models or their training.
What would settle it
Deploy an Agent AI system in a controlled environment and check whether it still describes absent objects or generates actions that contradict visible scene elements; persistent mismatches would falsify the grounding claim.
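The check described above can be made concrete with a small sketch. It assumes the controlled environment exposes a ground-truth set of objects and that object mentions have already been extracted from the agent's description; the function name `hallucination_rate` and the sample data are hypothetical and do not come from the paper.

```python
def hallucination_rate(mentioned, present):
    """Fraction of mentioned objects that are absent from the scene.

    Both arguments are iterables of object names (hypothetical inputs;
    a real setup would need an object-mention extractor and a simulator
    that reports ground-truth scene contents).
    """
    mentioned = {m.lower() for m in mentioned}
    present = {p.lower() for p in present}
    if not mentioned:
        return 0.0  # nothing was claimed, so nothing was hallucinated
    return len(mentioned - present) / len(mentioned)


# Illustrative data only: the agent mentions a "cat" that is not in the scene.
scene_objects = {"mug", "table", "laptop"}
agent_mentions = ["mug", "table", "cat"]
rate = hallucination_rate(agent_mentions, scene_objects)
```

A rate that stays well above zero across many scenes would count against the grounding claim; a rate that drops relative to a non-embodied baseline would support it.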
Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey that defines 'Agent AI' as a class of interactive multimodal systems capable of perceiving visual stimuli, language inputs, and other environmentally-grounded data to produce meaningful embodied actions. It reviews systems that leverage existing foundation models for embodied agents in physical and virtual environments, emphasizing next-embodied-action prediction with multi-sensory inputs, external knowledge, and human feedback. The central argument is that grounding foundation models as agents in such environments can mitigate hallucinations and environmentally incorrect outputs, while envisioning future applications in VR and simulated scenes where users interact with embodied agents.
Significance. If the conceptual framework holds, the survey could organize research directions in multimodal embodied AI by subsuming embodied and agentic aspects under a single 'Agent AI' umbrella and highlighting embodiment as a route to address foundation-model limitations without major architectural overhauls. Its forward-looking vision for virtual environments adds relevance for interactive applications. As a high-level synthesis without new empirical results or formal derivations, its primary contribution is definitional and directional rather than evidentiary.
major comments (2)
- Abstract: The claim that 'by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs' is presented as a forward-looking benefit but rests on illustrative reasoning rather than a mechanistic account or citations to empirical demonstrations of reduced hallucination rates in embodied versus non-embodied settings. This load-bearing argument for the value of Agent AI would be strengthened by explicit discussion of feedback loops or prediction mechanisms that enforce environmental consistency.
- Definition and exploration sections: The distinction between Agent AI and prior embodied AI or multimodal agent work is not sharply delineated; the definition incorporates 'next-embodied action prediction' with external knowledge and feedback, yet it remains unclear whether this introduces novel technical requirements or largely overlaps with existing reinforcement-learning or vision-language-action models reviewed later in the manuscript.
minor comments (2)
- The manuscript would benefit from a summary table or taxonomy classifying the reviewed multimodal agent systems by sensory modalities, grounding mechanisms, and hallucination-mitigation strategies to improve readability and synthesis value.
- Some citations to foundational work on embodied AI (e.g., specific vision-language-action models or simulation platforms) appear illustrative; ensuring comprehensive coverage of recent benchmarks on hallucination in grounded settings would strengthen the literature review.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey manuscript. The comments help clarify the presentation of our central arguments and the positioning of Agent AI relative to existing literature. We address each major comment below and indicate the revisions we will make.
-
Referee: Abstract: The claim that 'by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs' is presented as a forward-looking benefit but rests on illustrative reasoning rather than a mechanistic account or citations to empirical demonstrations of reduced hallucination rates in embodied versus non-embodied settings. This load-bearing argument for the value of Agent AI would be strengthened by explicit discussion of feedback loops or prediction mechanisms that enforce environmental consistency.
Authors: We agree that the abstract claim would benefit from a more explicit mechanistic discussion. In the revised manuscript we will expand the abstract slightly and add a new paragraph in the introduction that outlines concrete mechanisms: closed-loop environmental feedback (where predicted next actions are validated against observed state changes), multi-sensory consistency checks, and human-in-the-loop corrections. We will also cite relevant empirical studies from the vision-language-action and embodied RL literature that report measurable reductions in environmentally inconsistent outputs when grounding is applied. These additions will be supported by references already present in the survey plus two or three additional citations. revision: yes
-
Referee: Definition and exploration sections: The distinction between Agent AI and prior embodied AI or multimodal agent work is not sharply delineated; the definition incorporates 'next-embodied action prediction' with external knowledge and feedback, yet it remains unclear whether this introduces novel technical requirements or largely overlaps with existing reinforcement-learning or vision-language-action models reviewed later in the manuscript.
Authors: We acknowledge that the current definition section could draw sharper boundaries. In the revision we will insert an explicit comparison subsection (or table) that contrasts Agent AI with (i) classical embodied AI focused on physical control without foundation-model backbones, (ii) multimodal agents that operate primarily in digital interfaces without embodiment, and (iii) standard RL/VLA pipelines. We will clarify that while technical components such as policy learning overlap, the Agent AI framing uniquely emphasizes the integration of next-embodied-action prediction with external knowledge retrieval, multi-sensory fusion, and human feedback as a single coherent research program. This does not claim entirely new primitives but rather a unifying lens for future work; we will state this limitation explicitly. revision: yes
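The closed-loop environmental feedback mentioned in the first response (predicted next actions validated against observed state changes) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `State` type, the `predict` and `step` functions, and the one-dimensional world with a wall are hypothetical and are not taken from the manuscript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    x: int          # agent position on a 1-D line
    holding: bool   # whether the agent holds an object


def predict(state, action):
    """Hypothetical agent-side model: what the agent expects to happen."""
    if action == "move_right":
        return State(state.x + 1, state.holding)
    if action == "grasp":
        return State(state.x, True)
    return state


def step(state, action, wall_at=2):
    """Toy environment: a wall at x = wall_at blocks movement."""
    if action == "move_right" and state.x + 1 < wall_at:
        return State(state.x + 1, state.holding)
    if action == "grasp":
        return State(state.x, True)
    return state


def run_closed_loop(actions, start=State(0, False)):
    """Execute actions, recording every step where prediction and observation differ."""
    state, mismatches = start, []
    for t, action in enumerate(actions):
        expected = predict(state, action)
        observed = step(state, action)
        if expected != observed:
            mismatches.append((t, action, expected, observed))
        state = observed  # always continue from the observed state
    return state, mismatches


final_state, mismatches = run_closed_loop(["move_right", "move_right", "grasp"])
```

In this run the second `move_right` is blocked by the wall, so the loop records one mismatch and continues from the observed state rather than the predicted one. That correction step is the sense in which grounding could keep outputs environmentally consistent.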
Circularity Check
No significant circularity; survey paper with no derivations or fitted predictions
The manuscript is a high-level survey defining 'Agent AI' as interactive systems that perceive multimodal inputs and produce embodied actions. It reviews existing work and argues that grounding foundation models in physical/virtual environments can mitigate hallucinations via next-embodied-action prediction and feedback. No equations, parameters, predictions, or self-citation chains appear. Central claims are forward-looking arguments, not reductions of outputs to inputs by construction. The paper is self-contained as a review and does not rely on internal derivations that collapse to fitted quantities or prior self-work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing foundation models can serve as effective building blocks for embodied agents when placed in grounded environments.
invented entities (1)
- Agent AI: no independent evidence
Forward citations
Cited by 18 Pith papers
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
-
Towards Considerate Human-Robot Coexistence: A Dual-Space Framework of Robot Design and Human Perception in Healthcare
A dual-space framework models healthcare human-robot coexistence as a co-evolving loop between robot design and four human perception dimensions, positioning humans as active interpreters and mediators.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G
SANet uses semantic-aware AI agents for cross-layer 6G optimization, achieving up to 14.61% performance gains with 44.37% of the FLOPs of prior methods via model partitioning and decentralized multi-objective algorithms.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue
Context-Agent represents dialogue history as a dynamic tree to handle non-linear topic shifts and introduces the NTM benchmark for evaluating long-horizon non-linear dialogues.
-
MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.
-
Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
Introduces host agent and task lifecycle models plus 30 temporal logic properties to enable formal verification of liveness, safety, completeness, and fairness in agentic AI systems.
-
When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks
A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versu...
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents
LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
-
LanG -- A Governance-Aware Agentic AI Platform for Unified Security Operations
LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open...
-
Semantic-Aware Logical Reasoning via a Semiotic Framework
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.