pith. sign in

arxiv: 2602.22897 · v3 · pith:5XI654B2new · submitted 2026-02-26 · 💻 cs.AI · cs.CL· cs.CV· cs.LG· cs.MM

OmniGAIA: Towards Native Omni-Modal AI Agents

classification 💻 cs.AI cs.CLcs.CVcs.LGcs.MM
keywords omni-modalreasoningnativeomnigaiatoolagentsassistantsaudio
0
0 comments X
read the original abstract

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD 2026-05 unverdicted novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

  2. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  3. Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

  4. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  5. Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

    cs.CL 2026-06 unverdicted novelty 5.0

    This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environm...

  6. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

    cs.AI 2026-05 unverdicted novelty 5.0

    MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.

  7. TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.

  8. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...