Gemini 3.1 pro model card
2 Pith papers cite this work.
years
2026: 2
representative citing papers
OpenSkillEval automatically builds realistic tasks from evolving artifacts to audit skill effectiveness in LLM agents, finding that skill use depends on model and framework and that many popular skills do not outperform base agents.
citing papers explorer
- Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.
- OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
OpenSkillEval automatically builds realistic tasks from evolving artifacts to audit skill effectiveness in LLM agents, finding that skill use depends on model and framework and that many popular skills do not outperform base agents.