Intellagent: A multi-agent framework for evaluating conversational ai systems

Elad Levi, Ilan Kadar · 2025 · arXiv 2501.11067

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

cs.AI · 2025-06-09 · unverdicted · novelty 7.0

τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 6.0

Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

Interactive Evaluation Requires a Design Science

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.

AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

cs.AI · 2026-04-07 · unverdicted · novelty 5.0

AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

citing papers explorer

Showing 5 of 5 citing papers.

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents cs.CL · 2026-05-07 · unverdicted · none · ref 19
MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment cs.AI · 2025-06-09 · unverdicted · none · ref 12
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 98
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
Interactive Evaluation Requires a Design Science cs.AI · 2026-05-18 · unverdicted · none · ref 32
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments cs.AI · 2026-04-07 · unverdicted · none · ref 10
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

Intellagent: A multi-agent framework for evaluating conversational ai systems

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer