Pith Number

pith:GXZWOMWX

pith:2026:GXZWOMWXDECNUCQRZHDBCWFDH3
not attested · not anchored · not stored · refs resolved

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Anil Madamala, Fanny Riols, Gabrielle Gauthier Melançon, Hari Subramani, Hoang H. Nguyen, Joseph Marinier, Katrina Stankiewicz, Lindsay Devon Brin, Oluwanifemi Bamgbose, Raghav Mehndiratta, Sridhar Krishna Nemala, Srinivas Sunkara, Tara Bogavelli

No voice agent exceeds 0.5 on both accuracy and experience metrics simultaneously.

arxiv:2605.13841 v1 · 2026-05-13 · cs.SD · cs.AI · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{GXZWOMWXDECNUCQRZHDBCWFDH3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

No system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1. Peak and reliable performance diverge substantially (median gap of 0.44 between pass@k and pass^k on EVA-A), and accent and noise perturbations expose substantial robustness gaps.

C2 weakest assumption

That bot-to-bot simulated conversations with automatic validation sufficiently capture the distribution of real human voice interactions and that the composite metrics EVA-A and EVA-X align with downstream user satisfaction or task success in production.

C3 one line summary

EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.
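
Claim C1 contrasts pass@k (at least one of k attempts succeeds) with pass^k (all k attempts succeed). The record does not say how EVA-Bench estimates either quantity, so the following is only a minimal sketch using the common combinatorial estimators over n recorded trials with c successes. The function names and the numbers in the example are hypothetical, not taken from the paper.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds), given c successes in n trials."""
    _check(n, c, k)
    # 1 minus the chance that all k sampled trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), given c successes in n trials."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Hypothetical scenario: 3 successes out of 5 trials, scored at k = 3.
peak = pass_at_k(5, 3, 3)
reliable = pass_hat_k(5, 3, 3)
gap = peak - reliable
print(f"pass@3={peak:.2f}  pass^3={reliable:.2f}  gap={gap:.2f}")
```

At k = 1 both estimators reduce to the plain success rate c/n, which is why a pass@1 threshold such as 0.5 can be read as a per-attempt success rate. The two measures only separate as k grows.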

References

134 extracted · 134 resolved · 5 Pith anchors

[1] Andres, Vadim Fedorov, Rida Sadek, Enric Spagnolo-Arrizabalaga, and Nadescha Trudel 2025
[2] Sd-eval: A benchmark dataset for spoken dialogue understanding beyond words. Advances in Neural Information Processing Systems, 37:56898–56918, 2024
[3] Talking turns: Benchmarking audio foundation models on turn-taking dynamics, 2025
[4] Beyond task completion: Revealing corrupt success in LLM agents through procedure-aware evaluation. arXiv preprint arXiv:2603.03116, 2026
[5] VoiceBench: Benchmarking LLM-Based Voice Assistants 2024 · arXiv:2410.17196

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-18T02:44:09.529740Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

35f36732d71904da0a11c9c61158a33ec49e949722e8faaddfe924035a59771b

Aliases

arxiv: 2605.13841 · arxiv_version: 2605.13841v1 · doi: 10.48550/arxiv.2605.13841 · pith_short_12: GXZWOMWXDECN · pith_short_16: GXZWOMWXDECNUCQR · pith_short_8: GXZWOMWX
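
For this record, the short aliases are prefixes of the full Pith Number, and the Pith Number itself coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below reproduces that relationship. It is an observation about this one record, not a documented derivation rule, and the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as published in this record.
CANONICAL_HASH = "35f36732d71904da0a11c9c61158a33ec49e949722e8faaddfe924035a59771b"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: inferred from this record only; other records may differ.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


pith = pith_number_from_hash(CANONICAL_HASH)

# The short aliases listed above are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}

print(pith)
for name, value in aliases.items():
    print(name, value)
```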
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/GXZWOMWXDECNUCQRZHDBCWFDH3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 35f36732d71904da0a11c9c61158a33ec49e949722e8faaddfe924035a59771b
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "6cbba20173a88250d99dfd19ea8a1161529a5098f4f2da605e38a56fecf7ceac",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.SD",
    "submitted_at": "2026-05-13T17:58:52Z",
    "title_canon_sha256": "21a1fdb87bd4786de47265dd4dc46819ebcd0c4bfa52c5571ff63f262e2ca61c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13841",
    "kind": "arxiv",
    "version": 1
  }
}
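
The shell pipeline above can also be run offline against the record as printed here. The following is a minimal sketch of the same canonicalization (sorted keys, compact separators, UTF-8, then SHA-256) in plain Python. It assumes the JSON shown above is byte-for-byte the record the API returns. That has not been confirmed here, so the script reports whether the digest matches rather than asserting it.

```python
import hashlib
import json

# Published canonical hash for this record.
EXPECTED = "35f36732d71904da0a11c9c61158a33ec49e949722e8faaddfe924035a59771b"

# Canonical record copied from the page; key order here does not matter.
record = {
    "metadata": {
        "abstract_canon_sha256": "6cbba20173a88250d99dfd19ea8a1161529a5098f4f2da605e38a56fecf7ceac",
        "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.SD",
        "submitted_at": "2026-05-13T17:58:52Z",
        "title_canon_sha256": "21a1fdb87bd4786de47265dd4dc46819ebcd0c4bfa52c5571ff63f262e2ca61c",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13841", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators is what makes the digest independent of how the JSON was formatted or ordered in transit.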