Recognition: unknown
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
Pith reviewed 2026-05-08 03:36 UTC · model grok-4.3
The pith
In deployed trilingual agents, failure-centered evaluation exposes cross-language drift that aggregate scores hide.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When evaluation targets a runtime system instead of a static input-output mapping, the basic unit of analysis must shift from score to failure. PSA-Eval implements the shift by extending the chain to Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, rendering failures traceable, reviewable, repairable, and regression-testable. Using trilingual equivalent inputs on a deployed single-model front-desk system, the pilot recorded an average score of 23.15/24 yet found non-zero cross-language drift in 14 of 27 groups, with a maximum drift of 9 points.
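As a concrete reading of those numbers, the sketch below computes group-level drift as the max-minus-min score across the language variants of each equivalent question group. The data schema, field names, language codes, and the max-minus-min reading of drift are assumptions made for illustration; only the 0-24 scale and the per-group framing come from the paper's reported figures.

```python
from collections import defaultdict

def cross_language_drift(samples):
    """Group-level cross-language drift from per-sample scores.

    `samples` is a list of dicts with keys 'group_id', 'lang', 'score'
    (0-24 per sample). The schema is illustrative; the paper does not
    specify one. Drift is read here as max minus min within a group.
    """
    by_group = defaultdict(dict)
    for s in samples:
        by_group[s["group_id"]][s["lang"]] = s["score"]

    drift = {gid: max(scores.values()) - min(scores.values())
             for gid, scores in by_group.items()}
    avg_score = sum(s["score"] for s in samples) / len(samples)
    return avg_score, drift

# Fabricated mini-example (not the paper's 81 samples): one group whose
# third language variant scores 9 points lower than the other two.
samples = [
    {"group_id": "opening-hours", "lang": "lang_a", "score": 24},
    {"group_id": "opening-hours", "lang": "lang_b", "score": 24},
    {"group_id": "opening-hours", "lang": "lang_c", "score": 15},
]
avg, drift = cross_language_drift(samples)   # avg = 21.0, drift = {"opening-hours": 9}
nonzero_groups = [g for g, d in drift.items() if d > 0]
```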
What carries the argument
The PSA-Eval evaluation chain that inserts Failure Case -> Repair -> Regression Batch after scoring, paired with trilingual equivalent inputs as probes for observing cross-language policy drift.
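A minimal sketch of one such evaluation cycle, assuming that a failure case is any run scoring below the 24-point maximum and using run_system, score_run, and repair_system as stand-ins for the deployed agent, the scoring rubric, and the maintainer's repair step; none of these interfaces are specified in the paper.

```python
def evaluation_cycle(questions, run_system, score_run, repair_system, max_score=24):
    """One failure-centered evaluation cycle along the PSA-Eval chain.

    The callables are placeholders: run_system(q) returns the agent's answer,
    score_run(q, a) returns a 0-24 score, and repair_system(failure) returns a
    description of the repair applied. Treating "score below maximum" as the
    failure criterion is an assumption made for this sketch.
    """
    batch = list(questions)                                   # Question -> Batch
    runs = [(q, run_system(q)) for q in batch]                # Batch -> Run
    scored = [(q, a, score_run(q, a)) for q, a in runs]       # Run -> Score
    failures = [r for r in scored if r[2] < max_score]        # Score -> Failure Case
    repairs = [repair_system(f) for f in failures]            # Failure Case -> Repair
    regression_batch = [q for q, _, _ in failures]            # Repair -> Regression Batch
    return scored, failures, repairs, regression_batch
```

Re-running the returned regression_batch after the repairs is what closes the loop and makes each failure regression-testable rather than a one-off observation.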
If this is right
- Aggregate scores can conceal structured inconsistencies in multilingual runtime behavior.
- Specific failures become directly linked to repair actions and subsequent regression batches.
- Group-level drift measurements provide deployment signals usable for targeted maintenance.
- The method applies to live public-space systems without requiring separate A/B model comparisons.
Where Pith is reading between the lines
- The same failure-tracing cycle could be adapted to detect other runtime inconsistencies, such as those arising from context length or user demographics.
- Prioritizing repair of high-drift failure groups may improve consistency more efficiently than optimizing average scores alone.
- Public deployments of multilingual agents could adopt periodic failure audits as a standard monitoring practice.
Load-bearing premise
Trilingual equivalent inputs serve as valid controlled probes that reveal genuine cross-language policy drift rather than artifacts of phrasing or scoring rules.
What would settle it
A controlled test in which repeated identical questions in one language produce score variations comparable to the observed cross-language differences would falsify the claim that the drifts represent language-specific policy inconsistencies.
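That test amounts to comparing two spreads: score variation when the same question is repeated in one language versus score variation across the language variants of a group. A sketch, assuming scores on the same 0-24 scale and reading spread as max minus min per list (the review does not prescribe a statistic):

```python
def spread(scores):
    """Max-minus-min spread of a list of 0-24 scores."""
    return max(scores) - min(scores)

def within_vs_cross_language(repeat_scores, group_scores):
    """Compare within-language repeat variation against cross-language drift.

    repeat_scores: lists of scores from re-asking one question in one language.
    group_scores:  lists of scores, one per trilingual equivalent group.
    If the two mean spreads are comparable, the observed drift cannot be
    attributed to language-specific policy inconsistency.
    """
    within = sum(spread(xs) for xs in repeat_scores) / len(repeat_scores)
    cross = sum(spread(xs) for xs in group_scores) / len(group_scores)
    return within, cross
```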
Original abstract
This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. It argues that evaluation of runtime systems should treat failure (rather than aggregate score) as the basic unit of analysis, extending the conventional Question-Answer-Score chain into a traceable loop of Batch-Run-Score-Failure Case-Repair-Regression Batch. Using trilingual equivalent inputs as controlled probes, the authors report a pilot on a real deployed digital front-desk system comprising 81 samples in 27 groups; despite an aggregate score of 23.15/24, 14 groups exhibit non-zero cross-language drift, 5 groups show drift of at least 3 points, and the maximum drift is 9 points. The study is conducted under a single-foundation-model simplification (MA = MB).
Significance. If the pilot measurements are robust, the work supplies concrete evidence that failure-centered evaluation can surface structured cross-language policy signals that aggregate scoring conceals. The use of a live deployed system and the reporting of specific drift counts (14 non-zero, 5 >=3 points) constitute a practical strength; the framework's emphasis on traceable, repairable failures also offers a clear operational path for maintainers of multilingual agents.
Major comments (2)
- [pilot study] Pilot study description (81 samples, 27 trilingual groups): the manuscript provides no explicit validation that the trilingual equivalent inputs are semantically equivalent (e.g., back-translation checks, independent linguist review, or inter-rater agreement on equivalence; one such check is sketched after these comments). Without this, the reported drifts cannot be confidently attributed to runtime policy differences rather than phrasing or rubric artifacts, which directly undermines the central claim that the framework exposes 'structured deployment signals hidden by aggregate scoring.'
- [abstract / pilot results] Abstract and pilot results: scoring rules and the procedure for assigning the 0-24 scores are not described. It is therefore impossible to assess whether the observed cross-language differences (max drift 9) reflect genuine policy drift or inconsistent application of the rubric across languages, a load-bearing issue for interpreting the 14 non-zero drift groups.
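One way to report the inter-rater check mentioned in the first comment is a standard agreement statistic over binary equivalence judgments. The sketch below computes Cohen's kappa for two annotators who label each cross-language question pair as equivalent (1) or not (0); the procedure and labels are hypothetical, since the manuscript describes no such validation.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary (0/1) equivalence judgments
    over the same set of cross-language question pairs."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal rate of "equivalent".
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```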
Minor comments (2)
- [abstract] The acronym PSA-Eval is introduced in the title and abstract without an immediate expansion; a parenthetical definition on first use would improve readability.
- [pilot study] The single-foundation-model simplification (MA = MB) is noted but its implications for generalizing the drift findings to multi-model deployments could be stated more explicitly in the discussion.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our pilot study. These points identify areas where additional detail will improve the manuscript's transparency. We address each major comment below and will revise the paper accordingly.
Point-by-point responses
- Referee: [pilot study] Pilot study description (81 samples, 27 trilingual groups): the manuscript provides no explicit validation that the trilingual equivalent inputs are semantically equivalent (e.g., back-translation checks, independent linguist review, or inter-rater agreement on equivalence). Without this, the reported drifts cannot be confidently attributed to runtime policy differences rather than phrasing or rubric artifacts, which directly undermines the central claim that the framework exposes 'structured deployment signals hidden by aggregate scoring.'
Authors: We agree that the absence of explicit semantic-equivalence validation is a limitation in the current manuscript and weakens confidence in attributing drifts solely to policy differences. The trilingual inputs were prepared to be equivalent, but the manuscript does not document validation steps such as back-translation or inter-rater review. In the revision we will add a subsection describing the input-construction process and any equivalence checks that were performed, along with a discussion of remaining limitations. This will allow readers to assess the robustness of the reported cross-language drifts. revision: yes
- Referee: [abstract / pilot results] Abstract and pilot results: scoring rules and the procedure for assigning the 0-24 scores are not described. It is therefore impossible to assess whether the observed cross-language differences (max drift 9) reflect genuine policy drift or inconsistent application of the rubric across languages, a load-bearing issue for interpreting the 14 non-zero drift groups.
Authors: We concur that the scoring rules and assignment procedure must be described for the drift results to be interpretable. The manuscript currently omits these details. In the revised version we will expand the pilot-study section to present the full scoring rubric (including the criteria that sum to the 0-24 range), the exact procedure used to apply the rubric, and how consistency was maintained across languages. This addition will directly address concerns about whether the observed drifts (including the maximum of 9 points) arise from policy differences or rubric application. revision: yes
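One way to make that cross-language consistency inspectable is to apply a single shared rubric function to every language variant and keep the per-criterion breakdown, so a drift can be traced to the specific criteria where languages diverge. The sketch below is purely illustrative: the criterion names, point values, and the assumption that the 0-24 total decomposes into per-criterion scores are not drawn from the manuscript.

```python
def score_answer(answer, criteria):
    """Apply the same per-criterion rubric to an answer in any language.

    `criteria` maps a criterion name to a function returning that criterion's
    points (hypothetical; e.g. eight criteria worth 0-3 each would sum to 24).
    Returning the breakdown alongside the total lets cross-language drift be
    attributed to individual criteria rather than only to the aggregate score.
    """
    breakdown = {name: check(answer) for name, check in criteria.items()}
    return sum(breakdown.values()), breakdown
```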
Circularity Check
No significant circularity; framework definition and pilot measurements are independent.
Full rationale
The paper introduces PSA-Eval as an independent framework that extends the evaluation chain to include failure cases and regression testing. The pilot results (27 trilingual groups, 81 samples, reported drift statistics) are direct empirical measurements collected from an external deployed system rather than quantities fitted, predicted, or derived from the framework's own parameters or definitions. No self-citations, ansatzes, uniqueness theorems, or renamings of known results appear as load-bearing steps in the derivation. The trilingual probes are presented as an assumption for observing drift, but the observed numbers do not reduce to the framework inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Trilingual equivalent inputs control for semantic content while varying only language, allowing observation of policy drift.
- Domain assumption: Failures identified at runtime can be repaired and re-tested in regression batches to confirm consistency.
Invented entities (1)
- PSA-Eval framework: no independent evidence.
Reference graph
Works this paper leans on
- [1] Jiawei Gu, Xuhui Jiang, Zhichao Shi, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.
- [2] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.
- [3] Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. BenchMAX: A comprehensive multilingual evaluation suite for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16751–16774. Association for Computational Linguistics, 2025.
- [4] Haitao Li, Qian Dong, Junjie Chen, et al. LLMs-as-Judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024.
- [5] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023.
- [6] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022.
- [7] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024.
- [8] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. AgentBoard: An analytical evaluation board of multi-turn LLM agents. In Advances in Neural Information Processing Systems 37 (Datasets and Benchmarks Track), 2024.
- [9] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022.
- [10] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
- [11] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023.
- [12] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. LiveBench: A challenging, contamination-limited LLM benchmark. In The Thirteenth International Conference on Learning Representations, 2025.
- [13] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (Datasets and Benchmarks Track), 2023.