MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

Aaron Guoxiang Guo; Aldeida Aleti; Chakkrit Tantithamthavorn; Neelofar Neelofar; Tsong Yueh Chen; Yuanyuan Qi

arxiv: 2412.15557 · v4 · pith:AQFIRS6Xnew · submitted 2024-12-20 · 💻 cs.SE · cs.CL

keywords: testing, dialogue, mortar, multi-turn, systems, metamorphic, llm-based, test

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, although multi-turn interaction is the most common real-world use of dialogue systems, testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach that mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises multi-turn testing for dialogue systems and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism gives MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated and does not rely on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR is significantly more effective, revealing over 150% more bugs per test case than the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches and to help developers evaluate dialogue system performance more comprehensively with constrained test resources and budget.
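
To make the idea of a dialogue-level perturbation paired with a metamorphic relation more concrete, here is a minimal Python sketch. It is not MORTAR's implementation: the abstract does not specify the actual perturbations or MRs, so the turn-swap perturbation, the "answers stay the same" relation, the helper names (`run_dialogue`, `swap_turns`, `check_swap_relation`) and both toy dialogue systems are assumptions made purely for illustration. The check uses plain string comparison, so no LLM judge and no ground-truth oracle is involved.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a dialogue system takes the history of
# (question, answer) pairs plus a new question and returns an answer.
History = List[Tuple[str, str]]
DialogueSystem = Callable[[History, str], str]


def run_dialogue(system: DialogueSystem, questions: List[str]) -> List[str]:
    """Feed the questions turn by turn and collect the answers."""
    history: History = []
    answers: List[str] = []
    for question in questions:
        answer = system(history, question)
        history.append((question, answer))
        answers.append(answer)
    return answers


def swap_turns(questions: List[str], i: int, j: int) -> List[str]:
    """Illustrative dialogue-level perturbation: swap two turns."""
    perturbed = list(questions)
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed


def check_swap_relation(system: DialogueSystem, questions: List[str],
                        i: int, j: int) -> List[str]:
    """Assumed MR: swapping two independent turns must not change the
    answer to any question. Returns the questions that violate it.
    Questions are assumed to be unique within one dialogue."""
    source = dict(zip(questions, run_dialogue(system, questions)))
    follow_up_questions = swap_turns(questions, i, j)
    follow_up = dict(zip(follow_up_questions,
                         run_dialogue(system, follow_up_questions)))
    return [q for q in questions
            if source[q].strip().lower() != follow_up[q].strip().lower()]


# Toy systems under test (stand-ins for a real LLM-based dialogue system).
FACTS = {
    "capital of france?": "Paris",
    "largest planet?": "Jupiter",
    "boiling point of water in celsius?": "100",
}


def stable_system(history: History, question: str) -> str:
    return FACTS.get(question.strip().lower(), "unknown")


def order_sensitive_system(history: History, question: str) -> str:
    # Seeded bug: from the third turn onwards the system stops answering.
    if len(history) >= 2:
        return "unknown"
    return FACTS.get(question.strip().lower(), "unknown")


QUESTIONS = [
    "Capital of France?",
    "Largest planet?",
    "Boiling point of water in Celsius?",
]

if __name__ == "__main__":
    print(check_swap_relation(stable_system, QUESTIONS, 0, 2))
    print(check_swap_relation(order_sensitive_system, QUESTIONS, 0, 2))
```

In this sketch the relation is checked without knowing the correct answers: the seeded bug is flagged only because the source and follow-up dialogues disagree, which is how metamorphic testing sidesteps the oracle problem. A single-turn test of either toy system would pass, since the bug only appears once the dialogue has several turns.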

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey
cs.SE · 2026-05 · accept · novelty 4.0

A systematic survey of 93 studies that maps the bidirectional relationship between metamorphic testing and LLMs, proposing a taxonomy for MT applied to LLMs and LLMs applied to MT.