DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

Ivan Vasilev; Lev Sorokin; Samuele Pasini

arxiv: 2604.12615 · v1 · submitted 2026-04-14 · 💻 cs.AI

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

keywords: LLM testing, automotive assistant, tool competition, safety warnings, failure detection, benchmarking, DeepTest workshop, car manual retrieval

The pith

Four tools competed to find inputs that make an LLM car manual assistant omit safety warnings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper summarizes the first LLM Testing competition at the DeepTest workshop, where four tools were evaluated on their performance against an LLM-based application that answers questions from car manuals. The central task was to generate user inputs that expose cases in which the system fails to reference important warnings present in the manual. Competitors were scored on effectiveness at revealing such failures and on the diversity of the test inputs they produced. The report describes the methodology, the participating tools, and the outcomes of this evaluation exercise.
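The failure condition described above (an answer that does not reference a relevant manual warning) can be illustrated with a small oracle. This is only a minimal sketch under stated assumptions: the report as summarized here does not specify how omissions were judged, and the competition may well have used a different mechanism, such as an LLM-based judge. The function names, the token-overlap heuristic, and the 0.5 threshold are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def mentions_warning(response: str, warning: str, threshold: float = 0.5) -> bool:
    """Hypothetical check: the response 'mentions' the warning if at least
    `threshold` of the warning's distinct tokens also occur in the response.
    This token-overlap heuristic is an assumption, not the paper's oracle."""
    warning_tokens = tokenize(warning)
    if not warning_tokens:
        return True  # an empty warning cannot be omitted
    overlap = len(warning_tokens & tokenize(response)) / len(warning_tokens)
    return overlap >= threshold


def is_failure(response: str, relevant_warnings: list[str]) -> bool:
    """A test input reveals a failure if any relevant warning is not mentioned."""
    return any(not mentions_warning(response, w) for w in relevant_warnings)


# Illustrative data only; not taken from any real car manual.
warning = "Never open the coolant cap while the engine is hot"
good_answer = "Top up the coolant, but never open the coolant cap while the engine is hot."
bad_answer = "Top up the coolant through the reservoir."

print(is_failure(good_answer, [warning]))
print(is_failure(bad_answer, [warning]))
```

A lexical check like this is cheap and deterministic but misses paraphrased warnings, which is one reason a semantic judge might be preferred in practice.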

Core claim

In the 2026 DeepTest Tool Competition, four testing solutions were benchmarked on their ability to identify user queries that cause an LLM-based automotive manual assistant to omit safety warnings contained in the manual. Performance was measured by the number and variety of failure-revealing tests each tool produced.

What carries the argument

The competition evaluation framework that scores LLM testing tools by effectiveness in exposing warning-omission failures plus diversity of the generated test inputs.
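To make the two scoring axes concrete, the sketch below computes one possible effectiveness score (the fraction of generated tests that reveal a failure) and one possible diversity score (the mean pairwise Jaccard distance between failure-revealing inputs). Both definitions are assumptions chosen for illustration; the summary above does not state the formulas the organizers used, and all function names are hypothetical.

```python
from itertools import combinations


def failure_rate(outcomes: list[bool]) -> float:
    """Assumed effectiveness score: share of tests that revealed a failure."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two inputs' word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def mean_pairwise_distance(inputs: list[str]) -> float:
    """Assumed diversity score: average distance over all unordered pairs."""
    pairs = list(combinations(inputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Illustrative data only.
outcomes = [True, False, True, True]
failing_inputs = [
    "how do I jump start the battery",
    "can I tow a trailer uphill",
    "how do I jump start the car",
]
print(failure_rate(outcomes))
print(mean_pairwise_distance(failing_inputs))
```

Reporting the two numbers separately, rather than folding them into one score, keeps visible the trade-off between finding many failures and finding different ones.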

If this is right

Tools that score higher on both effectiveness and diversity can be selected for more thorough pre-release testing of similar LLM assistants.
The competition results identify which testing strategies are better at covering a wide range of potential user inputs that trigger warning omissions.
Future LLM applications in safety-critical domains can incorporate the winning testing approaches to reduce the chance of omitted warnings.
The methodology provides a repeatable template for organizing tool competitions that target specific failure modes in LLM systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the same competition format to other failure modes, such as incorrect numerical advice or hallucinated procedures, would test whether the effectiveness-plus-diversity metric generalizes.
Comparing the discovered tests against logs of actual driver queries could show how well the proxy matches real usage patterns.
The results may encourage development of automated test generators that explicitly search for safety-related omissions in domain-specific manuals.

Load-bearing premise

That success at surfacing inputs where the assistant skips manual warnings, scored by effectiveness and diversity, serves as a reliable stand-in for actual safety problems without direct checks against real user behavior or incident data.

What would settle it

A controlled user study in which participants interact with the LLM assistant using the discovered test inputs and the rate of missed warnings is compared to the competition scores.

Figures

Figures reproduced from arXiv: 2604.12615 by Ivan Vasilev, Lev Sorokin, Samuele Pasini.

**Figure 2.** Metric results for SUT-I (left) and SUT-II (right) with manual Manual-2 and LLM GPT-4o-Mini over 6 runs.

4 Conclusions. The DeepTest 2026 Testing Competition focused on benchmarking an LLM-based Assistant for retrieving information from an owner’s manual. In total, four tools and 10 participants were competing in testing one industrial and one open-source implementation of an automotive assistant with two dif…

Original abstract

This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a plain workshop report on the first run of a tool competition for finding failure cases in an LLM car-manual assistant, with no new methods or verifiable outcomes shown.

The main takeaway is that this paper simply documents the first edition of the DeepTest LLM testing competition. Four tools competed to generate user inputs that make an LLM-based car manual retriever omit safety warnings from the manual. The tools were scored on effectiveness at exposing those failures and on the diversity of the tests they produced. The report covers the setup, the competitors, and the evaluation rules without claiming broader breakthroughs. That is the extent of what is new here: an instance of an established competition format applied to this particular automotive LLM task. The description of the failure mode and the two evaluation axes is clear and practical for the purpose of running such an event. The authors stick to factual reporting of the competition rules and participants.

The soft spots are straightforward. No quantitative results, failure examples, or statistical details appear in the text, so there is no way to check whether the tools actually performed differently or whether the diversity metric captured anything meaningful. The chosen failure mode, omitting warnings, is reasonable to test but receives no validation against real user behavior or accident data, which limits how far the results can be taken. Tool competitions and LLM benchmarking exercises already exist in the literature, so the format itself adds nothing novel.

This kind of report may interest a narrow group of researchers who run or attend workshops on AI testing for safety-critical systems and want to see an early example in the automotive domain. Most readers will not get much from it. I would not bring it to a reading group or cite it. It does not contain claims or evidence that justify sending it out for peer review; it belongs as workshop notes rather than a full paper.

Referee Report

0 major / 2 minor

Summary. The paper summarizes the results of the first edition of the DeepTest LLM Testing competition at ICSE 2026. Four tools competed to benchmark an LLM-based car manual information retrieval application by identifying user inputs that cause the system to fail to appropriately mention warnings contained in the manual. The tools were evaluated on effectiveness in exposing failures and diversity of the discovered failure-revealing tests, with the report covering experimental methodology, competitors, and results.

Significance. This competition report documents tool performance on a concrete, safety-relevant task for LLM assistants in the automotive domain. It provides a useful community benchmark and comparative data that can inform future testing tool development, particularly for detecting omission failures. The descriptive focus on event documentation rather than broad claims is appropriate and avoids overgeneralization; the competition format itself is a strength for reproducibility and tool comparison.

minor comments (2)

Abstract and results section: the report should include concrete quantitative values (e.g., effectiveness scores, diversity metrics, or example failure-revealing inputs) for each of the four tools so readers can directly assess the outcomes rather than relying on the high-level summary alone.
Methodology section: clarify the exact definitions and measurement procedures for 'effectiveness' and 'diversity' (e.g., any formulas, thresholds, or statistical tests used) to support replication by other researchers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our competition report and the recommendation for minor revision. The referee's summary correctly reflects the paper's focus on documenting the first DeepTest LLM Testing competition results, including tool performance on effectiveness and test diversity for an automotive LLM assistant. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper is a factual competition report that documents the methodology, participating tools, and observed results of an external benchmarking event. It contains no derivations, equations, fitted parameters, or load-bearing claims that reduce any reported outcome to prior inputs by construction. The central content is limited to event documentation rather than a generalizable scientific derivation, so no self-citation or ansatz reduces the findings to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, axioms, or new postulated entities are present; the document is a descriptive report of competition outcomes.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] Bestoun S. Ahmed, Ludwig Otto Baader, Firas Bayram, Siri Jagstedt, and Peter Magnusson. 2025. Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing. In ICSTW.

[2] Antonio Pedro Santos Alves and Marcos Kalinowski. 2026. ATLAS: Adaptive Test Learning And Selection. In Proceedings of the Seventh International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest 2026), co-located with the IEEE/ACM International Conference on Software Engineering.

[3] Selhan Berber. 2026. CRISP: Contextual Risk-Driven Input Structuring for Probing. In Proceedings of the Seventh International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest 2026), co-located with the IEEE/ACM International Conference on Software Engineering (ICSE 2026).

[4] Matteo Biagiola and Paolo Tonella. 2024. Testing of Deep Reinforcement Learning Agents with Surrogate Models. ACM Trans. Softw. Eng. Methodol. 33, 3, Article 73 (2024), 33 pages.

[5] BMW Group Press Club. 2026. A Milestone for Human–Vehicle Interaction: BMW Intelligent Personal Assistant Expanded to Include Amazon Alexa Technology. Accessed: 2026-01-25.

[6] Rafael Giebisch, Ken E. Friedl, Lev Sorokin, and Andrea Stocco. 2025. Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models. In Proceedings of the 36th IEEE Intelligent Vehicles Symposium (IV ’25).

[7] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo.

[8] A Survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594 (2024).

[9] Philipp Habicht, Lev Sorokin, Abdullah Saydemir, Ken E. Friedl, and Andrea Stocco. 2025. Benchmarking Contextual Understanding for In-Car Conversational Systems. arXiv:2512.12042 [cs.CL].

[10] Kaan-Gueney Keklikci, Gaia Peressini, and Dimitrij Krepis. 2026. Multistage Prompt Decomposition for Failure-Inducing Tests. In Proceedings of the Seventh International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest 2026), co-located with the IEEE/ACM International Conference on Software Engineering (ICSE 2026).

[11] Song Qunying, Yuan Gao, Roberto Brusnicki, and Federica Sarro. 2026. Warnless at the DeepTest 2026 Tool Competition. In Proceedings of the Seventh International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest 2026), co-located with the IEEE/ACM International Conference on Software Engineering (ICSE 2026).

[12] Lev Sorokin, Ivan Vasilev, Ken E. Friedl, and Andrea Stocco. 2026. STELLAR: A Search-Based Testing Framework for Large Language Model Applications. In Proceedings of the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE.

[13] Lev Sorokin, Ivan Vasilev, and Samuele Pasini. [n. d.]. Replication Package. https://github.com/deeptest-competition

[14] Reuben Thomas. 2026. Enchant: A Generic Spell Checking Library. https://rrthomas.github.io/enchant/. Accessed: 2026-01-25.
