PICKLES: a Natural Language Framework for Requirement Specification and Model-Based Testing

Mar\'ia Bel\'en Rodr\'iguez; Petra van den Bos

arxiv: 2604.26572 · v1 · submitted 2026-04-29 · 💻 cs.SE

PICKLES: a Natural Language Framework for Requirement Specification and Model-Based Testing

Mar\'ia Bel\'en Rodr\'iguez , Petra van den Bos This is my paper

Pith reviewed 2026-05-07 11:22 UTC · model grok-4.3

classification 💻 cs.SE

keywords natural language specificationsmodel-based testingbehavior-driven developmentautomatic test generationrequirements engineeringgherkin scenariostest coverageindustrial case study

0 comments

The pith

Structured natural language scenarios translate into formal models whose composition enables automatic test generation with higher coverage than direct BDD use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a language for writing requirement scenarios in readable natural language that also supports full translation into formal models. These models are then merged into a single master model to which standard algorithms for automatic test derivation can be applied. A sympathetic reader would care because the approach keeps specifications understandable to non-technical stakeholders while producing more thorough test suites from the identical starting scenarios, as shown on a real software component from a traffic management system.

Core claim

Pickles extends Gherkin-style scenarios with precise keywords for inputs and control flow, supporting a two-way translation to formal models. All translated scenarios are composed into one master model without semantic loss, after which model-based testing algorithms automatically derive test cases. On an industrial case study this pipeline produced significantly higher coverage than applying the original scenarios directly within BDD.

What carries the argument

Bi-directional translation of Pickles scenarios into formal models, followed by their composition into a master model that admits standard automatic test generation algorithms.

If this is right

Test cases can be produced automatically once scenarios are written, without separate manual test design.
The same set of scenarios yields measurably larger test suites and higher coverage than when used only in BDD style.
Requirements and the derived tests remain in a single human-readable format that stakeholders can review.
Composition into the master model enforces consistency across scenarios that would otherwise be checked only manually.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could maintain tests by editing the natural language scenarios and regenerating the suite rather than updating tests by hand.
The same translation step might be reused to feed other analysis tools such as model checkers or simulators.
The approach could be tried on domains with strict regulatory test documentation needs to see whether the readable scenarios satisfy audit requirements while still supplying the generated tests.

Load-bearing premise

The translation between natural language scenarios and formal models preserves every intended behavior and the master model assembles multiple scenarios without introducing inconsistencies or omissions.

What would settle it

A collection of Pickles scenarios for which the test cases generated from the master model miss a behavior or path that is present when the scenarios are examined individually or that the original natural language description explicitly required.

read the original abstract

This paper combines methods from the fields of Model-Based Testing (MBT) and Behaviour-Driven Development (BDD) to define a testing approach with human-readable specifications and test cases, as in BDD, while using the modelling techniques and automatic test generation algorithms from MBT. We introduce PICKLES, a Precise Input and Control-flow Keyword-based Language for tEst Scenarios; an extension of Gherkin-style BDD scenarios, specified in structured natural language. We provide a bi-directional translation from Pickles scenarios to formal models, where both specifications and tests are human-readable, and a method to obtain a so-called master model combining all translated scenarios. Standard MBT algorithms can then be applied to automatically derive test cases from it. We implement a prototype of the translation and composition steps, which we use on an industrial case study: a software component from a traffic management system. With it, we illustrate the pipeline and show how Pickles can achieve significantly higher coverage with respect to BDD from the same set of scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PICKLES links BDD scenarios to MBT via keyword extensions and a master model, but the translation and composition steps lack shown guarantees.

read the letter

The main takeaway is that PICKLES extends BDD scenarios with precise keywords so they can drive model-based test generation through a master model, and the authors report better coverage on one industrial case. What the paper does is lay out a bi-directional translation from the natural language specs to formal models and back, plus a way to merge multiple scenarios into one master model for MBT algorithms. This keeps the human-readable part while adding automation. The traffic management example illustrates the whole flow and backs the coverage claim with a real system component. It does well at showing a practical pipeline that could appeal to teams using BDD who want more systematic testing. The prototype implementation is a plus because it means the ideas are not just theoretical. The soft spots are around the formal foundations. The translation and composition steps assume no loss of intended behavior, but without details on the underlying model or checks for consistency, it's hard to know if the higher coverage is valid. The abstract gives the high-level claim but skips metrics, comparison methods, or handling of potential conflicts in the master model. That makes the central result rest on the single case study. This paper is for software engineering folks focused on testing and requirements. A reader interested in bridging informal and formal methods would get something out of the example and the language design. I would recommend sending it for peer review. The work has a clear application and some evidence from practice, so referees could help tighten the formal parts and evaluate the coverage results more closely.

Referee Report

2 major / 2 minor

Summary. The paper introduces PICKLES, a Precise Input and Control-flow Keyword-based Language for tEst Scenarios, as an extension of Gherkin-style BDD scenarios using structured natural language. It provides a bi-directional translation from Pickles scenarios to formal models, a method to obtain a master model by combining all translated scenarios, and applies standard MBT algorithms to derive test cases automatically. A prototype of the translation and composition is implemented and demonstrated on an industrial case study of a software component from a traffic management system, claiming significantly higher coverage compared to BDD from the same scenarios.

Significance. Should the bi-directional translation and master model composition preserve the semantics of the original scenarios without loss or inconsistency, this work offers a promising integration of BDD's human-readable specifications with MBT's automated test generation and improved coverage capabilities. This could facilitate better requirement engineering and testing in industrial settings. The provision of a prototype and its application to a real-world traffic management system component is a positive aspect, offering empirical illustration of the pipeline's potential.

major comments (2)

Abstract: The assertion that Pickles 'can achieve significantly higher coverage with respect to BDD from the same set of scenarios' is presented without quantitative metrics, details on the BDD baseline implementation, coverage criteria, or number of scenarios/test cases. This leaves the central empirical claim weakly supported and difficult to evaluate.
Bi-directional translation and master model composition: The central claim rests on the unverified assumptions that each Pickles scenario translates to a formal model exactly capturing its intended behavior and that the master model is a faithful composition (no lost constraints, no spurious interleavings). No formal semantics for the target model (e.g., LTS, FSM, or Petri net), equivalence relation, or argument that composition is monotonic and conflict-free are supplied, which is load-bearing for guaranteeing that extra coverage is valid with respect to the original requirements.

minor comments (2)

The abstract would benefit from briefly specifying the formal model formalism used in the translation to aid immediate technical understanding.
In the case study description, include concrete details such as the number of scenarios, component size, or exact coverage values to strengthen the empirical illustration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of empirical results and the justification of the core technical contributions. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses

Referee: Abstract: The assertion that Pickles 'can achieve significantly higher coverage with respect to BDD from the same set of scenarios' is presented without quantitative metrics, details on the BDD baseline implementation, coverage criteria, or number of scenarios/test cases. This leaves the central empirical claim weakly supported and difficult to evaluate.

Authors: We agree that the abstract would be strengthened by including concrete details from the case study. The body of the manuscript reports the industrial case study results, including the number of scenarios, the specific coverage criteria (statement and transition coverage), the BDD baseline implementation used for comparison, and the measured coverage improvements. We will revise the abstract to summarize these quantitative metrics and briefly note the evaluation setup. revision: yes
Referee: Bi-directional translation and master model composition: The central claim rests on the unverified assumptions that each Pickles scenario translates to a formal model exactly capturing its intended behavior and that the master model is a faithful composition (no lost constraints, no spurious interleavings). No formal semantics for the target model (e.g., LTS, FSM, or Petri net), equivalence relation, or argument that composition is monotonic and conflict-free are supplied, which is load-bearing for guaranteeing that extra coverage is valid with respect to the original requirements.

Authors: The manuscript defines the translation rules from Pickles scenarios to finite-state-machine models in Section 3 and describes the master-model construction via state merging in Section 4. We acknowledge that a formal semantics, equivalence relation, and monotonicity proof are not supplied. We will add a dedicated subsection providing an informal argument for semantic preservation: each scenario maps deterministically to a state-machine fragment via the keyword rules, and composition merges on shared control states without introducing new transitions or removing existing constraints. We will also reference related work on scenario composition and note that a full formal treatment is planned as future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces PICKLES as an extension of Gherkin, describes a bi-directional translation to formal models and a master-model composition step, then reports an empirical coverage comparison on an industrial traffic-management case study. The coverage result is presented as an observed outcome of applying standard MBT algorithms to the composed model rather than as a quantity derived from fitted parameters or self-referential definitions. No equations appear that equate a claimed prediction to its own inputs by construction, no self-citations are invoked as uniqueness theorems or load-bearing premises, and the central claim rests on the case-study measurement rather than on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full text unavailable, so ledger entries are inferred at high level from the described method.

axioms (1)

domain assumption Formal models derived from Pickles scenarios accurately capture the intended system behaviour.
The approach assumes the translation step is semantics-preserving, which is required for the master model and subsequent MBT to be valid.

invented entities (1)

PICKLES language no independent evidence
purpose: Structured natural language extension of Gherkin that enables bi-directional mapping to formal models.
New language and associated translation rules introduced by the paper.

pith-pipeline@v0.9.0 · 5478 in / 1239 out tokens · 48111 ms · 2026-05-07T11:22:22.617523+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Alégroth, E., Karl, K., Rosshagen, H., Helmfridsson, T., & Olsson, N. (2022). Practitioners’ best practices to Adopt, Use or Abandon Model-based Testing with Graphical models for Software-intensive Systems.Empirical Software Engineering, 27(5). doi: 10.1007/s10664-022-10145-2 Alferez, M., Pastore, F., Sabetzadeh, M., Briand, L., & Riccardi, J.-R. (2019). ...

work page doi:10.1007/s10664-022-10145-2 2022
[2]

S., Babu, D

doi: 10.48550/arXiv.2503.18597 Rani, V. S., Babu, D. A. R., Deepthi, K., & Reddy, V. R. (2023). Shift-Left Testing in DevOps: A Study of Benefits, Chal- lenges, and Best Practices. InICACRS’23(p. 1675-1680). doi: 10.1109/{ICACRS}58579.2023.10404436 Rao, N., Gilbert, E., Green, H., Ramananandro, T., Swamy, N., Le Goues, C., & Fakhoury, S. (2025).DiffSpec: ...

work page doi:10.48550/arxiv.2503.18597 2023

[1] [1]

Alégroth, E., Karl, K., Rosshagen, H., Helmfridsson, T., & Olsson, N. (2022). Practitioners’ best practices to Adopt, Use or Abandon Model-based Testing with Graphical models for Software-intensive Systems.Empirical Software Engineering, 27(5). doi: 10.1007/s10664-022-10145-2 Alferez, M., Pastore, F., Sabetzadeh, M., Briand, L., & Riccardi, J.-R. (2019). ...

work page doi:10.1007/s10664-022-10145-2 2022

[2] [2]

S., Babu, D

doi: 10.48550/arXiv.2503.18597 Rani, V. S., Babu, D. A. R., Deepthi, K., & Reddy, V. R. (2023). Shift-Left Testing in DevOps: A Study of Benefits, Chal- lenges, and Best Practices. InICACRS’23(p. 1675-1680). doi: 10.1109/{ICACRS}58579.2023.10404436 Rao, N., Gilbert, E., Green, H., Ramananandro, T., Swamy, N., Le Goues, C., & Fakhoury, S. (2025).DiffSpec: ...

work page doi:10.48550/arxiv.2503.18597 2023