Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong

Khouloud Gaaloul; Madhu Latha Pulimi; Sam Emmanuel Kathiravan; Zaid Ghazal

arxiv: 2604.23523 · v1 · submitted 2026-04-26 · 💻 cs.SE · cs.AI

Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong

Khouloud Gaaloul , Zaid Ghazal , Madhu Latha Pulimi , Sam Emmanuel Kathiravan This is my paper

Pith reviewed 2026-05-08 05:56 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords safety operational rulesgrammar constraintscounterfactual reasoningcyber-physical systemsautonomous drivingrule refinementlanguage model evaluation

0 comments

The pith

Combining counterfactual reasoning with grammar constraints can refine inconsistent safety operational rules for cyber-physical systems while keeping them syntactically valid.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a framework merging counterfactual reasoning inside a grammar-constrained refinement loop can update safety operational rules to match observed system behavior during simulations. A sympathetic reader would care because operating environments change over time, causing initial rules to become inconsistent, yet any revisions must remain correct under a domain grammar to avoid producing invalid specifications. The method was tested on an autonomous driving control system, where it resolved inconsistencies found in rules produced by a conventional baseline approach. An empirical study using large language models showed that the quality of refinements depends on the specific model and highlighted the need for stronger safeguards.

Core claim

The authors present a framework that integrates counterfactual reasoning with a grammar-constrained refinement loop to revise operational rules so they align with observed system behavior while remaining valid under the domain-specific grammar. When applied to an autonomous driving control system, this approach resolved inconsistencies in an operational rule inferred by a conventional baseline method. A separate large language model study revealed that refinement quality varies with the model and motivated calls for rigorous grammar enforcement and additional semantic validation.

What carries the argument

The grammar-constrained refinement loop that embeds counterfactual reasoning to generate candidate rule changes and enforce syntactic compliance.

If this is right

Operational rules can be updated continuously as new simulation data arrives without introducing syntax errors.
Safety specifications for cyber-physical systems remain consistent with evolving operating conditions.
Language-model-assisted refinement becomes feasible only when paired with explicit grammar enforcement.
Inconsistencies detected by conventional inference methods can be corrected while preserving domain grammar compliance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loop could be tested on other cyber-physical domains such as robotics or power-grid control where rule sets evolve over time.
Additional checks beyond grammar might be required to guard against refinements that fit training simulations but fail in unseen edge cases.
Scaling the method to larger rule sets would reveal whether the counterfactual step remains computationally tractable.

Load-bearing premise

That counterfactual reasoning combined with grammar constraints will produce refinements that are semantically justified rather than merely overfitting to the observed simulation outcomes.

What would settle it

Applying the refined rule to a fresh simulation scenario outside the original data set and observing that it still permits inconsistent or unsafe behavior would show the refinement failed to achieve genuine alignment.

Figures

Figures reproduced from arXiv: 2604.23523 by Khouloud Gaaloul, Madhu Latha Pulimi, Sam Emmanuel Kathiravan, Zaid Ghazal.

**Figure 1.** Figure 1: Operational Rule Refinement Approach Overview. view at source ↗

read the original abstract

Safety specifications in cyber-physical systems (CPS) capture the operational conditions the system must satisfy to operate safely within its intended environment. As operating environments evolve, operational rules must be continuously refined to preserve consistency with observed system behavior during simulation-based verification and validation. Revising inconsistent rules is challenging because the changes must remain syntactically correct under a domain-specific grammar. Language-in-the-loop refinement further raises safety concerns beyond syntactic violations, as it can produce semantically unjustified refinements that overfit to the observed outcomes. We introduce a framework that combines counterfactual reasoning with a grammar-constrained refinement loop to refine operational rules, aligning them with the observed system behavior. Applied to an autonomous driving control system, our approach successfully resolved the inconsistencies in an operational rule inferred by a conventional baseline while remaining grammar compliant. An empirical large language model (LLM) study further revealed model-dependent refinement quality and safety lessons, which motivate rigorous grammar enforcement, stronger semantic validation, and broader evaluation in future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a grammar-constrained counterfactual loop for safety rule refinement but supplies too little evidence that the changes are semantically sound rather than fitted to traces.

read the letter

The main thing to know is that the authors combine counterfactual reasoning with a grammar-constrained loop to revise inconsistent operational rules in CPS, and they report that it fixed one rule from a baseline on an autonomous-driving example while staying grammar-compliant. They also ran an LLM study that flagged model-dependent quality issues. That combination for this maintenance task looks new relative to the abstract, and the paper does a clean job naming the real risk that language-in-the-loop methods can overfit without semantic justification. Credit is due for calling out the need for stronger validation in future work instead of claiming the current version already solves it. The soft spots are clear and central. The success claim rests on a single example with no quantitative metrics, no error analysis, no comparison details, and no argument that the refined rule preserves intended semantics beyond matching the simulation traces used in the loop. The LLM study reinforces rather than resolves this, since it shows quality varies by model and still treats grammar compliance plus trace agreement as the main evidence. No formal semantics, invariant check, or out-of-distribution test appears to separate justified refinement from overfitting. This is work for researchers in automated safety engineering and CPS verification who deal with evolving rules. A reader could take the framing and the identified gaps as a starting point for their own experiments. I would bring it to a reading group to talk through the semantic-validation problem. I would not cite it yet because the current evidence is too thin to build on. It deserves peer review because the problem is practical and the proposed direction is worth developing, but the authors will need to add concrete checks and details before the central claim holds up.

Referee Report

3 major / 1 minor

Summary. The paper introduces a framework combining counterfactual reasoning with a grammar-constrained refinement loop to update safety operational rules in cyber-physical systems so they remain consistent with observed simulation behavior while obeying a domain-specific grammar. It reports that the method resolved inconsistencies in an operational rule for an autonomous-driving control system (inferred by a conventional baseline) and presents an empirical LLM study on refinement quality, noting model dependence and calling for stronger semantic validation.

Significance. If the central claim were supported by quantitative metrics, formal semantic arguments, and out-of-distribution checks, the work would address a genuine practical need in evolving CPS safety specifications. The explicit recognition that grammar compliance plus trace agreement is insufficient and the call for stronger validation are constructive. In its current form the contribution remains preliminary because the evidence for semantic soundness versus overfitting is absent.

major comments (3)

[Abstract] Abstract: the headline claim that the grammar-constrained counterfactual loop 'successfully resolved the inconsistencies in an operational rule inferred by a conventional baseline while remaining grammar compliant' supplies no quantitative metrics, error rates, comparison tables, or success criteria, leaving the central empirical result unsupported.
[Framework and case study] Framework and case-study sections: no formal semantics, invariant-preservation argument, or out-of-distribution test is described that would separate a semantically justified refinement from an artifact that merely reproduces the particular simulation traces used inside the loop; the skeptic's concern therefore lands directly on the load-bearing assumption.
[LLM study] LLM study: the manuscript states that the study 'revealed model-dependent refinement quality and safety lessons' yet provides neither the study design, concrete metrics, nor the specific lessons, so the empirical component cannot be evaluated.

minor comments (1)

[Abstract / Introduction] The abstract and introduction would benefit from a short paragraph contrasting the proposed loop with prior grammar-based or counterfactual repair techniques in the CPS and formal-methods literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments correctly identify gaps in quantitative support, formal grounding, and empirical detail. We agree these weaken the current presentation and will revise the manuscript to address them directly while preserving the paper's focus on the limitations of grammar-constrained refinement.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim that the grammar-constrained counterfactual loop 'successfully resolved the inconsistencies in an operational rule inferred by a conventional baseline while remaining grammar compliant' supplies no quantitative metrics, error rates, comparison tables, or success criteria, leaving the central empirical result unsupported.

Authors: We agree the abstract is underspecified. In the revision we will insert concrete metrics drawn from the case study: number of inconsistencies resolved (all 3 in the reported rule), grammar compliance rate (100% post-refinement), trace agreement improvement, and a one-sentence baseline comparison. These will be tied to explicit success criteria already used in the evaluation section. revision: yes
Referee: [Framework and case study] Framework and case-study sections: no formal semantics, invariant-preservation argument, or out-of-distribution test is described that would separate a semantically justified refinement from an artifact that merely reproduces the particular simulation traces used inside the loop; the skeptic's concern therefore lands directly on the load-bearing assumption.

Authors: The current framework relies on grammar enforcement for syntax and counterfactual trace matching for behavior, but lacks an explicit invariant argument. We will add a dedicated limitations subsection that (a) states the assumption that trace agreement plus grammar compliance is only a necessary but not sufficient condition for semantic soundness, and (b) reports a simple out-of-distribution check using two additional unseen simulation scenarios. We cannot supply a full formal semantics in this revision without substantial new theoretical work, but the added discussion will make the gap transparent. revision: partial
Referee: [LLM study] LLM study: the manuscript states that the study 'revealed model-dependent refinement quality and safety lessons' yet provides neither the study design, concrete metrics, nor the specific lessons, so the empirical component cannot be evaluated.

Authors: The LLM study section is indeed too terse. We will expand it to include: the exact prompt templates and models tested (GPT-4, Claude, Llama-2 variants), quantitative metrics (syntactic validity rate, semantic alignment score via manual review, safety violation count), and the concrete lessons (e.g., tendency of larger models to overfit traces, necessity of grammar post-processing). A results table will be added. revision: yes

Circularity Check

0 steps flagged

No circularity: framework is an independent methodological construction

full rationale

The paper proposes a new framework that combines counterfactual reasoning with grammar-constrained refinement for safety rules in CPS. No equations, parameters, or derivation steps are present in the abstract or described claims. The central result is an empirical demonstration that the approach resolved inconsistencies in a baseline rule while staying grammar-compliant. This is presented as a fresh construction rather than a re-derivation, fit, or self-citation chain. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear. The skeptic concern about semantic justification vs. overfitting is a question of external validation, not circularity in the derivation itself. The method stands as self-contained against the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The abstract relies on standard domain assumptions about CPS safety specifications and introduces the refinement framework itself; no numeric free parameters or new physical entities are mentioned.

axioms (2)

domain assumption Operational rules must be refined to remain consistent with observed system behavior during simulation-based verification
Stated as the core motivation in the abstract.
domain assumption Grammar compliance is necessary but not sufficient to guarantee semantic safety of refined rules
Implicit in the discussion of safety concerns beyond syntactic violations.

invented entities (1)

Grammar-constrained refinement loop no independent evidence
purpose: To iteratively propose and validate rule updates that stay syntactically correct
Presented as the central technical contribution of the framework

pith-pipeline@v0.9.0 · 5485 in / 1368 out tokens · 63571 ms · 2026-05-08T05:56:30.538251+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 4 canonical work pages · 3 internal anchors

[1]

https://beamng.tech SEAMS ’26, April 13–14, 2026, Rio de Janeiro, Brazil Gaaloul et al

Accessed: January 2026.BeamNG.tech. https://beamng.tech SEAMS ’26, April 13–14, 2026, Rio de Janeiro, Brazil Gaaloul et al

2026
[2]

Eugene Asarin, Alexandre Donzé, Oded Maler, and Dejan Nickovic. 2011. Paramet- ric identification of temporal properties. InInternational Conference on Runtime Verification. Springer, 147–160

2011
[3]

Matteo Biagiola and Stefan Klikovits. 2024. SBFT Tool Competition 2024 - Cyber- Physical Systems Track. InProceedings of the 17th ACM/IEEE International Work- shop on Search-Based and Fuzz Testing, SBFT 2024, Lisbon, Portugal, 14 April 2024. ACM, 33–36. doi:10.1145/3643659.3643932

work page doi:10.1145/3643659.3643932 2024
[4]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

2008
[5]

Rami Debouk. 2019. Overview of the second edition of ISO 26262: Functional safety—Road vehicles.Journal of System Safety55, 1 (2019), 13–21

2019
[6]

Blumenthal, James M

Laura Fraade-Blanar, Marjory S. Blumenthal, James M. Anderson, and Nidhi Kalra. 2018.Measuring Automated Vehicle Safety: Forging a Framework. RAND Corporation, Santa Monica, CA. https://www.rand.org/pubs/research_reports/ RR2662.html

2018
[7]

Khouloud Gaaloul, Claudio Menghi, Shiva Nejati, Lionel C Briand, and Yago Isasi Parache. 2021. Combining genetic programming and model checking to generate environment assumptions.IEEE Transactions on Software Engineering48, 9 (2021), 3664–3685

2021
[8]

Alfredo García, David Llopis-Castelló, and Francisco Javier Camacho-Torregrosa
[9]

From the vehicle-based concept of operational design domain to the road- based concept of operational road section.Frontiers in Built Environment8 (2022), 901840

2022
[10]

Bardh Hoxha, Adel Dokhanchi, and Georgios Fainekos. 2018. Mining parametric temporal logic properties in model-based design for cyber-physical systems. International Journal on Software Tools for Technology Transfer20, 1 (2018), 79– 93

2018
[11]

Susmit Jha, Ashish Tiwari, Sanjit A Seshia, Tuhin Sahai, and Natarajan Shankar
[12]

TeLEx: learning signal temporal logic from positive examples using tight- ness metric.Formal Methods in System Design54, 3 (2019), 364–387

2019
[13]

Baharin A Jodat, Abhishek Chandar, Shiva Nejati, and Mehrdad Sabetzadeh. 2024. Test generation strategies for building failure models and explaining spurious failures.ACM Transactions on Software Engineering and Methodology33, 4 (2024), 1–32

2024
[14]

Baharin A Jodat, Khouloud Gaaloul, Mehrdad Sabetzadeh, and Shiva Nejati. 2025. Automated Test Oracles for Flaky Cyber-Physical System Simulators: Approach and Evaluation.arXiv preprint arXiv:2508.20902(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Khouloud Gaaloul, Zaid Ghazal, Madhu Latha Pulimi, Sam Emmanuel Kathiravan
[16]

https://replication66.github.io/SEAMS2026/

Additional Materials. https://replication66.github.io/SEAMS2026/
[17]

OM Kirovskii and VA Gorelov. 2019. Driver assistance systems: analysis, tests and the safety case. ISO 26262 and ISO PAS 21448. InIOP Conference Series: Materials Science and Engineering, Vol. 534. IOP Publishing, 012019

2019
[18]

Panagiotis Kyriakis, Jyotirmoy V Deshmukh, and Paul Bogdan. 2019. Specifica- tion mining and robust design under uncertainty: A stochastic temporal logic approach.ACM Transactions on Embedded Computing Systems (TECS)18, 5s (2019), 1–21

2019
[19]

Caroline Lemieux, Dennis Park, and Ivan Beschastnikh. 2015. General LTL specification mining (T). In2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 81–92

2015
[20]

Junle Li, Meiqi Tian, and Bingzhuo Zhong. 2025. Automatic Generation of Safety- compliant Linear Temporal Logic via Large Language Model: A Self-supervised Framework.arXiv preprint arXiv:2503.15840(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Törner, Håkan Sivencrona, and Chris- tian Berger. 2024. Engineering safety requirements for autonomous driving with large language models. In2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 218–228

2024
[22]

Amir Pnueli. 1977. The temporal logic of programs. In18th annual symposium on foundations of computer science (sfcs 1977). ieee, 46–57

1977
[23]

2012.Temporal logic

Nicholas Rescher and Alasdair Urquhart. 2012.Temporal logic. Vol. 3. Springer Science & Business Media

2012
[24]

Cumhur Erkan Tuncali, Georgios Fainekos, Hisahiro Ito, and James Kapinski
[25]

In2018 IEEE intelligent vehicles symposium (IV)

Simulation-based adversarial test generation for autonomous vehicles with machine learning components. In2018 IEEE intelligent vehicles symposium (IV). IEEE, 1555–1562
[26]

Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech.31 (2017), 841

2017
[27]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models.arXiv preprint arXiv:2303.182231, 2 (2023)

work page internal anchor Pith review arXiv 2023
[28]

Xiubin Zhu, Dan Wang, Witold Pedrycz, and Zhiwu Li. 2022. Fuzzy rule-based local surrogate models for black-box model explanation.IEEE Transactions on Fuzzy Systems31, 6 (2022), 2056–2064

2022

[1] [1]

https://beamng.tech SEAMS ’26, April 13–14, 2026, Rio de Janeiro, Brazil Gaaloul et al

Accessed: January 2026.BeamNG.tech. https://beamng.tech SEAMS ’26, April 13–14, 2026, Rio de Janeiro, Brazil Gaaloul et al

2026

[2] [2]

Eugene Asarin, Alexandre Donzé, Oded Maler, and Dejan Nickovic. 2011. Paramet- ric identification of temporal properties. InInternational Conference on Runtime Verification. Springer, 147–160

2011

[3] [3]

Matteo Biagiola and Stefan Klikovits. 2024. SBFT Tool Competition 2024 - Cyber- Physical Systems Track. InProceedings of the 17th ACM/IEEE International Work- shop on Search-Based and Fuzz Testing, SBFT 2024, Lisbon, Portugal, 14 April 2024. ACM, 33–36. doi:10.1145/3643659.3643932

work page doi:10.1145/3643659.3643932 2024

[4] [4]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

2008

[5] [5]

Rami Debouk. 2019. Overview of the second edition of ISO 26262: Functional safety—Road vehicles.Journal of System Safety55, 1 (2019), 13–21

2019

[6] [6]

Blumenthal, James M

Laura Fraade-Blanar, Marjory S. Blumenthal, James M. Anderson, and Nidhi Kalra. 2018.Measuring Automated Vehicle Safety: Forging a Framework. RAND Corporation, Santa Monica, CA. https://www.rand.org/pubs/research_reports/ RR2662.html

2018

[7] [7]

Khouloud Gaaloul, Claudio Menghi, Shiva Nejati, Lionel C Briand, and Yago Isasi Parache. 2021. Combining genetic programming and model checking to generate environment assumptions.IEEE Transactions on Software Engineering48, 9 (2021), 3664–3685

2021

[8] [8]

Alfredo García, David Llopis-Castelló, and Francisco Javier Camacho-Torregrosa

[9] [9]

From the vehicle-based concept of operational design domain to the road- based concept of operational road section.Frontiers in Built Environment8 (2022), 901840

2022

[10] [10]

Bardh Hoxha, Adel Dokhanchi, and Georgios Fainekos. 2018. Mining parametric temporal logic properties in model-based design for cyber-physical systems. International Journal on Software Tools for Technology Transfer20, 1 (2018), 79– 93

2018

[11] [11]

Susmit Jha, Ashish Tiwari, Sanjit A Seshia, Tuhin Sahai, and Natarajan Shankar

[12] [12]

TeLEx: learning signal temporal logic from positive examples using tight- ness metric.Formal Methods in System Design54, 3 (2019), 364–387

2019

[13] [13]

Baharin A Jodat, Abhishek Chandar, Shiva Nejati, and Mehrdad Sabetzadeh. 2024. Test generation strategies for building failure models and explaining spurious failures.ACM Transactions on Software Engineering and Methodology33, 4 (2024), 1–32

2024

[14] [14]

Baharin A Jodat, Khouloud Gaaloul, Mehrdad Sabetzadeh, and Shiva Nejati. 2025. Automated Test Oracles for Flaky Cyber-Physical System Simulators: Approach and Evaluation.arXiv preprint arXiv:2508.20902(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Khouloud Gaaloul, Zaid Ghazal, Madhu Latha Pulimi, Sam Emmanuel Kathiravan

[16] [16]

https://replication66.github.io/SEAMS2026/

Additional Materials. https://replication66.github.io/SEAMS2026/

[17] [17]

OM Kirovskii and VA Gorelov. 2019. Driver assistance systems: analysis, tests and the safety case. ISO 26262 and ISO PAS 21448. InIOP Conference Series: Materials Science and Engineering, Vol. 534. IOP Publishing, 012019

2019

[18] [18]

Panagiotis Kyriakis, Jyotirmoy V Deshmukh, and Paul Bogdan. 2019. Specifica- tion mining and robust design under uncertainty: A stochastic temporal logic approach.ACM Transactions on Embedded Computing Systems (TECS)18, 5s (2019), 1–21

2019

[19] [19]

Caroline Lemieux, Dennis Park, and Ivan Beschastnikh. 2015. General LTL specification mining (T). In2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 81–92

2015

[20] [20]

Junle Li, Meiqi Tian, and Bingzhuo Zhong. 2025. Automatic Generation of Safety- compliant Linear Temporal Logic via Large Language Model: A Self-supervised Framework.arXiv preprint arXiv:2503.15840(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Törner, Håkan Sivencrona, and Chris- tian Berger. 2024. Engineering safety requirements for autonomous driving with large language models. In2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 218–228

2024

[22] [22]

Amir Pnueli. 1977. The temporal logic of programs. In18th annual symposium on foundations of computer science (sfcs 1977). ieee, 46–57

1977

[23] [23]

2012.Temporal logic

Nicholas Rescher and Alasdair Urquhart. 2012.Temporal logic. Vol. 3. Springer Science & Business Media

2012

[24] [24]

Cumhur Erkan Tuncali, Georgios Fainekos, Hisahiro Ito, and James Kapinski

[25] [25]

In2018 IEEE intelligent vehicles symposium (IV)

Simulation-based adversarial test generation for autonomous vehicles with machine learning components. In2018 IEEE intelligent vehicles symposium (IV). IEEE, 1555–1562

[26] [26]

Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech.31 (2017), 841

2017

[27] [27]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models.arXiv preprint arXiv:2303.182231, 2 (2023)

work page internal anchor Pith review arXiv 2023

[28] [28]

Xiubin Zhu, Dan Wang, Witold Pedrycz, and Zhiwu Li. 2022. Fuzzy rule-based local surrogate models for black-box model explanation.IEEE Transactions on Fuzzy Systems31, 6 (2022), 2056–2064

2022