Characterizing the Robustness of Black-Box LLM Planners Under Perturbed Observations with Adaptive Stress Testing
Pith reviewed 2026-05-22 15:25 UTC · model grok-4.3
The pith
Adaptive stress testing with Monte Carlo tree search uncovers prompt and sensor perturbations that cause LLM planners to hallucinate or crash in driving scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime.
What carries the argument
Adaptive stress testing (AST) using Monte Carlo tree search (MCTS) to explore the combined space of prompt phrasing variations and simulated sensor noise or failures.
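To make the search idea concrete, here is a minimal sketch of MCTS over a small set of discrete perturbations. It is not the paper's implementation: the action names, the `toy_undesirability` scoring function, the tree depth, and the exploration constant are all assumptions chosen for illustration, and a real setup would query the LLM planner in place of the toy score.

```python
import math
import random

# Hypothetical discrete perturbation actions (names are illustrative only).
ACTIONS = ["shuffle_details", "drop_few_shot", "drop_sensor", "add_noise"]

def toy_undesirability(path):
    """Stand-in for querying the planner; returns a score in [0, 1].

    Assumption: a real setup would prompt the LLM with the perturbed
    observation and score its sampled actions instead.
    """
    score = 0.1 * len(set(path))
    if "drop_sensor" in path and "add_noise" in path:
        score += 0.6  # pretend this combination confuses the planner
    return min(score, 1.0)

class Node:
    def __init__(self, path=()):
        self.path = path      # perturbations applied so far
        self.children = {}    # action -> Node
        self.visits = 0
        self.total = 0.0      # sum of rollout scores through this node

def uct_child(node, c=1.0):
    """Pick the child maximizing mean score plus an exploration bonus."""
    def uct(child):
        mean = child.total / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=uct)

def search(depth=3, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node()
    best_path, best_score = (), -1.0
    for _ in range(iterations):
        node, trail = root, [root]
        # Selection: descend while the current node is fully expanded.
        while len(node.path) < depth and len(node.children) == len(ACTIONS):
            node = uct_child(node)
            trail.append(node)
        # Expansion: add one untried perturbation.
        if len(node.path) < depth:
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(node.path + (action,))
            node = node.children[action]
            trail.append(node)
        # Rollout: complete the path with random perturbations.
        path = list(node.path)
        while len(path) < depth:
            path.append(rng.choice(ACTIONS))
        score = toy_undesirability(path)
        if score > best_score:
            best_path, best_score = tuple(path), score
        # Backpropagation: update statistics along the visited trail.
        for visited in trail:
            visited.visits += 1
            visited.total += score
    return root, best_path, best_score
```

Calling `search()` returns the root of the perturbation tree together with the highest-scoring perturbation sequence found; the tree itself can be kept for offline inspection.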
If this is right
- Offline perturbation trees can be generated once and reused to anticipate runtime failures across multiple scenarios.
- Specific sensor failure modes and prompt variations that trigger hallucinations become identifiable without exhaustive manual search.
- The same search procedure applies to any multi-agent driving scenario where observations can be perturbed in these two dimensions.
- Developers gain a concrete list of prompt-sensor combinations to avoid or harden against before deployment.
Where Pith is reading between the lines
- The same AST approach could be applied to LLM planners in non-driving domains where observations come from imperfect perception systems.
- Results imply that robustness testing for language-model agents must jointly consider both linguistic and perceptual sources of error rather than treating them separately.
- Safety validation pipelines for autonomous systems could incorporate automated generation of these perturbation trees as a standard offline check.
Load-bearing premise
The two chosen perturbation dimensions of prompt rephrasing and sensor noise or failure sufficiently represent the observation corruptions that matter for deployed LLM planners, and that the MCTS search will locate the safety-critical failures without missing important ones.
What would settle it
Deploying the perturbation trees found by the method in a higher-fidelity simulator or physical testbed and observing that the language models do not produce the predicted high uncertainty or crashes would show that the discovered cases are not genuine failure modes.
Original abstract
Large language models (LLMs) have recently demonstrated success in decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments where sensors are noisy or unreliable. Characterizing the behavior of LLM planners to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically investigate the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing order of details, modifying access to few-shot examples, etc. Unique to our work, the second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. However, manually covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime. Code is available at https://sites.google.com/illinois.edu/astllm.
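The two perturbation dimensions described in the abstract can be pictured with a small prompt builder. This is a sketch under stated assumptions, not the authors' code: the function name, its parameters, the sensor field names, and the action vocabulary in the final prompt line are all hypothetical.

```python
import random

def perturb_observation(readings, few_shot, rng, drop=(), noise_std=0.0,
                        shuffle=False, use_few_shot=True):
    """Build a prompt from sensor readings under both perturbation dimensions.

    All parameter names here are illustrative assumptions.
    """
    # Sensor dimension: remove unavailable sensors, then add Gaussian noise.
    kept = {k: v for k, v in readings.items() if k not in drop}
    noisy = {k: v + rng.gauss(0.0, noise_std) for k, v in kept.items()}
    # Prompt dimension: reorder details and toggle few-shot examples.
    details = [f"{name}: {value:.1f}" for name, value in noisy.items()]
    if shuffle:
        rng.shuffle(details)
    parts = (list(few_shot) if use_few_shot else []) + details
    # Hypothetical action vocabulary for the planner.
    return "\n".join(parts + ["Choose one action: FASTER, SLOWER, IDLE."])

# Example: a shuffled prompt with one sensor removed and no few-shot examples.
readings = {"ego_speed_mps": 20.0, "lead_gap_m": 15.0, "lead_speed_mps": 18.0}
few_shot = ["Example: lead_gap_m 5.0 -> SLOWER"]
prompt = perturb_observation(readings, few_shot, random.Random(0),
                             drop=("lead_speed_mps",), shuffle=True,
                             use_few_shot=False)
```

Shuffling alone changes only the order of the detail lines, so the prompt stays semantically equivalent, while dropping a sensor or adding noise changes what the planner is told about the scene.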
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Adaptive Stress Testing (AST) with Monte Carlo Tree Search (MCTS) can efficiently discover prompt-phrasing variations and simulated sensor perturbations that cause black-box LLM planners to produce hallucinations, high-uncertainty actions, or crashes in a multi-agent driving environment. It supports this with a manual case study showing failures from both perturbation dimensions and extensive experiments that generate MCTS perturbation trees across scenarios to enable offline robustness analysis.
Significance. If the central claim holds, the work offers a practical extension of stress-testing techniques to LLM-based planners, addressing a timely safety concern in robotics applications where sensor noise and prompt sensitivity can lead to unsafe behavior. The availability of code and the focus on proactive offline analysis are positive aspects that could aid future robustness studies.
major comments (2)
- [Experiments] The experiments do not report a quantitative baseline comparison (e.g., failure discovery rate or severity) between MCTS-guided search and uniform random sampling over the same two-dimensional perturbation space. Without this, it is unclear whether the reported discoveries of high-uncertainty or crashing scenarios are attributable to the AST+MCTS formulation or simply to the chosen scenarios and manual perturbations already noted in the abstract.
- [Method] The parameterization of the perturbation spaces (exact ranges and distributions for sensor noise/failures and prompt variations such as order randomization or few-shot access) and the precise metrics for detecting hallucinations, uncertainty, or crashes are not fully specified. This leaves the central claim only partially supported, as noted in the soundness assessment.
minor comments (2)
- [Abstract] Clarify in the abstract and method how 'high uncertainty' is operationalized for LLM outputs (e.g., via token probabilities, entropy, or output consistency).
- The manuscript would benefit from additional citations to prior work on LLM robustness testing in planning and control tasks.
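One of the options named in the first minor comment, entropy, can be estimated for a black-box model by sampling the planner several times on the same prompt and measuring how spread out the chosen actions are. The sketch below assumes exactly that setup; the function name and the action labels are hypothetical, and the page does not state which operationalization the paper actually uses.

```python
import math
from collections import Counter

def normalized_entropy(samples, action_space):
    """Shannon entropy of sampled actions, scaled to [0, 1].

    Assumes `samples` is non-empty and `action_space` has at least
    two actions; 0 means every sample agreed, 1 means a uniform spread.
    """
    if not samples or len(action_space) < 2:
        raise ValueError("need at least one sample and two possible actions")
    n = len(samples)
    counts = Counter(samples)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Dividing by log|A| normalizes against the maximum possible entropy.
    return max(0.0, entropy / math.log(len(action_space)))

# Hypothetical action labels for illustration.
ACTION_SPACE = ["FASTER", "SLOWER", "IDLE", "LANE_LEFT", "LANE_RIGHT"]
confident = normalized_entropy(["SLOWER"] * 10, ACTION_SPACE)
uncertain = normalized_entropy(ACTION_SPACE * 2, ACTION_SPACE)
```

A threshold on this value would be one way to flag "high uncertainty", though any specific threshold is a design choice that would need to be reported.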
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for recognizing the timeliness of robustness analysis for LLM planners. We address each major comment below and will revise the manuscript to strengthen the presentation of our results and methods.
- Referee: [Experiments] The experiments do not report a quantitative baseline comparison (e.g., failure discovery rate or severity) between MCTS-guided search and uniform random sampling over the same two-dimensional perturbation space. Without this, it is unclear whether the reported discoveries of high-uncertainty or crashing scenarios are attributable to the AST+MCTS formulation or simply to the chosen scenarios and manual perturbations already noted in the abstract.
  Authors: We agree that a quantitative baseline comparison is necessary to isolate the benefit of MCTS-guided search. Our current experiments focus on the structure of the generated perturbation trees and the failure modes they surface across scenarios, but we did not include a direct head-to-head evaluation against uniform random sampling. In the revised manuscript we will add this comparison, reporting failure discovery rates and severity metrics for both MCTS and random sampling over identical perturbation spaces. This addition will make the contribution of the search algorithm explicit.
  Revision: yes
- Referee: [Method] The parameterization of the perturbation spaces (exact ranges and distributions for sensor noise/failures and prompt variations such as order randomization or few-shot access) and the precise metrics for detecting hallucinations, uncertainty, or crashes are not fully specified. This leaves the central claim only partially supported, as noted in the soundness assessment.
  Authors: We acknowledge that the original manuscript describes the two perturbation dimensions at a conceptual level without providing the exact numerical ranges, sampling distributions, or formal detection criteria. To address this, we will expand the method section with precise specifications: sensor noise ranges and failure modes, the exact procedure for prompt-order randomization and few-shot access, and the operational definitions (including thresholds) used to classify hallucinations, high-uncertainty actions, and crashes. These details will be added to support reproducibility and strengthen the soundness of the central claims.
  Revision: yes
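The head-to-head comparison promised in the first response could be organized around a shared query budget and a common failure check. The following is only a sketch of the uniform random sampling side of such a harness: the perturbation values, the `toy_is_failure` oracle, and the function names are hypothetical, and a guided search would plug in as another `propose` function.

```python
import itertools
import random

# Hypothetical two-dimensional perturbation space (illustrative values only).
PROMPT_VARIANTS = ["original", "shuffled", "no_few_shot"]
SENSOR_VARIANTS = ["all_sensors", "no_position", "noisy_velocity"]
SPACE = list(itertools.product(PROMPT_VARIANTS, SENSOR_VARIANTS))

def uniform_proposer(rng):
    """Baseline: draw a (prompt, sensor) perturbation pair uniformly."""
    return rng.choice(SPACE)

def toy_is_failure(perturbation):
    """Stand-in for running the planner and checking for a crash."""
    return perturbation == ("shuffled", "no_position")

def discovery_rate(propose, is_failure, budget, rng):
    """Fraction of `budget` planner queries that surface a failure."""
    hits = sum(1 for _ in range(budget) if is_failure(propose(rng)))
    return hits / budget

baseline_rate = discovery_rate(uniform_proposer, toy_is_failure,
                               budget=500, rng=random.Random(0))
```

Reporting this rate for both samplers under the same budget, ideally over several seeds, would show whether guided search finds failures faster than chance.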
Circularity Check
No circularity: empirical method proposal grounded in simulation results
The paper presents an empirical approach applying adaptive stress testing with MCTS to explore prompt and sensor perturbation spaces for LLM planners in a multi-agent driving simulator. It describes manual perturbation case studies showing hallucinations, then uses the search method to discover additional failure cases, with results reported from simulation experiments. No mathematical derivations, parameter fits, predictions, or self-referential definitions appear in the provided text; the central claim rests on the observed outcomes of the applied search procedure rather than any reduction to inputs by construction. Prior works are referenced for context on prompt perturbations but do not form a load-bearing self-citation chain or uniqueness theorem. The work is self-contained as an application and evaluation of existing techniques.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM planners can be queried as black boxes whose outputs vary meaningfully with prompt phrasing and simulated sensor inputs.
- Domain assumption: Monte Carlo tree search can efficiently explore the combined space of prompt and sensor perturbations to find high-risk scenarios.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). ... three definitions for the undesirability U(stk+1) ... normalized Shannon entropy ... action diversity D ... negative average reward L
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.