Characterizing the Robustness of Black-Box LLM Planners Under Perturbed Observations with Adaptive Stress Testing

John Pohovey; Katherine Driggs-Campbell; Melkior Ornik; Neeloy Chakraborty

arXiv: 2505.05665 · v4 · submitted 2025-05-08 · 💻 cs.RO · cs.AI · cs.CL

Neeloy Chakraborty, John Pohovey, Melkior Ornik, Katherine Driggs-Campbell

Pith reviewed 2026-05-22 15:25 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL

keywords LLM planners · robustness · adaptive stress testing · perturbed observations · autonomous driving · hallucination · Monte Carlo tree search · multi-agent systems

The pith

Adaptive stress testing with Monte Carlo tree search uncovers prompt and sensor perturbations that cause LLM planners to hallucinate or crash in driving scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models applied to planning tasks produce unsafe or uncertain outputs when observations arrive through noisy sensors or through prompts that rephrase the same facts in different orders or with different examples. Manual tests in a multi-agent driving simulator already demonstrate hallucinations under both kinds of change. Because the full space of possible rephrasings and sensor failures is too large to check by hand, the authors cast the search for failures as an adaptive stress testing problem solved by Monte Carlo tree search. This method builds trees of perturbations and focuses search on branches that produce high uncertainty or outright crashes. A reader would care because the technique gives a practical way to surface hidden failure modes before an LLM planner is placed in a vehicle where sensor problems are routine.

Core claim

Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime.

What carries the argument

Adaptive stress testing (AST) using Monte Carlo tree search (MCTS) to explore the combined space of prompt phrasing variations and simulated sensor noise or failures.
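
As a rough illustration of how such a search might be organized, the sketch below runs a small UCT-style Monte Carlo tree search over sequences of perturbation actions. Everything named here is an assumption made for illustration: the action names in `ACTIONS`, the `mock_undesirability` function that stands in for querying a real LLM planner, and the hyperparameters. It is a toy under those assumptions, not the authors' implementation.

```python
import math
import random

# Hypothetical perturbation actions; the paper's actual action set is not given here.
ACTIONS = ("shuffle_details", "drop_examples", "drop_sensor", "add_noise")


def mock_undesirability(state):
    """Stand-in for querying the planner: returns a score in [0, 1].

    Assumption: combining a sensor dropout with shuffled details is the
    hidden failure mode the search is supposed to find.
    """
    score = 0.1 * len(set(state))
    if "drop_sensor" in state and "shuffle_details" in state:
        score += 0.6
    return min(score, 1.0)


class Node:
    def __init__(self, state):
        self.state = state      # tuple of perturbation actions applied so far
        self.children = {}      # action -> Node
        self.visits = 0
        self.total = 0.0        # sum of rewards backed up through this node


def uct_child(node, c):
    """Pick the child maximizing mean reward plus an exploration bonus."""
    def score(child):
        mean = child.total / child.visits
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return mean + bonus
    return max(node.children.values(), key=score)


def search(iterations=200, depth=3, c=1.0, seed=0):
    rng = random.Random(seed)
    root = Node(())
    best = (0.0, ())
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend while the current node is fully expanded.
        while len(node.state) < depth and len(node.children) == len(ACTIONS):
            node = uct_child(node, c)
            path.append(node)
        # Expansion: add one untried perturbation if depth allows.
        if len(node.state) < depth:
            untried = [a for a in ACTIONS if a not in node.children]
            action = rng.choice(untried)
            child = Node(node.state + (action,))
            node.children[action] = child
            node = child
            path.append(node)
        # Rollout: finish the sequence with random perturbations.
        state = node.state
        while len(state) < depth:
            state += (rng.choice(ACTIONS),)
        reward = mock_undesirability(state)
        if reward > best[0]:
            best = (reward, state)
        # Backpropagation: credit every node on the selected path.
        for visited in path:
            visited.visits += 1
            visited.total += reward
    return root, best
```

Swapping `mock_undesirability` for a function that builds the perturbed prompt, queries the planner several times, and scores the responses would turn the toy into a real harness. The tree rooted at `root` records which perturbation branches attracted the most visits.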

If this is right

Offline perturbation trees can be generated once and reused to anticipate runtime failures across multiple scenarios.
Specific sensor failure modes and prompt variations that trigger hallucinations become identifiable without exhaustive manual search.
The same search procedure applies to any multi-agent driving scenario where observations can be perturbed in these two dimensions.
Developers gain a concrete list of prompt-sensor combinations to avoid or harden against before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same AST approach could be applied to LLM planners in non-driving domains where observations come from imperfect perception systems.
Results imply that robustness testing for language-model agents must jointly consider both linguistic and perceptual sources of error rather than treating them separately.
Safety validation pipelines for autonomous systems could incorporate automated generation of these perturbation trees as a standard offline check.

Load-bearing premise

The argument rests on two assumptions: that the chosen perturbation dimensions, prompt rephrasing and sensor noise or failure, sufficiently represent the observation corruptions that matter for deployed LLM planners, and that the MCTS search locates the safety-critical failures without missing important ones.

What would settle it

Deploying the perturbation trees found by the method in a higher-fidelity simulator or physical testbed and observing that the language models do not produce the predicted high uncertainty or crashes would show that the discovered cases are not genuine failure modes.

Original abstract

Large language models (LLMs) have recently demonstrated success in decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments where sensors are noisy or unreliable. Characterizing the behavior of LLM planners to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically investigate the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing order of details, modifying access to few-shot examples, etc. Unique to our work, the second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. However, manually covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime. Code is available at https://sites.google.com/illinois.edu/astllm.
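
The two perturbation dimensions described in the abstract can be pictured with a small prompt builder. The sketch below is only a guess at the general shape: the function name `build_prompt`, its parameters, the sensor names, and the example text are all hypothetical and are not taken from the paper or its code release.

```python
import random

# Hypothetical scene description, few-shot example, and sensor readings.
DETAILS = [
    "Ego vehicle is in lane 2 of 4.",
    "Traffic ahead is slowing.",
]
EXAMPLES = ["Example: lead vehicle 5 m ahead -> SLOWER"]
SENSORS = {"lead_distance_m": 12.0, "ego_speed_mps": 25.0}


def build_prompt(details, examples, sensors, rng, shuffle=False,
                 use_examples=True, dropped=frozenset(), noise_std=0.0):
    """Assemble one perturbed prompt.

    Phrasing dimension: `shuffle` reorders the facts, `use_examples`
    toggles access to few-shot examples.
    Sensing dimension: `dropped` removes sensors entirely, `noise_std`
    adds zero-mean Gaussian noise to each remaining reading.
    """
    readings = []
    for name, value in sensors.items():
        if name in dropped:
            continue  # simulate a failed sensor or detection algorithm
        noisy = value + rng.gauss(0.0, noise_std) if noise_std > 0 else value
        readings.append(f"{name}: {noisy:.1f}")
    facts = list(details) + readings
    if shuffle:
        rng.shuffle(facts)  # same facts, different order
    header = list(examples) if use_examples else []
    return "\n".join(header + facts)
```

Calling `build_prompt(DETAILS, EXAMPLES, SENSORS, random.Random(0), shuffle=True, dropped={"lead_distance_m"})` would produce one point in the combined perturbation space. A search procedure would choose these arguments instead of a human.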

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AST-MCTS finds some prompt and sensor failure cases for LLM planners, but the paper does not compare the search method against a baseline.

The paper takes adaptive stress testing with MCTS and uses it to search for prompt phrasing changes plus simulated sensor noise that make black-box LLM planners produce high-uncertainty outputs or crashes in a multi-agent driving simulator. The authors first run a manual case study to show that both perturbation types can trigger hallucinations, then switch to the automated search because full enumeration is infeasible. Experiments across scenarios generate trees of perturbations that surface risky cases for later offline review, and the code is released publicly.

Referee Report

2 major / 2 minor

Summary. The paper claims that Adaptive Stress Testing (AST) with Monte Carlo Tree Search (MCTS) can efficiently discover prompt-phrasing variations and simulated sensor perturbations that cause black-box LLM planners to produce hallucinations, high-uncertainty actions, or crashes in a multi-agent driving environment. It supports this with a manual case study showing failures from both perturbation dimensions and extensive experiments that generate MCTS perturbation trees across scenarios to enable offline robustness analysis.

Significance. If the central claim holds, the work offers a practical extension of stress-testing techniques to LLM-based planners, addressing a timely safety concern in robotics applications where sensor noise and prompt sensitivity can lead to unsafe behavior. The availability of code and the focus on proactive offline analysis are positive aspects that could aid future robustness studies.

major comments (2)

[Experiments] The experiments do not report a quantitative baseline comparison (e.g., failure discovery rate or severity) between MCTS-guided search and uniform random sampling over the same two-dimensional perturbation space. Without this, it is unclear whether the reported discoveries of high-uncertainty or crashing scenarios are attributable to the AST+MCTS formulation or simply to the chosen scenarios and manual perturbations already noted in the abstract.
[Method] The parameterization of the perturbation spaces (exact ranges and distributions for sensor noise/failures and prompt variations such as order randomization or few-shot access) and the precise metrics for detecting hallucinations, uncertainty, or crashes are not fully specified. This leaves the central claim only partially supported, as noted in the soundness assessment.
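
The baseline the first comment asks for could be set up roughly as follows: draw perturbation sequences uniformly at random under the same evaluation budget as the guided search, then compare failure discovery rates. The sketch is a toy under stated assumptions. The action names, `toy_score`, and the threshold are hypothetical, and a real comparison would score sequences by querying the planner.

```python
import itertools
import random

# Hypothetical perturbation actions shared by both search strategies.
ACTIONS = ("shuffle_details", "drop_examples", "drop_sensor", "add_noise")


def toy_score(state):
    """Stand-in undesirability in [0, 1]; a real harness would query the planner."""
    score = 0.1 * len(set(state))
    if "drop_sensor" in state and "shuffle_details" in state:
        score += 0.6
    return min(score, 1.0)


def all_sequences(depth):
    """Every perturbation sequence of the given depth (toy spaces only)."""
    return list(itertools.product(ACTIONS, repeat=depth))


def uniform_random_sequences(budget, depth, seed):
    """Baseline: sample `budget` sequences uniformly at random."""
    rng = random.Random(seed)
    return [tuple(rng.choice(ACTIONS) for _ in range(depth))
            for _ in range(budget)]


def discovery_rate(scores, threshold):
    """Fraction of evaluated sequences whose score meets the failure threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)
```

Feeding the scores logged by a guided search and the scores of `uniform_random_sequences` into `discovery_rate` with the same budget and threshold gives the head-to-head number. In a space small enough to enumerate, `all_sequences` provides the true failure density as a reference point.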

minor comments (2)

[Abstract] Clarify in the abstract and method how 'high uncertainty' is operationalized for LLM outputs (e.g., via token probabilities, entropy, or output consistency).
The manuscript would benefit from additional citations to prior work on LLM robustness testing in planning and control tasks.
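
One candidate operationalization of "high uncertainty", in line with the output-consistency option raised in the abstract comment, is the normalized Shannon entropy of the actions a planner returns when the same prompt is sampled repeatedly. The sketch below assumes a discrete action set of known size. The action labels are hypothetical, and the paper's own definition may differ in detail.

```python
import math
from collections import Counter


def normalized_entropy(actions, num_actions):
    """Shannon entropy of the empirical action distribution, scaled to [0, 1].

    0 means every sample agreed; 1 means the samples were spread uniformly
    over all `num_actions` possible actions.
    """
    if num_actions < 2 or not actions:
        return 0.0
    counts = Counter(actions)
    total = len(actions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_actions)


# Hypothetical samples: the same perturbed prompt queried five times.
consistent = ["slower"] * 5
split = ["slower", "lane_left", "slower", "lane_left", "idle"]
```

Here `normalized_entropy(consistent, 5)` is zero because all samples agree, while `normalized_entropy(split, 5)` falls strictly between 0 and 1. A threshold on this value would then flag prompts as high-uncertainty.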

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the timeliness of robustness analysis for LLM planners. We address each major comment below and will revise the manuscript to strengthen the presentation of our results and methods.

Referee: [Experiments] The experiments do not report a quantitative baseline comparison (e.g., failure discovery rate or severity) between MCTS-guided search and uniform random sampling over the same two-dimensional perturbation space. Without this, it is unclear whether the reported discoveries of high-uncertainty or crashing scenarios are attributable to the AST+MCTS formulation or simply to the chosen scenarios and manual perturbations already noted in the abstract.

Authors: We agree that a quantitative baseline comparison is necessary to isolate the benefit of MCTS-guided search. Our current experiments focus on the structure of the generated perturbation trees and the failure modes they surface across scenarios, but we did not include a direct head-to-head evaluation against uniform random sampling. In the revised manuscript we will add this comparison, reporting failure discovery rates and severity metrics for both MCTS and random sampling over identical perturbation spaces. This addition will make the contribution of the search algorithm explicit.

revision: yes

Referee: [Method] The parameterization of the perturbation spaces (exact ranges and distributions for sensor noise/failures and prompt variations such as order randomization or few-shot access) and the precise metrics for detecting hallucinations, uncertainty, or crashes are not fully specified. This leaves the central claim only partially supported, as noted in the soundness assessment.

Authors: We acknowledge that the original manuscript describes the two perturbation dimensions at a conceptual level without providing the exact numerical ranges, sampling distributions, or formal detection criteria. To address this, we will expand the method section with precise specifications: sensor noise ranges and failure modes, the exact procedure for prompt-order randomization and few-shot access, and the operational definitions (including thresholds) used to classify hallucinations, high-uncertainty actions, and crashes. These details will be added to support reproducibility and strengthen the soundness of the central claims.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal grounded in simulation results

The paper presents an empirical approach applying adaptive stress testing with MCTS to explore prompt and sensor perturbation spaces for LLM planners in a multi-agent driving simulator. It describes manual perturbation case studies showing hallucinations, then uses the search method to discover additional failure cases, with results reported from simulation experiments. No mathematical derivations, parameter fits, predictions, or self-referential definitions appear in the provided text; the central claim rests on the observed outcomes of the applied search procedure rather than any reduction to inputs by construction. Prior works are referenced for context on prompt perturbations but do not form a load-bearing self-citation chain or uniqueness theorem. The work is self-contained as an application and evaluation of existing techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on standard assumptions from robotics simulation and LLM prompting literature rather than new free parameters or invented entities; the perturbation dimensions are treated as given inputs for the search method.

axioms (2)

domain assumption: LLM planners can be queried as black boxes whose outputs vary meaningfully with prompt phrasing and simulated sensor inputs.
This underpins the definition of the two perturbation dimensions and is invoked throughout the abstract's description of the case study and method.

domain assumption: Monte Carlo tree search can efficiently explore the combined space of prompt and sensor perturbations to find high-risk scenarios.
This is the core justification for proposing AST-MCTS as a solution to the infeasibility of manual coverage.

pith-pipeline@v0.9.0 · 5817 in / 1290 out tokens · 58415 ms · 2026-05-22T15:25:35.283015+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). ... three definitions for the undesirability U(stk+1) ... normalized Shannon entropy ... action diversity D ... negative average reward L

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.