Robust Exploratory Stopping under Ambiguity in Reinforcement Learning

Hoi Ying Wong; Junyan Ye; Kyunghyun Park

arxiv: 2510.10260 · v2 · submitted 2025-10-11 · math.OC · math.PR · q-fin.MF · stat.ML


Pith reviewed 2026-05-18 07:40 UTC · model grok-4.3

classification: math.OC · math.PR · q-fin.MF · stat.ML

keywords: robust reinforcement learning, optimal stopping, ambiguity, g-expectation, backward stochastic differential equations, exploratory control, Bernoulli controls, policy iteration


The pith

Optimal stopping under ambiguity can be solved by reformulating it as a robust exploratory control problem using Bernoulli-distributed controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a continuous-time reinforcement learning approach for optimal stopping decisions when the agent faces ambiguity about the environment. It models ambiguity through the g-expectation framework as multiple probability measures dominated by a reference measure. The stopping problem is rewritten as an exploratory control task where the control is chosen from a Bernoulli distribution. Backward stochastic differential equations characterize the optimal Bernoulli control, which defines a robust exploratory stopping time that approximates the true optimal stopping time under ambiguity. A policy iteration theorem is established and converted into a practical reinforcement learning algorithm whose performance is checked in numerical experiments.

Core claim

The optimal stopping problem under ambiguity is reformulated as a robust exploratory control problem with Bernoulli distributed controls. The optimal Bernoulli control is characterized using backward stochastic differential equations, allowing construction of a robust exploratory stopping time that approximates the optimal stopping time under ambiguity. A policy iteration theorem is established and implemented as a reinforcement learning algorithm.

What carries the argument

The robust exploratory control problem with Bernoulli distributed controls, characterized via backward stochastic differential equations, which enables construction of the robust exploratory stopping time.

If this is right

The robust exploratory stopping time approximates the optimal stopping time under ambiguity.
A policy iteration theorem holds and yields a convergent reinforcement learning algorithm.
The algorithm exhibits convergence, robustness, and scalability across different levels of ambiguity and exploration.
The framework simultaneously supports robust decision-making and learning about the unknown environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The Bernoulli reformulation might extend to other continuous-time control problems that require both robustness and exploration.
Similar exploratory control techniques could simplify handling of model uncertainty in non-stopping decision tasks.
Testing the approach on discrete-time or path-dependent problems would check whether the BSDE characterization generalizes.

Load-bearing premise

The g-expectation framework with a reference measure and a family of dominated measures accurately represents the agent's ambiguity about the environment, and the resulting BSDE characterization remains valid for the exploratory Bernoulli control reformulation of the stopping problem.
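
For readers less familiar with nonlinear expectations, the standard textbook definition of a g-expectation can be sketched as follows. The symbols $\xi$, $W$, $Z$, $\kappa$, $\theta$, and $\mathcal{P}^\kappa$ are illustrative and are not taken from the paper, whose driver may carry additional structure. The $\kappa$-ignorance driver is a common special case, chosen here only as an example.

```latex
% Conditional g-expectation of a terminal payoff \xi, defined through a BSDE
% with driver g satisfying g(s, 0) = 0 and a uniform Lipschitz condition in z:
\[
  Y_t = \xi + \int_t^T g(s, Z_s)\,ds - \int_t^T Z_s\,dW_s ,
  \qquad
  \mathcal{E}^g_t[\xi] := Y_t .
\]
% Illustrative special case (kappa-ignorance): g(s, z) = -\kappa |z|.
% The g-expectation is then a worst case over drift-perturbed measures
% that are dominated by the reference measure P:
\[
  \mathcal{E}^g_t[\xi]
  = \operatorname*{ess\,inf}_{Q \in \mathcal{P}^\kappa}
    \mathbb{E}^{Q}\bigl[\xi \mid \mathcal{F}_t\bigr],
  \qquad
  \mathcal{P}^\kappa
  = \Bigl\{ Q^\theta :
      \frac{dQ^\theta}{dP}
      = \exp\Bigl( \int_0^T \theta_s\,dW_s
                   - \tfrac12 \int_0^T |\theta_s|^2\,ds \Bigr),
      \ |\theta_s| \le \kappa \Bigr\}.
\]
```

In this special case $\kappa$ plays the role of an ambiguity level: $\kappa = 0$ recovers the ordinary conditional expectation under the reference measure.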

What would settle it

A simulation in which the constructed robust exploratory stopping time fails to approximate the optimal stopping time under ambiguity, or in which the policy iteration algorithm does not converge to a useful policy, would falsify the central claims.

Original abstract

We propose and analyze a continuous-time robust reinforcement learning framework for optimal stopping under ambiguity. In this framework, an agent chooses a robust exploratory stopping time motivated by two objectives: robust decision-making under ambiguity and learning about the unknown environment. Here, ambiguity refers to considering multiple probability measures dominated by a reference measure, reflecting the agent's awareness that the reference measure representing her learned belief about the environment would be erroneous. Using the $g$-expectation framework, we reformulate the optimal stopping problem under ambiguity as a robust exploratory control problem with Bernoulli distributed controls. We then characterize the optimal Bernoulli distributed control via backward stochastic differential equations and, based on this, construct the robust exploratory stopping time that approximates the optimal stopping time under ambiguity. Last, we establish a policy iteration theorem and implement it as a reinforcement learning algorithm. Numerical experiments demonstrate the convergence, robustness, and scalability of our reinforcement learning algorithm across different levels of ambiguity and exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete way to handle optimal stopping under ambiguity in continuous-time RL by mixing g-expectations with Bernoulli exploration and a policy-iteration algorithm, though the BSDE step after randomization needs explicit condition checks.


This paper sets up a continuous-time RL method for optimal stopping when the agent does not fully trust the reference probability measure. It models ambiguity with g-expectations, recasts the stopping problem as a robust control task using Bernoulli-distributed controls for exploration, characterizes the optimum through BSDEs, and derives a policy-iteration result that produces a practical RL algorithm. Numerical tests are included to show convergence and behavior across ambiguity levels.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a continuous-time robust reinforcement learning framework for optimal stopping under ambiguity, modeled using g-expectations with a reference measure and dominated alternatives. It reformulates the problem as a robust exploratory control problem with Bernoulli-distributed controls, characterizes the optimal such control via backward stochastic differential equations, constructs a robust exploratory stopping time approximating the ambiguous optimum, establishes a policy iteration theorem, and implements the result as a reinforcement learning algorithm. Numerical experiments are presented to illustrate convergence, robustness, and scalability across ambiguity and exploration levels.

Significance. If the central claims hold, the work offers a theoretically grounded approach to combining ambiguity aversion with exploratory control in continuous-time stopping problems, potentially advancing robust RL methods that handle model uncertainty. The policy iteration theorem and algorithmic implementation provide a concrete bridge from BSDE theory to practice, and the numerical results suggest applicability in settings with varying ambiguity. Strengths include the use of standard g-expectation tools and the explicit construction of an approximating stopping time.

major comments (2)

[BSDE characterization of optimal Bernoulli control] The reformulation of the ambiguous optimal stopping problem as a robust exploratory control problem with Bernoulli-distributed controls (as described in the abstract and the section on reformulation) leads to a claimed BSDE characterization of the optimal control. However, the manuscript does not explicitly verify that the effective driver obtained after taking the expectation over the Bernoulli variable continues to satisfy the uniform Lipschitz and monotonicity conditions required by the standard BSDE representation theorem over the entire family of dominated measures. This verification is load-bearing for the validity of the characterization, the construction of the approximating stopping time, and the subsequent policy iteration theorem.
[Construction of robust exploratory stopping time] The construction of the robust exploratory stopping time as an approximation to the optimal stopping time under ambiguity lacks quantitative error bounds, convergence rates, or sensitivity analysis with respect to the ambiguity level and exploration parameter. Without these, it is difficult to assess how well the constructed stopping time approximates the robust optimum, particularly since these parameters appear to be chosen post-hoc in the numerical experiments.
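
The inheritance question raised in the first major comment can be made concrete with an elementary estimate. The names $f_0$, $f_1$, $p$, and $L$ are placeholders introduced here; the sketch does not reproduce the paper's actual driver and only indicates what a verification would have to cover.

```latex
% Assumption: f_0 and f_1 are L-Lipschitz in (y, z), and p in [0, 1] is FIXED.
% Effective driver after averaging over the Bernoulli variable:
\[
  f^{p}(y, z) := p\, f_1(y, z) + (1 - p)\, f_0(y, z).
\]
% The Lipschitz constant is inherited by the convex combination:
\begin{align*}
  \bigl| f^{p}(y, z) - f^{p}(y', z') \bigr|
  &\le p\,\bigl| f_1(y, z) - f_1(y', z') \bigr|
     + (1 - p)\,\bigl| f_0(y, z) - f_0(y', z') \bigr| \\
  &\le L \bigl( |y - y'| + |z - z'| \bigr).
\end{align*}
% If p is a feedback map p(y, z), the difference picks up the extra term
%   ( p(y, z) - p(y', z') ) ( f_1(y', z') - f_0(y', z') ),
% which is not controlled by L alone and needs a separate bound.
```

The last remark is why the check is not automatic once the Bernoulli parameter depends on the solution itself.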

minor comments (2)

[Numerical experiments] The numerical experiments section would benefit from additional details on the specific stochastic environments, discretization schemes, and quantitative performance metrics (e.g., regret or approximation error) to better support claims of convergence and scalability.
[Preliminaries] Notation for the g-expectation and the family of dominated measures could be clarified with explicit definitions early in the paper to aid readability for readers less familiar with nonlinear expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and the constructive major comments. We address each point below and indicate the revisions we will make to strengthen the paper.


Referee: [BSDE characterization of optimal Bernoulli control] The reformulation of the ambiguous optimal stopping problem as a robust exploratory control problem with Bernoulli-distributed controls (as described in the abstract and the section on reformulation) leads to a claimed BSDE characterization of the optimal control. However, the manuscript does not explicitly verify that the effective driver obtained after taking the expectation over the Bernoulli variable continues to satisfy the uniform Lipschitz and monotonicity conditions required by the standard BSDE representation theorem over the entire family of dominated measures. This verification is load-bearing for the validity of the characterization, the construction of the approximating stopping time, and the subsequent policy iteration theorem.

Authors: We agree that an explicit verification is necessary to rigorously justify the application of the standard BSDE representation theorem. The effective driver is a convex combination (via the Bernoulli expectation) of the original driver evaluated at the two admissible control values. Under the standing assumptions that the original driver is uniformly Lipschitz continuous and monotone, and given that the Bernoulli parameter lies in [0,1] and the ambiguity set is dominated and bounded, these properties pass to the effective driver uniformly over the family of measures. We will insert a dedicated lemma immediately after the reformulation section that states and proves this inheritance, thereby closing the gap in the argument for the BSDE characterization, the stopping-time construction, and the policy-iteration theorem.
revision: yes
Referee: [Construction of robust exploratory stopping time] The construction of the robust exploratory stopping time as an approximation to the optimal stopping time under ambiguity lacks quantitative error bounds, convergence rates, or sensitivity analysis with respect to the ambiguity level and exploration parameter. Without these, it is difficult to assess how well the constructed stopping time approximates the robust optimum, particularly since these parameters appear to be chosen post-hoc in the numerical experiments.

Authors: We acknowledge that quantitative error bounds would improve the assessment of the approximation quality. The construction is obtained by thresholding the solution of the characterizing BSDE; its convergence to the ambiguous optimum as the exploration parameter tends to zero follows from the continuity of g-expectations with respect to the ambiguity parameter and the stability of BSDE solutions. In the revision we will add a new subsection containing (i) a qualitative convergence statement under the current assumptions and (ii) an expanded numerical study that reports the dependence of the realized stopping time and value on a grid of ambiguity and exploration levels. Deriving explicit rates would require additional regularity on the driver that is not assumed in the present framework and is therefore noted as future work.
revision: partial
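
The limit in which the exploration parameter tends to zero can be illustrated numerically at the level of the driver term N(R−y)π − λH(π) quoted in the Lean section of this page. The sketch below is not the authors' algorithm. It assumes that H is the negative binary entropy, π log π + (1−π) log(1−π), under which the maximizing Bernoulli parameter is logistic and the maximized term is a scaled softplus. The function names `bernoulli_policy`, `regularized_value`, and `penalty` are hypothetical.

```python
import numpy as np

def bernoulli_policy(gap, N, lam):
    """Maximizer of N*gap*pi - lam*H(pi) over pi in [0, 1].

    Assumes H is the negative binary entropy (an assumption, not from the paper).
    The sigmoid is written with tanh so that large arguments do not overflow.
    """
    x = N * np.asarray(gap, dtype=float) / lam
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def regularized_value(gap, N, lam):
    """Maximized term lam * log(1 + exp(N*gap/lam)), computed stably."""
    x = N * np.asarray(gap, dtype=float) / lam
    return lam * np.logaddexp(0.0, x)

def penalty(gap, N):
    """Unregularized term N * max(gap, 0), the formal lam -> 0 limit."""
    return np.maximum(N * np.asarray(gap, dtype=float), 0.0)

if __name__ == "__main__":
    gaps = np.linspace(-1.0, 1.0, 201)  # hypothetical values of R - y
    N = 10.0                            # hypothetical penalization level
    for lam in (1.0, 0.1, 0.01):
        err = np.max(np.abs(regularized_value(gaps, N, lam) - penalty(gaps, N)))
        # Under the entropy assumption the gap is at most lam * log(2).
        print(f"lam={lam:g}  max gap={err:.6f}  bound={lam * np.log(2.0):.6f}")
```

Under this assumption the regularized term differs from N(R−y)⁺ by at most λ log 2, uniformly in R−y. That is a pointwise statement about the driver only; a bound on the stopping time or the value would still need the BSDE stability argument the authors mention.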

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard g-expectation and BSDE theory to a reformulated problem


The paper's chain begins with the standard g-expectation framework (a pre-existing nonlinear expectation tool) to reformulate ambiguous optimal stopping as a robust exploratory control problem using Bernoulli controls. It then invokes the established BSDE representation for the optimal control in this setting and constructs the stopping time from that characterization. The policy iteration theorem is derived as a new result for the resulting RL algorithm. None of these steps reduce by construction to the paper's own inputs or fitted parameters; each applies external, independently verifiable mathematical machinery to the reformulated object. No self-citation is load-bearing for the core claims, and the derivation remains self-contained against external benchmarks in stochastic control theory.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the g-expectation representation of ambiguity and the existence of solutions to the associated BSDEs; no new free parameters are introduced beyond the ambiguity and exploration levels already standard in the cited literature, and no new entities are postulated.

axioms (2)

domain assumption: Ambiguity is represented by a family of probability measures absolutely continuous with respect to a reference measure, captured via g-expectation.
  Invoked in the reformulation of the optimal stopping problem under ambiguity.
domain assumption: The robust exploratory control problem with Bernoulli distributed controls admits a characterization via backward stochastic differential equations.
  Central step used to construct the approximate stopping time.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We reformulate the optimal stopping problem under ambiguity as a robust exploratory control problem with Bernoulli distributed controls. We then characterize the optimal Bernoulli distributed control via backward stochastic differential equations
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

$Y^{x;N,\lambda}_t = \operatorname{ess\,sup}_{\pi} \mathcal{E}^g_t\bigl[J^{x;N,\lambda,\pi}_t\bigr]$ with driver $F^{x;N,\lambda}$ containing $N(R-y)\pi - \lambda H(\pi)$
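
A short calculation shows how a Bernoulli parameter could be read off from the driver term quoted above. It assumes, as a labeled guess rather than a statement from the paper, that $H$ is the negative binary entropy $H(\pi) = \pi\log\pi + (1-\pi)\log(1-\pi)$, that $\pi$ ranges over $[0,1]$, and that $N > 0$ and $\lambda > 0$. The shorthand $a = N(R-y)$ is introduced here.

```latex
% The map pi -> a*pi - lambda*H(pi) is strictly concave on (0, 1),
% so the first-order condition identifies the maximizer.
\begin{align*}
  \frac{\partial}{\partial \pi}\bigl[ a\pi - \lambda H(\pi) \bigr]
  &= a - \lambda \log\frac{\pi}{1-\pi} = 0
  \quad\Longrightarrow\quad
  \pi^{*} = \frac{1}{1 + e^{-a/\lambda}}, \\[4pt]
  \sup_{\pi \in [0,1]} \bigl[ a\pi - \lambda H(\pi) \bigr]
  &= \lambda \log\bigl( 1 + e^{a/\lambda} \bigr)
  \;\xrightarrow[\;\lambda \downarrow 0\;]{}\;
  a^{+} = N\,(R-y)^{+}.
\end{align*}
```

Under this reading, stopping becomes more likely as the reward $R$ exceeds the current value $y$, and the regularized term collapses to a penalization term $N(R-y)^{+}$ when exploration is switched off.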

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.