Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

Manvendra Modgil

arXiv: 2606.19386 · v1 · pith:KR6LDCODnew · submitted 2026-06-15 · 💻 cs.SE · cs.AI · cs.LG

Pith reviewed 2026-06-27 02:15 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG

keywords: runtime monitoring, state monitors, agent systems, wall-clock calibration, saturation trap, leaky integrator, moment detection

The pith

Wall-clock-calibrated monitors on agent streams enter a saturation trap at every realistic cadence and never detect moments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that monitors whose state decays according to wall-clock time have only two regimes: a constant alarm at inter-action intervals of 1 s or less and silence at 60 s or more, with every critical interval between the two falling in (1, 30] s. Real agent runs, with a median latency of 1.53 s, sit inside the trap regime, so the monitors cannot serve as moment detectors. The same streams fed to a sample-time accumulator such as CUSUM yield firing behaviour that is identical at every tested interval. The trap is therefore a structural property of wall-clock calibration rather than of any particular internal state engine.

Core claim

Wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; every critical dt lies in (1,30]s and real agent runs measure latency inside the trap regime.

What carries the argument

Wall-clock accumulator that integrates error over real seconds rather than sample count, producing a dt-dependent cliff between constant firing at dt<=1s and silence at dt>=60s.
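
The cliff can be made concrete with a worked fixed point. This is a sketch under stated assumptions, not the paper's notation: it assumes the minimal accumulator is a leaky sum with half-life $T_{1/2}$ in seconds, a constant per-action error $e$, and a level threshold $\theta > e$. All symbols here are illustrative.

```latex
\[
  x_{n+1} = x_n \, e^{-\lambda \Delta t} + e_{n+1},
  \qquad \lambda = \frac{\ln 2}{T_{1/2}}
\]
For a constant error $e_n = e > 0$ and $\Delta t > 0$, the level converges to
\[
  x^{*} = \frac{e}{1 - e^{-\lambda \Delta t}} ,
\]
so $x^{*} \to \infty$ as $\Delta t \to 0$ (at $\Delta t = 0$ the decay factor is
exactly $1$ and the state is a running sum), and $x^{*} \to e$ as
$\Delta t \to \infty$. The level trigger can stay on only if $x^{*} > \theta$:
\[
  \frac{e}{1 - e^{-\lambda \Delta t}} > \theta
  \iff
  \Delta t < \frac{1}{\lambda} \ln\frac{\theta}{\theta - e} .
\]
```

Under these assumptions the right-hand side plays the role of a critical interval: below it the level saturates above the threshold and stays there, above it the level never reaches the threshold. Neither side contains a regime in which the level crosses the threshold only at isolated moments.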

If this is right

Transition detection with hysteresis produces 0-3 firings per trajectory at every tested cadence.
A sample-time CUSUM on the same stream remains exactly invariant to dt.
Any wall-clock half-life or decay constant places the operating point inside the trap for observed agent latencies.
The published saturation trap on SWE-bench agents is explained by dt=0 inputs rather than the affect engine itself.
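
The dt-invariance in the second point follows from the form of a sample-time recursion: elapsed time never enters the update. A minimal sketch, assuming a one-sided CUSUM with reset on alarm; the function name, the reference value `k`, and the threshold `h` are illustrative choices, not the paper's configuration.

```python
def cusum_alarms(errors, k=0.5, h=1.5):
    """One-sided CUSUM over per-sample errors; returns alarm indices.

    Assumed form: S_n = max(0, S_{n-1} + x_n - k), alarm when S_n > h,
    then reset. No timestamp or dt appears anywhere in the update.
    """
    s = 0.0
    alarms = []
    for i, x in enumerate(errors):
        s = max(0.0, s + x - k)
        if s > h:
            alarms.append(i)
            s = 0.0  # restart accumulation after an alarm
    return alarms


# Synthetic error stream (not data from the paper): a run of seven failures.
errors = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
alarms = cusum_alarms(errors)
```

Because the function takes no timing argument, replaying the same error stream at any inter-action spacing returns the same alarm indices.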

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers who need reliable moment detection on irregular streams must either switch to sample-time calibration or add explicit edge detection outside the integrator.
The same dt cliff will appear in any EMA-style baseline or affective-state monitor whose decay is expressed in seconds.
Intervention timing recovered by human observers cannot be reproduced by these monitors without additional timing logic.
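
The "explicit edge detection outside the integrator" mentioned in the first point can be sketched as a two-threshold (Schmitt-style) trigger applied to whatever level the monitor outputs. This is an illustration only: the function names and the `high` and `low` thresholds are hypothetical, and the paper's T3 trigger may be defined differently.

```python
def rising_edge_firings(levels, high=0.7, low=0.4):
    """Fire once when the level rises through `high`; re-arm only after it
    has fallen back to `low` or below (hysteresis)."""
    armed = True
    fired = []
    for i, x in enumerate(levels):
        if armed and x >= high:
            fired.append(i)
            armed = False
        elif not armed and x <= low:
            armed = True
    return fired


def level_firings(levels, high=0.7):
    """Plain level trigger: fires on every sample at or above `high`."""
    return [i for i, x in enumerate(levels) if x >= high]


# Synthetic saturating level (not data from the paper).
levels = [0.10, 0.30, 0.60, 0.75, 0.90, 0.95, 0.90, 0.80]
edge = rising_edge_firings(levels)   # fires at the single rising edge
level = level_firings(levels)        # fires on every saturated sample
```

On a level that ramps up and stays high, the edge trigger fires once while the level trigger fires on every subsequent sample, which is the level-versus-edge contrast the page describes.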

Load-bearing premise

The uniform-interval sweep on twenty trajectories with a minimal wall-clock accumulator captures the dynamics that matter for absence of a moment-detection regime.

What would settle it

Measure the number of threshold crossings per trajectory when the identical error stream is fed to the wall-clock monitor at fixed dt values spaced through (1,30]s; the count should remain either near-maximal or near-zero in every interval.
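
A sweep of this kind can be sketched in a few lines. The sketch assumes a leaky accumulator with a half-life in seconds and a fixed level threshold; the half-life, threshold, dt grid, and the synthetic error stream are all placeholders, not the paper's settings or data.

```python
import math


def alarm_count(errors, dt, half_life_s=10.0, threshold=3.0):
    """Samples at or above threshold for a wall-clock leaky accumulator.

    Assumed update: state <- state * 2**(-dt / half_life_s) + error.
    """
    decay = math.exp(-math.log(2.0) * dt / half_life_s)  # exactly 1.0 at dt == 0
    state = 0.0
    count = 0
    for e in errors:
        state = state * decay + e
        if state >= threshold:
            count += 1
    return count


def sweep(errors, dts, **kwargs):
    """Alarm count at each uniform inter-action spacing in dts (seconds)."""
    return {dt: alarm_count(errors, dt, **kwargs) for dt in dts}


# Synthetic stream: 12 consecutive unit errors (not data from the paper).
dts = [0, 1, 2, 5, 10, 30, 60]
counts = sweep([1.0] * 12, dts)
```

On this synthetic stream the count is near its maximum at small dt and zero at large dt, with the drop confined to a narrow band of the grid. The real test replaces the synthetic stream with each trajectory's error stream and a finer grid inside (1, 30] s.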

Figures

Figures reproduced from arXiv: 2606.19386 by Manvendra Modgil.

**Figure 2.** Post-crossing persistence vs ∆t: 20 thin per-trajectory lines (grey), mean (blue), with the critical band (1, 30]s shaded. Measured real agent cadence (median 1.53 s, p90 2.33 s; dotted) sits deep inside the constant-alarm regime.

**Figure 3.** Frustration vs action index for one trajectory (…)

**Figure 4.** Zero-parameter predicted critical ∆t (orange points) against interval-censored observations (blue bars) for the 5 pilot trajectories: right order of magnitude, but only a minority land inside their interval.

**Figure 5.** Live instrumented run (real wall-clock ∆t, median 1.53 s): modelled frustration ramps from 0.10 to saturation; T3 fires once at the rising edge (star, 0.603 → 0.750) while A6 is true on 25 of 32 actions (orange shading), showing the level-vs-edge contrast in live operation.

Original abstract

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper isolates calibration type as the driver of a saturation trap on agent streams and shows CUSUM avoids it, but the uniform-dt sweep leaves the variable-dt case as the weakest link.

The core point is that wall-clock calibrated leaky-integrator monitors hit a regime cliff on agent streams at the cadences that actually occur, while sample-time ones do not. The paper uses the earlier dt=0 mistake as a natural experiment to separate the two calibration classes and runs a pre-registered sweep that maps the cliff clearly.

What the work does cleanly is show the structural difference. On the 20 trajectories the wall-clock version produces constant alarms at dt ≤ 1 s and silence at dt ≥ 60 s, with every transition dt falling in (1, 30] s. Real measured latencies sit inside that band. The minimal accumulator reproduces the same step, CUSUM stays invariant across all dt, and the rising-edge trigger with hysteresis stays low and stable. That comparison is useful and directly addresses the prior saturation-trap claim.

The soft spot is the uniform fixed-dt design of the sweep. Agent streams have irregular intervals that can bunch or stretch inside a single run, and the central claim is that no moment-detection regime exists under those conditions. Short clusters could push saturation while long gaps allow decay, potentially creating crossings that fixed-dt cases miss. The abstract states the minimal model reproduces the cliff but does not say it was exercised on the actual timestamp sequences. If the full paper does not add variable-dt runs or show the accumulator on real traces, the "no regime" conclusion rests on an approximation whose representativeness is not yet demonstrated.
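
The concern can be illustrated with a toy stream. The sketch below assumes a leaky accumulator whose per-step decay is computed from the actual gap between timestamps; the half-life, threshold, timestamps, and errors are synthetic placeholders, so it says nothing about what the paper's 20 trajectories would show.

```python
import math


def levels_from_timestamps(timestamps, errors, half_life_s=10.0):
    """Wall-clock leaky accumulator driven by real inter-action gaps."""
    lam = math.log(2.0) / half_life_s
    state, prev_t, levels = 0.0, None, []
    for t, e in zip(timestamps, errors):
        dt = 0.0 if prev_t is None else t - prev_t
        state = state * math.exp(-lam * dt) + e
        prev_t = t
        levels.append(state)
    return levels


def upward_crossings(levels, threshold=3.0):
    """Number of times the level moves from below threshold to at/above it."""
    return sum(
        1
        for i, x in enumerate(levels)
        if x >= threshold and (i == 0 or levels[i - 1] < threshold)
    )


errors = [1.0] * 10
# Two bursts at 1 s spacing separated by a 120 s gap (synthetic).
bursty = [0, 1, 2, 3, 4, 124, 125, 126, 127, 128]
# Same number of actions over the same span, uniformly spaced.
uniform = [i * 128 / 9 for i in range(10)]

bursty_crossings = upward_crossings(levels_from_timestamps(bursty, errors))
uniform_crossings = upward_crossings(levels_from_timestamps(uniform, errors))
```

In this toy case the bursty timing produces two separate crossings while the uniform timing with the same mean spacing produces none, which is the kind of behaviour a fixed-dt sweep cannot exhibit and the reason the variable-dt runs matter.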

This is worth referee time for anyone designing runtime monitors for agents or other irregular streams. The empirical pattern and the calibration distinction are concrete enough to review even if the variable-dt gap requires more data.

Referee Report

3 major / 2 minor

Summary. The paper claims that wall-clock-calibrated leaky-integrator monitors (e.g., affect models) on autonomous agent streams admit no moment-detection regime. A pre-registered uniform-dt sweep on 20 trajectories reveals a sharp cliff: constant alarms at dt≤1s, silence at dt≥60s, and all critical values in (1,30]s. The real agent median latency of 1.53s falls inside the trap. A minimal wall-clock accumulator reproduces the cliff on the error stream, while sample-time CUSUM is exactly dt-invariant and a rising-edge trigger with hysteresis fires only 0-3 times per trajectory in every condition. The paper attributes the structure to the calibration class rather than the specific engine, and corrects the prior saturation-trap report as a pure-accumulator artifact.

Significance. If the central empirical distinction holds, the result is significant for runtime monitoring in agent systems: it shows that wall-clock calibration (common in EMA and affective baselines) systematically precludes reliable moment detection at realistic cadences, while sample-time methods escape the trap. Strengths include the pre-registered sweep, explicit reproduction of the cliff by a minimal accumulator, and the clean dt-invariance result for CUSUM; these provide falsifiable, reproducible evidence separating calibration classes.

major comments (3)

[Abstract / empirical sweep] Abstract and empirical sweep section: the claim that wall-clock monitors admit no moment-detection regime on agent streams rests on a uniform fixed-dt sweep (dt ∈ {0..600}s); the manuscript states the minimal accumulator reproduces the cliff but does not report results when the accumulator is driven by the actual variable timestamp sequences of the 20 trajectories, leaving open whether short-dt clusters or long gaps could produce selective crossings absent from the uniform case.
[Real agent runs] Real-agent latency mapping: the median 1.53s (p90 2.33s) is asserted to place real coding cadence inside the trap, yet no verification is given that the 20 trajectories' inter-action distributions are representative or that the uniform sweep bounds the behavior under the observed variable-dt statistics.
[Minimal accumulator comparison] CUSUM invariance result: while the paper correctly shows sample-time CUSUM is dt-invariant (20/20), the corresponding wall-clock accumulator is only shown under uniform dt; without the variable-dt exercise, the contrast does not yet fully establish that the absence of a moment-detection regime is a general property of the calibration class under realistic agent timing.

minor comments (2)

[Abstract] Abstract: no error bars, confidence intervals, or raw per-trajectory firing counts are reported for the 20/20 constant-alarm result, reducing immediate assessability of robustness.
[Introduction / erratum note] The erratum correction for the prior dt=0 artifact is noted but the manuscript could more explicitly tabulate how the corrected wall-clock decay parameter interacts with the observed latency distribution.
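
One way such a tabulation could look is the fraction of state that survives a single inter-action gap. The latencies below are the median and p90 quoted in the abstract; the half-life values and the function name are hypothetical, since this page does not report the engine's decay parameters.

```python
def retention_per_step(dt_s, half_life_s):
    """Fraction of accumulated state surviving one gap of dt_s seconds,
    assuming exponential decay with the given half-life."""
    return 2.0 ** (-dt_s / half_life_s)


# Latencies from the abstract; half-lives are hypothetical placeholders.
latencies_s = {"median": 1.53, "p90": 2.33}
half_lives_s = [5.0, 30.0, 120.0]

table = {
    (name, h): round(retention_per_step(dt, h), 3)
    for name, dt in latencies_s.items()
    for h in half_lives_s
}
```

A retention close to 1 at the observed latencies means almost nothing decays between actions, so the monitor behaves like the pure accumulator of the original dt=0 runs.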

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which correctly identify the need to extend our analysis to variable inter-action timings drawn from the trajectories. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract / empirical sweep] Abstract and empirical sweep section: the claim that wall-clock monitors admit no moment-detection regime on agent streams rests on a uniform fixed-dt sweep (dt ∈ {0..600}s); the manuscript states the minimal accumulator reproduces the cliff but does not report results when the accumulator is driven by the actual variable timestamp sequences of the 20 trajectories, leaving open whether short-dt clusters or long gaps could produce selective crossings absent from the uniform case.

Authors: We agree that the variable-dt case for the minimal wall-clock accumulator must be reported to close this gap. In the revision we will add results obtained by driving the accumulator with the actual timestamp sequences from each of the 20 trajectories; these will show whether clusters or gaps produce crossings outside the uniform-sweep regimes.
revision: yes

Referee: [Real agent runs] Real-agent latency mapping: the median 1.53s (p90 2.33s) is asserted to place real coding cadence inside the trap, yet no verification is given that the 20 trajectories' inter-action distributions are representative or that the uniform sweep bounds the behavior under the observed variable-dt statistics.

Authors: The reported latency statistics were computed directly from the same 20 trajectories. To verify that the uniform sweep bounds the observed variable-dt statistics, the revision will include a supplementary comparison of the empirical inter-action distribution against the critical interval (1,30]s, confirming that the bulk of real cadences lie inside the trap regime.
revision: yes

Referee: [Minimal accumulator comparison] CUSUM invariance result: while the paper correctly shows sample-time CUSUM is dt-invariant (20/20), the corresponding wall-clock accumulator is only shown under uniform dt; without the variable-dt exercise, the contrast does not yet fully establish that the absence of a moment-detection regime is a general property of the calibration class under realistic agent timing.

Authors: We accept that the wall-clock accumulator must also be evaluated on the variable timestamps to complete the calibration-class contrast. The revision will incorporate this analysis (as noted in the response to the first comment), thereby demonstrating that the dt-invariance distinction holds under the trajectories' actual timing.
revision: yes

Circularity Check

0 steps flagged

Minor self-citation for prior mechanism; central empirical claim independent

The paper's derivation rests on a pre-registered uniform-dt sweep across 20 trajectories plus real-agent latency measurements (median 1.53 s) that place observed cadences inside the identified trap regime. A minimal wall-clock accumulator is shown to reproduce the same cliff while CUSUM remains invariant. The single self-citation (Modgil 2026) is invoked only to describe the corrected mechanism and the original dt=0 flaw; it does not supply the load-bearing evidence for the new claim that wall-clock monitors admit no moment-detection regime. No fitted parameters are renamed as predictions, no equations reduce by construction to their inputs, and the result is not forced by any uniqueness theorem or ansatz imported from prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the uniform-interval sweep captures the relevant dynamics and that the minimal accumulator reproduces the essential behavior of the original affect engine. No free parameters are fitted; the result is presented as a structural property of the calibration class.

axioms (2)

standard math: Exponential decay operates only when dt > 0.
Invoked when explaining why dt=0 produced a pure accumulator in the original experiment.
domain assumption: Uniform sampling of dt intervals is representative of agent cadence distributions.
Used to map the (1,30]s critical band onto the real median latency of 1.53s.

Reference graph

Works this paper leans on

20 extracted references · 5 canonical work pages

[12] Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.09024

[13] Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 2nd edition, 2004.

[14] Gary Lorden. Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, 42(6):1897-1908, 1971. doi:10.1214/aoms/1177693055

[15] Manvendra Modgil. The saturation trap and the subjectivity of intervention timing: Why affect-based triggers and LLM judges fail to time interventions on autonomous agents, 2026. arXiv:2606.04296

[16] E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100-115, 1954. doi:10.1093/biomet/41.1-2.100

[17] Barbara Plank. The "problem" of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 10671-10682, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.731

[18] Otto H. Schmitt. A thermionic trigger. Journal of Scientific Instruments, volume 15, pages 24-26, 1938. doi:10.1088/0950-7671/15/1/305. Origin of the two-threshold (hysteresis) trigger; the bistable primitive our edge trigger implements.

[19] Charles Truong, Laurent Oudre, and Nicolas Vayatis. Selective review of offline change point detection methods. Signal Processing, 167:107299, 2020. doi:10.1016/j.sigpro.2019.107299

[20] Haoyu Wang, Christopher M. Poskitt, Jiali Wei, and Jun Sun. ProbGuard: Probabilistic runtime monitoring for LLM agent safety, 2025. arXiv:2508.00500. Note: arXiv:2508.00500 appears under the title "ProbGuard" on the abstract page and "Pro2Guard" on an earlier PDF version; cite as ProbGuard per the current abstract page.

[21] Haoyu Wang, Christopher M. Poskitt, and Jun Sun. AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents. In Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE), 2026. arXiv:2503.18666; accepted to ICSE 2026.

[22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2306.05685
