Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
Pith reviewed 2026-05-07 16:19 UTC · model grok-4.3
The pith
Reliability for real-capital AI trading agents comes from the operating layer of controls around the model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the DX Terminal Pro deployment, user-configured agents executed 7.5 million invocations and roughly 300,000 onchain actions with 99.9 percent settlement success for policy-valid transactions. Reliability emerged from the operating layer consisting of prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch tests revealed failures such as fabricated rules and fee paralysis that standard benchmarks miss; targeted harness changes reduced fabricated sell rules from 57 percent to 3 percent, fee-led observations from 32.5 percent to below 10 percent, and increased capital deployment from 42.9 percent to 78 percent in an affected test population.
What carries the argument
The operating layer of prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability that surrounds the language model and converts mandates into validated actions.
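As a rough illustration of what a policy-validation step over typed controls might look like, the sketch below checks a proposed action before it could be submitted. It is not the paper's implementation: `VaultControls`, `ProposedAction`, `validate_action`, and every field on them are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class VaultControls:
    """Hypothetical typed controls a user might set on a vault."""
    max_trade_fraction: float        # largest share of the vault one buy may use
    allowed_tokens: frozenset[str]   # tokens the agent may trade


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical action emitted by the model, before validation."""
    kind: Literal["buy", "sell", "observe"]
    token: str
    fraction: float  # share of the vault (buy) or of the position (sell)


def validate_action(action: ProposedAction, controls: VaultControls) -> list[str]:
    """Return the list of policy violations; an empty list means policy-valid."""
    if action.kind == "observe":
        return []  # observing never moves capital, so nothing to check
    violations = []
    if action.token not in controls.allowed_tokens:
        violations.append(f"token {action.token!r} is not allowed")
    if not 0.0 < action.fraction <= 1.0:
        violations.append("fraction must be in (0, 1]")
    if action.kind == "buy" and action.fraction > controls.max_trade_fraction:
        violations.append("buy exceeds max_trade_fraction")
    return violations
```

Returning every violation rather than stopping at the first keeps the rejection reasons available for trace-level inspection, which fits the observability emphasis above.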
Load-bearing premise
The observed drops in failure modes and the rise in capital deployment were caused by the specific harness changes rather than by other uncontrolled factors during the 21-day live run.
What would settle it
A controlled re-test of the same agent population using the original harness without the targeted changes, checking whether fabricated-rule and fee-paralysis rates return to their pre-change levels of 57 percent and 32.5 percent.
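One way to frame that check is a two-proportion z-test on the failure rate under each harness. The sketch below is illustrative only: the function name `two_proportion_z` is made up here, and any sample sizes passed to it are hypothetical because this summary does not report population sizes. It also assumes independent observations, which sequential decisions from the same agent may well violate.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided pooled z-test for a difference between two failure rates.

    Returns (z, p_value). Assumes independent Bernoulli observations.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, with a hypothetical 200 observations per arm, `two_proportion_z(114, 200, 6, 200)` compares a 57 percent rate against a 3 percent rate.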
Original abstract
We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports on a 21-day live deployment of 3,505 user-funded onchain language-model agents trading real ETH via DX Terminal Pro. It claims that agent reliability (99.9% settlement success, 7.5M invocations, ~300K onchain actions) emerged primarily from operating-layer controls—prompt compilation, typed controls, policy validation, execution guards, memory design, and trace observability—rather than the base model. Targeted harness changes are said to have reduced fabricated sell rules from 57% to 3%, fee-led observations from 32.5% to <10%, and raised capital deployment from 42.9% to 78% in an affected population, while exposing failure modes (fabricated rules, fee paralysis, numeric anchoring) not captured by text-only benchmarks.
Significance. If the causal attribution to the operating layer holds, the work supplies rare large-scale, real-capital evidence on practical failure modes and mitigation strategies for autonomous agents. The scale (thousands of sequential decisions per agent, 70B tokens) and end-to-end tracing from mandate to settlement would be a useful reference for designing capital-managing systems, provided the improvements can be isolated from deployment confounders.
major comments (2)
- [Abstract / Deployment Results] Abstract and the section describing the 21-day deployment: the central claim that reliability 'emerged from the operating layer' and that targeted harness changes produced the reported drops (57%→3% fabricated rules, 42.9%→78% capital deployment) is not supported by any described randomization, A/B splits, regression controls for ETH volatility, user-mandate drift, or pre/post statistical tests. Without such isolation the attribution remains vulnerable to temporal or market confounds.
- [Results / Evaluation] The manuscript reports aggregate metrics (7.5M invocations, 99.9% settlement) but provides neither raw per-agent traces, baseline comparisons against unmodified agents, nor details of how the 'affected test population' was selected for the before-after measurements. This weakens the quantitative support for the operating-layer thesis.
minor comments (2)
- [Methods] Clarify the exact definition and measurement protocol for 'fabricated sell rules' and 'fee-led observations' so that the failure-mode percentages can be reproduced.
- [Discussion] The paper would benefit from an explicit limitations subsection discussing the absence of controlled experimentation and the single-market, single-token setting.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on causal attribution and data transparency in our live deployment study. We address each major comment below. The observational nature of a real-capital system limited experimental controls, but we have revised the manuscript to qualify claims, add limitations discussion, and provide additional methodological details.
- Referee: [Abstract / Deployment Results] Abstract and the section describing the 21-day deployment: the central claim that reliability 'emerged from the operating layer' and that targeted harness changes produced the reported drops (57%→3% fabricated rules, 42.9%→78% capital deployment) is not supported by any described randomization, A/B splits, regression controls for ETH volatility, user-mandate drift, or pre/post statistical tests. Without such isolation the attribution remains vulnerable to temporal or market confounds.
  Authors: We agree that randomized A/B testing or regression controls would strengthen causal claims. The deployment used live user-funded agents trading real ETH, making randomization impractical without exposing participants to differential risk or breaching platform policies. Harness updates occurred at discrete times, with before-after metrics drawn from the same active agent cohort across those windows. We have added a limitations subsection that explicitly discusses potential confounds including ETH volatility, mandate drift, and temporal effects, and we now qualify the improvements as temporally aligned observations rather than isolated causal effects. Pre/post statistical tests were omitted due to non-stationarity in live trading data.
  Revision: partial
- Referee: [Results / Evaluation] The manuscript reports aggregate metrics (7.5M invocations, 99.9% settlement) but provides neither raw per-agent traces, baseline comparisons against unmodified agents, nor details of how the 'affected test population' was selected for the before-after measurements. This weakens the quantitative support for the operating-layer thesis.
  Authors: Full raw per-agent traces cannot be released due to privacy protections on user strategies and capital positions; we have added an appendix with anonymized example traces and aggregate per-agent statistics. Baseline comparisons against unmodified agents were not feasible, as the operating-layer harness was deployed uniformly from launch. The affected test population comprises agents with sufficient pre- and post-update invocation history during the specific harness change windows; we have now included explicit selection criteria, population sizes, and time windows in the revised results section. These constraints are inherent to real-capital settings and are noted as such.
  Revision: partial
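The selection rule the authors describe could look roughly like the following filter. This is a sketch under assumptions: the `Invocation` record, the `select_affected` function, and the `window` and `min_each_side` parameters are all hypothetical, since the rebuttal does not state what counts as "sufficient" history or how the change windows were bounded.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """Hypothetical log record: one agent invocation at a point in time."""
    agent_id: str
    timestamp: float  # seconds since epoch


def select_affected(invocations, change_time, window, min_each_side):
    """Return ids of agents with at least `min_each_side` invocations in both
    [change_time - window, change_time) and [change_time, change_time + window)."""
    before, after = Counter(), Counter()
    for inv in invocations:
        if change_time - window <= inv.timestamp < change_time:
            before[inv.agent_id] += 1
        elif change_time <= inv.timestamp < change_time + window:
            after[inv.agent_id] += 1
    # Counter returns 0 for agents never seen after the change.
    return {a for a in before if before[a] >= min_each_side and after[a] >= min_each_side}
```

Requiring activity on both sides of the change keeps the before and after rates on the same agents, but it also conditions on survival, which is one reason the referee's selection question matters.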
Circularity Check
No circularity: empirical before-after deployment metrics with no derivations or self-referential claims
The paper reports observational outcomes from a 21-day live deployment of 3,505 agents, including specific reductions in failure modes (fabricated sell rules 57% to 3%, fee-led observations 32.5% to <10%, capital deployment 42.9% to 78%) after targeted harness changes. No equations, parameter fits presented as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described structure. The claim that reliability emerged from the operating layer rests on reported trace data and pre/post changes rather than any derivation that reduces to its inputs by construction. This is a standard empirical deployment study; the skeptic's concern about unisolated confounders addresses causal attribution validity, not circularity in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The logged traces accurately capture the full path from user mandate to onchain settlement without material omissions or misclassifications.