Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

Alaska Hoffman; Annie Mous; Brian Bergeron; Chris Constantakis; Hunter Goodreau; Patti Hauseman; T.J. Barton

arXiv: 2604.26091 · v1 · submitted 2026-04-28 · cs.AI · cs.CE · cs.MA

Pith reviewed 2026-05-07 16:19 UTC · model grok-4.3

keywords: language model agents · onchain trading · operating layer · reliability · policy validation · real capital · autonomous agents · ETH trading

The pith

Reliability for real-capital AI trading agents comes from the operating layer of controls around the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines autonomous language-model agents that turn user instructions into onchain trades with real ETH. In a 21-day deployment, 3,505 agents produced 7.5 million invocations, roughly 300,000 onchain actions, and about $20M in volume. The authors report that the high settlement rate and the lower error rates did not arise from the base model alone. They came from an added operating layer that compiles prompts, enforces typed controls, validates policies, guards execution, manages memory, and records full traces. Targeted fixes in this layer cut fabricated sell rules from 57 percent to 3 percent, reduced fee-led observations from 32.5 percent to below 10 percent, and raised capital deployment from 42.9 percent to 78 percent in an affected test population. The work argues that agents handling money must be evaluated across the entire path from mandate to validated settlement rather than in text-only settings.

Core claim

In the DX Terminal Pro deployment, user-configured agents executed 7.5 million invocations and roughly 300,000 onchain actions with 99.9 percent settlement success for policy-valid transactions. Reliability emerged from the operating layer consisting of prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch tests revealed failures such as fabricated rules and fee paralysis that text-only benchmarks rarely measure; targeted harness changes reduced fabricated sell rules from 57 percent to 3 percent, reduced fee-led observations from 32.5 percent to below 10 percent, and increased capital deployment from 42.9 percent to 78 percent in an affected test population.

What carries the argument

The operating layer of prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability that surrounds the language model and converts mandates into validated actions.
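To make typed controls and policy validation concrete, here is a minimal Python sketch. It is not the DX Terminal Pro implementation: the `Controls` fields, the per-trade size cap, and the `validate` function are hypothetical names chosen for illustration. The only behavior borrowed from the source is that a proposal failing a check becomes a recorded observation instead of a trade.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class Controls:
    """Hypothetical typed controls; field names and ranges are assumptions."""
    max_trade_fraction: float  # largest share of vault ETH allowed in one trade
    allow_sells: bool

    def __post_init__(self) -> None:
        # Reject malformed configuration before any model output is considered.
        if not 0.0 < self.max_trade_fraction <= 1.0:
            raise ValueError("max_trade_fraction must be in (0, 1]")


@dataclass(frozen=True)
class Trade:
    side: Literal["buy", "sell"]
    token: str
    eth_amount: float


@dataclass(frozen=True)
class Observation:
    note: str


Action = Union[Trade, Observation]


def validate(proposed: Trade, controls: Controls, vault_eth: float) -> Action:
    """Return the trade if it passes every check, otherwise an observation."""
    if proposed.eth_amount <= 0:
        return Observation("rejected: non-positive amount")
    if proposed.side == "sell" and not controls.allow_sells:
        return Observation("rejected: sells disabled by user control")
    if proposed.eth_amount > controls.max_trade_fraction * vault_eth:
        return Observation("rejected: exceeds per-trade size cap")
    return proposed
```

The design point of the sketch is that the model only proposes; a deterministic layer outside the model decides whether the proposal may reach settlement.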

Load-bearing premise

The observed drops in failure modes and the rise in capital deployment were caused by the specific harness changes rather than by other uncontrolled factors during the 21-day live run.

What would settle it

A controlled re-test of the same agent population using the original harness without the targeted changes, checking whether fabricated-rule and fee-paralysis rates return to their pre-change levels of 57 percent and 32.5 percent.

Figures

Figures reproduced from arXiv: 2604.26091 by Alaska Hoffman, Annie Mous, Brian Bergeron, Chris Constantakis, Hunter Goodreau, Patti Hauseman, T.J. Barton.

**Figure 1.** DX Terminal Pro user-facing interface in a test environment, with production scale summary. The interface view is from public design documentation; measurements come from production logs and onchain records (DXRG Team, 2026).

**Figure 2.** Internal model-selection screen before production. The benchmark used 250 real DX ...

**Figure 3.** Agent configuration surface. Sliders expressed prompt-compiled behavioral preferences; ...

**Figure 4.** Pre-launch control metrics under universal application of the harness template across ...

**Figure 5.** Internal EVM DEX swap execution evaluation. The task is an Ethereum buy/sell ...

**Figure 6.** Production slider behavior after the harness was frozen. All five controls preserved ...

Original abstract

We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
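The scale figures in the abstract can be cross-checked with simple arithmetic. The sketch below assumes one decision cycle every 5 minutes. That cadence is suggested by a prompt fragment quoted in the reference graph on this page ("Each action represents ≈ 5 minutes") but is not stated in the abstract, and the variable names are illustrative.

```python
# Figures quoted in the abstract (rounded as reported there).
DEPLOYMENT_DAYS = 21
INVOCATIONS = 7_500_000
ONCHAIN_ACTIONS = 300_000
AGENTS = 3_505

# Assumption: one prompt-state-action cycle every 5 minutes.
MINUTES_PER_CYCLE = 5

# Cycles a continuously active agent could complete over the deployment.
cycles_per_agent = DEPLOYMENT_DAYS * 24 * 60 // MINUTES_PER_CYCLE

# Share of invocations that ended in an onchain action.
action_rate = ONCHAIN_ACTIONS / INVOCATIONS

# Average invocations per agent across the whole population.
mean_invocations_per_agent = INVOCATIONS / AGENTS

print(cycles_per_agent, f"{action_rate:.1%}", round(mean_invocations_per_agent))
```

Under the 5-minute assumption a continuously active agent completes 6,048 cycles in 21 days, which matches the "6,000+" figure. With the rounded counts, about 4% of invocations produced an onchain action, and the population average of roughly 2,140 invocations per agent sits well below the continuous-activity ceiling.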

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A large real-capital onchain deployment of LLM agents shows operating-layer controls can cut specific failures, but the before-after drops lack isolation from live confounders.

The paper's core contribution is a 21-day deployment of 3,505 user-funded LLM agents trading real ETH, generating 7.5M invocations and roughly $20M in volume with 99.9% settlement success for policy-valid transactions. It documents how prompt compilation, typed controls, policy validation, execution guards, memory design, and trace observability were added around the base model, and reports drops in fabricated sell rules (57% to 3%) and fee-led observations (32.5% to under 10%), and a rise in capital deployment (42.9% to 78%). The full trace from user mandate through reasoning, validation, and onchain settlement is a concrete engineering record at scale that most agent papers lack.

Referee Report

2 major / 2 minor

Summary. The paper reports on a 21-day live deployment of 3,505 user-funded onchain language-model agents trading real ETH via DX Terminal Pro. It claims that agent reliability (99.9% settlement success, 7.5M invocations, ~300K onchain actions) emerged primarily from operating-layer controls—prompt compilation, typed controls, policy validation, execution guards, memory design, and trace observability—rather than the base model. Targeted harness changes are said to have reduced fabricated sell rules from 57% to 3%, fee-led observations from 32.5% to <10%, and raised capital deployment from 42.9% to 78% in an affected population, while exposing failure modes (fabricated rules, fee paralysis, numeric anchoring) not captured by text-only benchmarks.

Significance. If the causal attribution to the operating layer holds, the work supplies rare large-scale, real-capital evidence on practical failure modes and mitigation strategies for autonomous agents. The scale (thousands of sequential decisions per agent, 70B tokens) and end-to-end tracing from mandate to settlement would be a useful reference for designing capital-managing systems, provided the improvements can be isolated from deployment confounders.

major comments (2)

[Abstract / Deployment Results] Abstract and the section describing the 21-day deployment: the central claim that reliability 'emerged from the operating layer' and that targeted harness changes produced the reported drops (57%→3% fabricated rules, 42.9%→78% capital deployment) is not supported by any described randomization, A/B splits, regression controls for ETH volatility, user-mandate drift, or pre/post statistical tests. Without such isolation the attribution remains vulnerable to temporal or market confounds.
[Results / Evaluation] The manuscript reports aggregate metrics (7.5M invocations, 99.9% settlement) but provides neither raw per-agent traces, baseline comparisons against unmodified agents, nor details of how the 'affected test population' was selected for the before-after measurements. This weakens the quantitative support for the operating-layer thesis.
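As an illustration of the kind of pre/post test the first major comment asks for, here is a minimal sketch of a pooled two-proportion z-test. The function name and the counts in the example are hypothetical: the page does not report sample sizes, so 100 cases per window is a placeholder, not data from the paper. The test also assumes independent samples, which a same-cohort before-after comparison in a non-stationary market would not satisfy without further modeling.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts chosen only to mirror the reported 57% and 3% rates.
z, p = two_proportion_z(57, 100, 3, 100)
print(f"z = {z:.2f}, p = {p:.2e}")
```

A result like this would quantify how unlikely the drop is under sampling noise alone; it would not by itself rule out the temporal and market confounds the referee lists.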

minor comments (2)

[Methods] Clarify the exact definition and measurement protocol for 'fabricated sell rules' and 'fee-led observations' so that the failure-mode percentages can be reproduced.
[Discussion] The paper would benefit from an explicit limitations subsection discussing the absence of controlled experimentation and the single-market, single-token setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on causal attribution and data transparency in our live deployment study. We address each major comment below. The observational nature of a real-capital system limited experimental controls, but we have revised the manuscript to qualify claims, add limitations discussion, and provide additional methodological details.

Referee: [Abstract / Deployment Results] Abstract and the section describing the 21-day deployment: the central claim that reliability 'emerged from the operating layer' and that targeted harness changes produced the reported drops (57%→3% fabricated rules, 42.9%→78% capital deployment) is not supported by any described randomization, A/B splits, regression controls for ETH volatility, user-mandate drift, or pre/post statistical tests. Without such isolation the attribution remains vulnerable to temporal or market confounds.

Authors: We agree that randomized A/B testing or regression controls would strengthen causal claims. The deployment used live user-funded agents trading real ETH, making randomization impractical without exposing participants to differential risk or breaching platform policies. Harness updates occurred at discrete times, with before-after metrics drawn from the same active agent cohort across those windows. We have added a limitations subsection that explicitly discusses potential confounds including ETH volatility, mandate drift, and temporal effects, and we now qualify the improvements as temporally aligned observations rather than isolated causal effects. Pre/post statistical tests were omitted due to non-stationarity in live trading data.
Revision: partial

Referee: [Results / Evaluation] The manuscript reports aggregate metrics (7.5M invocations, 99.9% settlement) but provides neither raw per-agent traces, baseline comparisons against unmodified agents, nor details of how the 'affected test population' was selected for the before-after measurements. This weakens the quantitative support for the operating-layer thesis.

Authors: Full raw per-agent traces cannot be released due to privacy protections on user strategies and capital positions; we have added an appendix with anonymized example traces and aggregate per-agent statistics. Baseline comparisons against unmodified agents were not feasible, as the operating-layer harness was deployed uniformly from launch. The affected test population comprises agents with sufficient pre- and post-update invocation history during the specific harness change windows; we have now included explicit selection criteria, population sizes, and time windows in the revised results section. These constraints are inherent to real-capital settings and are noted as such.
Revision: partial
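The selection rule described in this response (agents with sufficient invocation history on both sides of a harness change) can be sketched as a simple filter. Everything named here is hypothetical: the `Invocation` record, the `affected_population` function, and the `min_pre` and `min_post` thresholds are illustrative, since the actual criteria and window sizes are not given on this page.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class Invocation:
    agent_id: str
    at: datetime


def affected_population(
    invocations: Iterable[Invocation],
    change_at: datetime,
    min_pre: int,
    min_post: int,
) -> set[str]:
    """Agents with at least min_pre invocations before and min_post after change_at."""
    pre: dict[str, int] = {}
    post: dict[str, int] = {}
    for inv in invocations:
        # Invocations strictly before the change count as "pre"; the rest as "post".
        bucket = pre if inv.at < change_at else post
        bucket[inv.agent_id] = bucket.get(inv.agent_id, 0) + 1
    return {
        agent
        for agent, count in pre.items()
        if count >= min_pre and post.get(agent, 0) >= min_post
    }
```

Publishing the thresholds alongside the resulting population size would let a reader judge how much survivorship the filter introduces.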

Circularity Check

0 steps flagged

No circularity: empirical before-after deployment metrics with no derivations or self-referential claims

The paper reports observational outcomes from a 21-day live deployment of 3,505 agents, including specific reductions in failure modes (fabricated sell rules 57% to 3%, fee-led observations 32.5% to <10%, capital deployment 42.9% to 78%) after targeted harness changes. No equations, parameter fits presented as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described structure. The claim that reliability emerged from the operating layer rests on reported trace data and pre/post changes rather than any derivation that reduces to its inputs by construction. This is a standard empirical deployment study; the skeptic's concern about unisolated confounders addresses causal attribution validity, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical deployment report containing no mathematical derivations, fitted constants, or postulated physical entities; the only background assumptions are standard engineering premises about trace completeness and the interpretability of user-configured policies.

axioms (1)

Domain assumption: The logged traces accurately capture the full path from user mandate to onchain settlement without material omissions or misclassifications.
Invoked when interpreting the 99.9% settlement success and the reported failure-mode reductions.
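One way to test this assumption is to check each logged trace for missing stages. The sketch below is hypothetical: the `Trace` record and `missing_stages` function are illustrative, and the stage names are taken from the abstract's description of the path (user mandate, rendered prompt, reasoning, validation, portfolio state, settlement), not from any published schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Trace:
    """One decision cycle; a stage left as None was not logged."""
    mandate: Optional[str] = None
    rendered_prompt: Optional[str] = None
    reasoning: Optional[str] = None
    validation: Optional[str] = None
    portfolio_state: Optional[str] = None
    settlement: Optional[str] = None


def missing_stages(trace: Trace) -> list[str]:
    """Names of stages absent from the trace, in pipeline order."""
    return [f.name for f in fields(trace) if getattr(trace, f.name) is None]
```

A check of this kind covers omissions only. Misclassification, the other half of the assumption, would need labeled samples to audit.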

pith-pipeline@v0.9.0 · 5620 in / 1331 out tokens · 94914 ms · 2026-05-07T16:19:18.024325+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1] Hard constraints & tool schema

[2] [HIGH] ACTIVE STRATEGIES, but only for Immediate-action or Triggered-action

[3] Sliders: TA, Risk, Size, Hold, Div

[4] buy now",

[LOW] suggestions B. Strategy lifecycle protocol ## ACTIVE STRATEGIES (CURRENT ONLY) RULE: ONLY strategies in this section are binding. IGNORE strategy text from elsewhere. Classify each [HIGH] directive as: - Immediate-action: "buy now", "sell 50%", "liquidate". pending/completed/blocked. - Triggered-action: "if PnL reaches X%", "when price drops Y%". mo...

[5] Is this permitted by all [HIGH] restrictions?

[6] Am I selling for a valid sell reason?

[7] Would this create a same-token round trip without a genuinely new trigger?

[8] Rule A"), or formulas. Interpret strategy constraints LITERALLY

Can I cite exact active strategy text or exact prompt trigger? If ANY check fails, record_observation. Never sell just to buy it back shortly after. H. Memory and tool-output contract ## PREVIOUS DECISIONS Use this history for context. Each action represents ≈ 5 minutes. Do not mistake a single action for completion of a persistent directive. {{- if .Mem...

[9] a new opportunity justifies round-trip cost,

[10] the sold position is the weakest thesis, and

[11] zero-balance requirement

rotation has not happened recently. Stop-loss exits and thesis-broken exits are not rotation; they are risk management. L. Launches and new-coin caps {{- if .Launch }} ## UPCOMING TOKEN LAUNCH A new token will launch {{ .Launch.NextLaunchCountdown }}. {{- if .Launch.DisplayName }} Token: {{ .Launch.DisplayName }} {{- end }} {{- if ge .AssetRiskPreference ...
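The sell checks quoted in entries [5] to [8] above follow an all-must-pass pattern with `record_observation` as the fallback. In the source these are prompt instructions that the model applies to itself. The sketch below is a hypothetical programmatic equivalent, with `SellContext` and `decide_sell` as illustrative names, showing how the same rule could be enforced outside the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SellContext:
    """Answers to the four quoted checks; how each is computed is left open."""
    permitted_by_high_restrictions: bool
    has_valid_sell_reason: bool
    is_round_trip_without_new_trigger: bool
    cites_exact_strategy_or_trigger: bool


def decide_sell(ctx: SellContext) -> str:
    """Return "sell" only if every check passes, else "record_observation"."""
    checks = [
        ctx.permitted_by_high_restrictions,
        ctx.has_valid_sell_reason,
        # A round trip without a genuinely new trigger fails the check.
        not ctx.is_round_trip_without_new_trigger,
        ctx.cites_exact_strategy_or_trigger,
    ]
    return "sell" if all(checks) else "record_observation"
```

Making inaction the default on any failed check is the conservative choice: a wrongly skipped sell costs an opportunity, while a wrongly executed one costs fees and possibly the position.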
