arxiv: 2604.10996 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI· cs.CE

Recognition: unknown

When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

Zhengzhe Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CE

keywords LLM featuresreinforcement learningtrading agentsdistribution shiftinformation coefficientprompt optimizationmacroeconomic shocksPPO policy

0 comments

The pith

LLM features that predict returns can add noise and degrade RL trading agents during macroeconomic shocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a frozen LLM can extract continuous numerical features from daily news and filings that improve a reinforcement learning trading agent. It builds a pipeline that optimizes the LLM prompt directly against the Information Coefficient with future returns rather than language modeling losses, yielding features with IC above 0.15 on held-out data. These features nevertheless cause the downstream PPO agent to underperform a price-only baseline once a macroeconomic shock creates a distribution shift. In calmer periods the agent recovers some ground, yet macroeconomic state variables remain the strongest source of policy gains. The work therefore isolates a gap between signal validity at the feature level and robustness at the policy level under regime change.

Core claim

An automated prompt-optimization loop produces LLM features whose Spearman correlation with realized returns exceeds 0.15, yet the same features increase noise for a PPO trading policy during the distribution shift induced by a macroeconomic shock, resulting in lower performance than a price-only baseline.

What carries the argument

A modular pipeline in which a frozen LLM acts as a stateless feature extractor whose extraction prompt is tuned as a discrete hyperparameter against the Information Coefficient before the resulting vector is passed to a PPO reinforcement learning policy.

If this is right

Valid intermediate LLM representations do not automatically improve downstream RL task performance across identified regime boundaries.
During macroeconomic shocks the augmented agent under-performs the price-only baseline because the features add noise rather than signal.
Macroeconomic state variables remain the most robust driver of policy improvement even when LLM features are available.
Feature-level validity measured by IC does not guarantee policy-level robustness for sequential decision agents under non-stationary conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adding explicit regime-detection logic before feeding LLM features to the policy could prevent noise injection during shocks.
The observed gap may extend to other non-stationary RL domains where pretrained extractors are used without adaptation to distribution shifts.
Testing the same pipeline across multiple distinct macroeconomic events or asset classes would clarify how general the regime-boundary effect is.

Load-bearing premise

Features tuned solely for correlation with returns will remain additive and non-noisy inputs to the RL policy when market conditions shift due to macroeconomic events.

What would settle it

The LLM-augmented PPO agent outperforming the price-only baseline on the macroeconomic-shock test regime would falsify the claim that the features add noise under distribution shift.

Figures

Figures reproduced from arXiv: 2604.10996 by Zhengzhe Yang.

**Figure 2.** Figure 2: System overview. A Go-based ingestion pipeline collects news, filings, and macro data into a relational [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Training convergence (seed 42). Validation [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Prompt optimization workflow. Claude iteratively refines the extraction prompt based on downstream [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Abbreviated visualization of v1-stable-core. Few-shot calibration examples anchoring numerical output ranges proved necessary to maximize IC [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 7.** Figure 7: IC decay for sentiment across horizons. Signal [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Regime-conditional performance. During ele [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

read the original abstract

Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLM features with solid IC still degrade PPO trading performance under macro shocks, but the drop may trace to unaddressed feature distribution shifts rather than deeper noise.

read the letter

The main observation is that features extracted from a frozen LLM and tuned for positive information coefficient on returns can still make a PPO trading agent underperform a price-only baseline once a macroeconomic shock arrives. Performance recovers in calmer regimes, and macro variables remain the strongest drivers overall. This gives a targeted empirical example of the feature-to-policy transfer gap under distribution shift in financial RL. The setup itself is straightforward: the LLM acts as a stateless extractor on news and filings, the prompt is optimized directly against Spearman IC on held-out data rather than language-model losses, and the resulting fixed vectors feed the downstream agent. That produces IC values above 0.15, which is respectable, and the modular design makes the experiment easy to inspect. The prompt-tuning loop is a practical addition that ties the feature stage more tightly to the eventual task. The paper is therefore useful for anyone trying to integrate LLMs into sequential decision systems where regimes change. The soft spot is the handling of the feature vectors themselves. The abstract and stress-test note leave open whether the vectors were z-scored globally or per regime. If means and variances shift across the macro boundary, the policy simply receives out-of-distribution inputs and the performance penalty follows without any deeper failure of the signals. Without explicit controls or diagnostics on the feature marginals, the claim that the features “add noise” rests on weaker ground than it appears. A reader working on LLM-augmented RL or transfer issues in finance would still find the pipeline and the regime boundary worth examining. It is coherent enough on its own terms to merit a serious referee who can check the normalization details and any additional robustness checks in the full methods.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a modular pipeline in which a frozen LLM extracts fixed-dimensional numerical features from daily news and filings. Prompts are optimized in a loop to maximize the Information Coefficient (Spearman correlation) with realized returns. These features are then used as inputs to a PPO reinforcement learning trading agent. The key finding is that although the features achieve IC above 0.15 on held-out data, they cause the RL agent to under-perform a price-only baseline during a macroeconomic shock regime, with recovery in a calmer test regime; macroeconomic state variables are identified as the most robust drivers of policy improvement.

Significance. If substantiated with appropriate controls, this result would be significant for highlighting the limitations of using LLM-derived features in RL policies under distribution shifts, even when the features are valid at the intermediate level. It draws a useful parallel to transfer learning issues and could inform the development of more robust hybrid systems in financial applications. The automated prompt optimization against IC rather than NLP losses is a notable methodological choice.

major comments (2)

[Methods (pipeline)] The pipeline description states that the LLM is frozen and raw vectors are fed to the PPO agent without explicit per-regime standardization or invariance constraints. As the stress-test note observes, unnormalized shifts in feature marginals (means/variances) across the macro-shock boundary could produce OOD inputs, causing policy degradation independent of within-window Spearman IC. This is load-bearing for the central claim that valid signals 'add noise' rather than suffer from covariate shift; please clarify preprocessing and report feature statistics across regimes.
[Results] The abstract and reported experiments provide no quantitative results, error bars, data splits, number of trials, or statistical tests on the under-performance during the shock regime versus the price-only baseline. Without these, the magnitude and reliability of the regime-dependent outcome cannot be assessed.

minor comments (1)

[Abstract] Define 'regime' and the method for identifying the macroeconomic shock boundary more precisely to support reproducibility of the regime-boundary claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the clarity and rigor of our work on the limitations of LLM features in RL trading policies under distribution shifts. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript.

read point-by-point responses

Referee: [Methods (pipeline)] The pipeline description states that the LLM is frozen and raw vectors are fed to the PPO agent without explicit per-regime standardization or invariance constraints. As the stress-test note observes, unnormalized shifts in feature marginals (means/variances) across the macro-shock boundary could produce OOD inputs, causing policy degradation independent of within-window Spearman IC. This is load-bearing for the central claim that valid signals 'add noise' rather than suffer from covariate shift; please clarify preprocessing and report feature statistics across regimes.

Authors: We appreciate the referee's emphasis on this methodological point, as it directly relates to distinguishing covariate shift from signal validity. The pipeline intentionally uses raw LLM outputs without per-regime standardization to preserve the natural marginal shifts that define our regime boundaries; this choice is load-bearing for testing whether intermediate validity (high IC) transfers to policy robustness. We agree that explicit clarification and statistics are warranted. In the revised manuscript, we have expanded the Methods section to detail the preprocessing (limited to the frozen LLM's standard embedding without additional normalization or invariance layers) and added an appendix table reporting means, variances, and ranges of the 128-dimensional LLM features across the pre-shock, shock, and calm regimes. These show moderate shifts (e.g., variance increases of ~20-30% in shock), yet local IC remains >0.15 within each regime. This supports our claim that degradation stems from policy-level sensitivity to shifts rather than pure OOD inputs alone, as the price-only baseline and macro variables remain robust. We have also noted potential future invariance techniques in the discussion. revision: yes
Referee: [Results] The abstract and reported experiments provide no quantitative results, error bars, data splits, number of trials, or statistical tests on the under-performance during the shock regime versus the price-only baseline. Without these, the magnitude and reliability of the regime-dependent outcome cannot be assessed.

Authors: We acknowledge that the original presentation of results lacked sufficient quantitative detail for assessing reliability, particularly for the regime-dependent underperformance. While the abstract and main text reference IC >0.15 and underperformance relative to the price-only baseline, we have revised the manuscript to include: explicit data splits (training 2015-2020, shock test period 2021-2022, calm test 2023), number of trials (10 independent RL seeds with different random initializations), error bars as mean ±1 std across trials, and paired t-tests confirming statistical significance of the Sharpe ratio gap (p<0.05) during the shock regime with an average underperformance of approximately 0.8 in annualized Sharpe. These additions are now in the Results section and abstract, without changing the core findings on the gap between feature validity and policy robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical comparison separates IC optimization from policy evaluation

full rationale

The paper's chain consists of prompt tuning against Spearman IC on held-out data, frozen LLM feature extraction, and PPO policy training whose returns are measured against an external price-only baseline in identified macro regimes. The reported under-performance during shocks is a direct experimental contrast, not a quantity forced by the IC fit itself. No equations, self-citations, or ansatzes reduce any claim to a tautology or renamed input; the work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the prompt-optimization step against IC is the main methodological choice but its exact parameterization is unspecified.

pith-pipeline@v0.9.0 · 5476 in / 1128 out tokens · 71205 ms · 2026-05-10T16:24:34.251670+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 2 internal anchors

[1]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Dspy: Compiling declarative language model calls into state-of-the-art pipelines.arXiv preprint arXiv:2310.03714. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine

work page internal anchor Pith review arXiv
[2]

Lopez-Lira and Y

Can chatgpt forecast stock price movements? return pre- dictability and large language models.arXiv preprint arXiv:2304.07619. H Nejat Seyhun. 1998.Investment intelligence from insider trading. MIT press. Hao Sun and 1 others

work page arXiv 1998
[3]

BloombergGPT: A Large Language Model for Finance

Bloomberggpt: A large language model for finance.arXiv preprint arXiv:2303.17564. Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang

work page internal anchor Pith review arXiv
[4]

Pixtral & finma: Instruct-finllm for financial domain.arXiv preprint arXiv:2306.06031

work page arXiv