Interpretable Systematic Risk around the Clock
Pith reviewed 2026-05-10 12:26 UTC · model grok-4.3
The pith
Decomposing market jumps via LLM-classified news shows macroeconomic announcements carry the largest and most persistent risk premium.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Combining high-frequency market data with news narratives identified as jump causes and classified by a state-of-the-art open-source reasoning LLM decomposes systematic jump risk into interpretable categories. These categories display clear heterogeneity in risk premia, with macroeconomic news delivering the largest and most persistent premium. This insight supports the construction of an annually rebalanced, real-time Fama-MacBeth factor-mimicking portfolio that isolates the most strongly priced jump risk and achieves high out-of-sample Sharpe ratios and significant alphas relative to standard factor models.
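The alpha claim can be made concrete: regress the portfolio's excess returns on factor returns and test whether the intercept is significantly positive. A minimal sketch on simulated data (the factor series, return parameters, and the exaggerated 10 bp/day alpha are placeholders, not the paper's estimates):

```python
import numpy as np

def alpha_tstat(port_ret, factor_rets):
    """OLS of portfolio excess returns on factor returns with an intercept.

    Returns (alpha, t-stat) using classical (non-robust) standard errors.
    """
    X = np.column_stack([np.ones(len(port_ret)), factor_rets])
    coef = np.linalg.lstsq(X, port_ret, rcond=None)[0]
    resid = port_ret - X @ coef
    sigma2 = resid @ resid / (len(port_ret) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[0], coef[0] / np.sqrt(cov[0, 0])

rng = np.random.default_rng(0)
T = 1000
mkt = rng.normal(0.0004, 0.01, T)                   # placeholder market factor
port = 0.001 + 1.2 * mkt + rng.normal(0, 0.005, T)  # exaggerated 10 bp/day alpha
alpha, t_alpha = alpha_tstat(port, mkt[:, None])
```

A production version would use HAC (Newey-West) standard errors and the full set of benchmark factors rather than a single simulated one.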
What carries the argument
LLM-based classification of the contemporaneous news narratives that cause market jumps: the classification decomposes total jump risk into interpretable categories and isolates those with priced premia for portfolio construction.
If this is right
- Macroeconomic news jumps exhibit the largest and most persistent risk premia among all categories.
- The real-time Fama-MacBeth portfolio isolating priced jump risk achieves high out-of-sample Sharpe ratios.
- This portfolio generates significant alphas after controlling for standard factor models.
- Around-the-clock data uncovers priced risks invisible in daytime-only samples.
- LLM narrative classification enables practical, interpretable identification of systematic risks.
Where Pith is reading between the lines
- The same classification pipeline could be applied to bonds, currencies, or commodities to extract priced jump factors in those markets.
- As LLM accuracy on financial news improves, the separation between priced and unpriced jump categories may become sharper.
- Regulators might monitor real-time category exposures to detect emerging concentrations in macroeconomic jump risk.
- Traditional statistical factor models could be augmented with these narrative-derived factors for better explanatory power.
Load-bearing premise
The open-source reasoning LLM accurately and without systematic bias identifies the true underlying cause of each market jump from the available news narratives.
What would settle it
If the annually rebalanced factor-mimicking portfolio for the highest-premium jump category fails to deliver a high out-of-sample Sharpe ratio or significant alphas on new data periods.
Original abstract
In this paper, I present the first comprehensive, around-the-clock analysis of systematic jump risk by combining high-frequency market data with contemporaneous news narratives identified as the underlying causes of market jumps. These narratives are retrieved and classified using a state-of-the-art open-source reasoning LLM. Decomposing market risk into interpretable jump categories reveals significant heterogeneity in risk premia, with macroeconomic news commanding the largest and most persistent premium. Leveraging this insight, I construct an annually rebalanced real-time Fama-MacBeth factor-mimicking portfolio that isolates the most strongly priced jump risk, achieving a high out-of-sample Sharpe ratio and delivering significant alphas relative to standard factor models. The results highlight the value of around-the-clock analysis and LLM-based narrative understanding for identifying and managing priced risks in real time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver the first around-the-clock decomposition of systematic jump risk by pairing high-frequency market data with contemporaneous news narratives whose causes are classified by a state-of-the-art open-source reasoning LLM. It reports significant heterogeneity in jump-risk premia across categories (macroeconomic news commanding the largest and most persistent premium) and constructs an annually rebalanced real-time Fama-MacBeth factor-mimicking portfolio that isolates the most strongly priced category, achieving high out-of-sample Sharpe ratios and significant alphas relative to standard factor models.
Significance. If the LLM classifications prove accurate and unbiased, the work would provide a valuable contribution by linking interpretable news-driven jump categories to priced risk premia and by demonstrating a practical, real-time portfolio construction that generates alphas. The out-of-sample testing and annual rebalancing are strengths that support the portfolio claim; however, the absence of any validation for the core classification step limits the immediate impact.
major comments (3)
- [Methodology] The methodology section provides no details on jump detection thresholds, window sizes, or data exclusion rules (listed as free parameters in the analysis). These choices directly determine which jumps enter the LLM classification step and therefore affect all downstream heterogeneity results and portfolio performance metrics.
- [LLM classification procedure] No human validation, inter-rater checks, prompt-robustness tests, or alternative-LLM comparisons are reported for the LLM-based cause classification. Because the category-specific betas, risk premia, and the selection of the 'most strongly priced' category for the mimicking portfolio rest entirely on these labels, the lack of validation is load-bearing for the central claims of heterogeneity and out-of-sample Sharpe ratios.
- [Portfolio construction] The annually rebalanced Fama-MacBeth mimicking portfolio is built directly from the same in-sample jump-risk premia estimates used to document heterogeneity; while the out-of-sample test mitigates some circularity, the choice of which category to isolate is informed by the very heterogeneity the paper measures, requiring explicit discussion of potential selection bias.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from an explicit statement of the sample period, data sources (e.g., specific high-frequency index or futures), and number of jumps analyzed.
- [Empirical results] Notation for the category-specific jump-risk premia and the mimicking-portfolio weights should be defined more clearly before the empirical results are presented.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We appreciate the emphasis on methodological transparency, validation of the LLM classifications, and potential biases in portfolio construction. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to strengthen the paper.
Point-by-point responses
Referee: [Methodology] The methodology section provides no details on jump detection thresholds, window sizes, or data exclusion rules (listed as free parameters in the analysis). These choices directly determine which jumps enter the LLM classification step and therefore affect all downstream heterogeneity results and portfolio performance metrics.
Authors: We agree that the lack of specific details on these parameters is a shortcoming that needs to be addressed. In the revised manuscript, we will expand the methodology section to include a comprehensive description of the jump detection algorithm, specifying the exact thresholds used (such as the multiple of standard deviation for identifying jumps), the window sizes for volatility estimation, and all data exclusion rules (e.g., handling of market closures, low-volume periods, or overnight returns). Additionally, we will include robustness analyses showing how variations in these parameters affect the main results on risk premia heterogeneity and portfolio performance. revision: yes
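A minimal sketch of how such free parameters bite, using a simple rolling-volatility threshold rule (a generic stand-in; the paper's actual detector is not specified in the abstract). The `window` and `c` values below are illustrative choices, and shifting either changes which observations are flagged as jumps:

```python
import numpy as np

def detect_jumps(returns, window=78, c=4.0):
    """Flag returns whose magnitude exceeds c times a rolling local
    volatility estimate. `window` and `c` are exactly the kind of free
    parameters the referee asks to see reported and varied."""
    r = np.asarray(returns, float)
    flags = np.zeros(len(r), dtype=bool)
    for t in range(window, len(r)):
        local_vol = r[t - window:t].std(ddof=1)
        if local_vol > 0 and abs(r[t]) > c * local_vol:
            flags[t] = True
    return flags

rng = np.random.default_rng(1)
r = rng.normal(0, 0.001, 500)   # placeholder high-frequency returns
r[300] += 0.02                  # inject one large jump
jumps = detect_jumps(r)
```

Note that the injected jump itself inflates the rolling volatility for the next `window` observations, so nearby smaller jumps would be masked — one concrete way the detector's parameters feed into the downstream category counts.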
Referee: [LLM classification procedure] No human validation, inter-rater checks, prompt-robustness tests, or alternative-LLM comparisons are reported for the LLM-based cause classification. Because the category-specific betas, risk premia, and the selection of the 'most strongly priced' category for the mimicking portfolio rest entirely on these labels, the lack of validation is load-bearing for the central claims of heterogeneity and out-of-sample Sharpe ratios.
Authors: This is a valid concern, as the reliability of the LLM classifications is central to our findings. Although the original manuscript did not include explicit validation steps, we recognize their importance. In the revision, we will add a new subsection on classification validation. This will include: (1) human annotation of a stratified random sample of 300 jump events by two independent annotators, with inter-rater agreement metrics (e.g., Cohen's kappa); (2) comparison of LLM outputs against these human labels to report accuracy, precision, and recall per category; (3) sensitivity tests to prompt variations and temperature settings; and (4) a comparison with classifications from an alternative model such as Llama-3 or GPT-4o. We believe these additions will substantiate the use of the LLM and bolster confidence in the heterogeneity results. revision: yes
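The proposed inter-rater check is straightforward to sketch. Cohen's kappa compares observed annotator agreement to the agreement expected by chance; the category labels below are hypothetical placeholders:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    chance = sum(ca[k] * cb[k] for k in ca) / n ** 2
    return (observed - chance) / (1 - chance)

# Hypothetical labels from two annotators over six jump events
ann1 = ["macro", "macro", "earnings", "geopolitics", "macro", "earnings"]
ann2 = ["macro", "earnings", "earnings", "geopolitics", "macro", "macro"]
kappa = cohens_kappa(ann1, ann2)   # 4/6 observed agreement, kappa = 5/11
```

Per-category precision and recall against the adjudicated human labels would then follow from the same confusion counts.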
Referee: [Portfolio construction] The annually rebalanced Fama-MacBeth mimicking portfolio is built directly from the same in-sample jump-risk premia estimates used to document heterogeneity; while the out-of-sample test mitigates some circularity, the choice of which category to isolate is informed by the very heterogeneity the paper measures, requiring explicit discussion of potential selection bias.
Authors: We thank the referee for highlighting this potential issue of selection bias. It is true that the choice of the macroeconomic news category for the mimicking portfolio is guided by the full-sample heterogeneity analysis. However, the annual rebalancing and out-of-sample evaluation are performed in a forward-looking manner using only information available at the time of rebalancing. To address the concern explicitly, we will revise the portfolio construction section to discuss the selection process in detail, including why macro news was chosen based on economic rationale and persistence. Furthermore, we will add robustness checks: (i) results for mimicking portfolios based on all categories, (ii) a version where the category is selected using only the first half of the sample and held fixed thereafter, and (iii) a discussion of how this affects the interpretation of the out-of-sample alphas and Sharpe ratios. We maintain that the real-time nature and out-of-sample testing provide substantial protection against overfitting, but agree that explicit discussion is warranted. revision: partial
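Robustness check (ii) can be sketched as: choose the category on the first half of the sample only, then score it on the untouched second half. The category names and return parameters below are simulated placeholders, with drift exaggerated so the selection is unambiguous:

```python
import numpy as np

def split_sample_selection(category_rets):
    """Pick the category with the highest Sharpe ratio on the first half of
    the sample, then report its Sharpe ratio on the second half only.
    Freezing the choice after the first half removes selection look-ahead
    from the out-of-sample number."""
    def sharpe(r):
        return r.mean() / r.std(ddof=1) * np.sqrt(252)  # annualized, daily data

    T = min(len(r) for r in category_rets.values())
    half = T // 2
    best = max(category_rets, key=lambda k: sharpe(category_rets[k][:half]))
    return best, sharpe(category_rets[best][half:T])

rng = np.random.default_rng(2)
rets = {
    "macro": rng.normal(0.005, 0.01, 1000),     # placeholder: strongly priced
    "earnings": rng.normal(0.000, 0.01, 1000),  # placeholder: unpriced
}
best, oos_sharpe = split_sample_selection(rets)
```
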
Circularity Check
No significant circularity: derivation relies on external LLM labels and genuine out-of-sample portfolio tests
full rationale
The paper's core chain—LLM classification of jump narratives to reveal premia heterogeneity, followed by construction of an annually rebalanced Fama-MacBeth mimicking portfolio and evaluation of its out-of-sample Sharpe and alphas—does not reduce to self-definition or fitted inputs by construction. The LLM step is an external measurement tool whose accuracy is assumed rather than derived from the target results; the portfolio selection uses in-sample estimates but the performance claims are explicitly out-of-sample and therefore falsifiable on future data. No equations equate a 'prediction' to its own fitted parameters, no self-citations bear the central load, and no ansatz or renaming is smuggled in. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- annual rebalancing frequency
- jump detection threshold and window
axioms (2)
- domain assumption: LLM classifications of news narratives accurately reflect the true economic cause of each price jump
- standard math: Fama-MacBeth cross-sectional regression recovers priced risk factors
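The second axiom can be illustrated with a minimal two-pass Fama-MacBeth estimator on simulated data in which the true factor premium is known, so recovery can be checked directly (all parameters below are illustrative):

```python
import numpy as np

def fama_macbeth(asset_rets, factor_rets):
    """Two-pass Fama-MacBeth.

    Pass 1: time-series regressions estimate each asset's factor betas.
    Pass 2: a cross-sectional regression of returns on betas each period;
    the time-series average of the slopes estimates the factor premium.
    asset_rets: (T, N) array; factor_rets: (T, K) array.
    """
    T, N = asset_rets.shape
    X = np.column_stack([np.ones(T), factor_rets])
    betas = np.linalg.lstsq(X, asset_rets, rcond=None)[0][1:]   # (K, N)
    B = np.column_stack([np.ones(N), betas.T])                  # (N, 1+K)
    lambdas = np.array([np.linalg.lstsq(B, asset_rets[t], rcond=None)[0]
                        for t in range(T)])
    return lambdas[:, 1:].mean(axis=0)   # average premium per factor

rng = np.random.default_rng(3)
T, N = 2000, 25
f = rng.normal(0.004, 0.02, (T, 1))           # one factor, true premium 40 bp
beta = rng.uniform(0.5, 1.5, N)               # cross-section of exposures
r = beta * f + rng.normal(0, 0.01, (T, N))    # exact factor model, no alpha
lam = fama_macbeth(r, f)
```

The standard error of the period-by-period slopes gives the classic Fama-MacBeth t-statistic for whether the premium is priced.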