MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios
Pith reviewed 2026-06-26 00:57 UTC · model grok-4.3
The pith
MacroLens is presented as the first public benchmark to combine price history, accounting fundamentals, macroeconomic series, and text in one point-in-time panel for seven contextual financial tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MacroLens constructs a single point-in-time panel over 4,416 equities that correctly gates all text by publication date, applies the required lags to quarterly fundamentals, removes redundancy between filing text and numerical fields, and prevents macroeconomic regime leakage. That panel enables seven tasks that jointly use prices, 46.8M XBRL facts, 53 macro series, 295,860 SEC filings, 215,882 news articles, and 1,130 macro events for contextual financial reasoning.
What carries the argument
The point-in-time panel that gates text by publication date, applies one- to ninety-day lags to fundamentals, removes filing redundancy and blocks macro regime leakage across splits.
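The paper's construction code is not reproduced in this review, so the following is only a minimal sketch of what publication-date gating and lagged fundamentals could look like. The tickers, filings, lag values, and the helper `latest_visible` are all hypothetical and are not taken from MacroLens.

```python
from bisect import bisect_right
from datetime import date, timedelta


def latest_visible(items, as_of):
    """Return the payload of the most recent item available on or before as_of.

    `items` is a list of (available_on, payload) tuples sorted by date.
    Returns None when nothing has been published yet.
    """
    dates = [d for d, _ in items]
    i = bisect_right(dates, as_of)
    return items[i - 1][1] if i else None


# Hypothetical filings for one ticker, keyed by publication date.
filings = [
    (date(2024, 2, 15), "10-K FY2023"),
    (date(2024, 5, 9), "10-Q Q1 2024"),
]

# Hypothetical fundamentals: (period_end, reporting_lag_days, revenue).
raw_facts = [
    (date(2023, 12, 31), 46, 1_000_000),
    (date(2024, 3, 31), 39, 1_200_000),
]
# A fact becomes visible at period_end + lag, never at period_end itself.
facts = sorted((end + timedelta(days=lag), value) for end, lag, value in raw_facts)

as_of = date(2024, 5, 1)
visible_filing = latest_visible(filings, as_of)   # "10-K FY2023"
visible_revenue = latest_visible(facts, as_of)    # 1_000_000
```

On 2024-05-01 the Q1 figures already exist in the company's books, but the sketch withholds them until the assumed reporting date of 2024-05-09, which is the behaviour the gating rule is meant to guarantee.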
If this is right
- Models can be evaluated on realistic multi-signal financial tasks that respect publication timing and reporting lags.
- Ablation results quantify how much each of the four signals contributes to performance on forecasting and valuation.
- The scenario layer permits direct testing of how models respond to specific macroeconomic events rendered as natural language.
- The shared panel allows head-to-head comparison of heuristics, time-series foundation models and zero-shot LLMs on identical data.
Where Pith is reading between the lines
- The automatic macro-event detection layer could be reused to generate scenario text for other asset classes or geographies.
- Strong performance on MacroLens tasks may indicate better handling of look-ahead bias in live trading or investment systems.
- Extending the panel construction rules to non-U.S. equities would test whether the same four-signal integration works outside the current market.
Load-bearing premise
The point-in-time panel construction correctly gates every text item by its actual publication date and applies the stated lags without leakage.
What would settle it
Any instance in the released dataset where a filing, news article or macro series appears before its real-world publication date would show the panel construction failed.
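Such a check could be run mechanically over the released panel. The sketch below assumes, hypothetically, that each panel row exposes an `as_of` date and the `published_on` date of the item attached to it; the actual MacroLens column names are not given in this review.

```python
from datetime import date


def find_leaks(rows):
    """Return rows whose attached item was published after the row's as_of date."""
    return [r for r in rows if r["published_on"] > r["as_of"]]


# Hypothetical panel rows; field names are assumptions for illustration.
panel_rows = [
    {"ticker": "AAA", "as_of": date(2024, 5, 1),
     "item": "10-K FY2023", "published_on": date(2024, 2, 15)},
    {"ticker": "AAA", "as_of": date(2024, 5, 1),
     "item": "10-Q Q1 2024", "published_on": date(2024, 5, 9)},
]

leaks = find_leaks(panel_rows)
```

An empty `leaks` list over the full dataset would be consistent with correct gating; a single non-empty result would be the counterexample described above.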
Original abstract
Financial decision-making is contextual: forecasting prices, valuing companies, and assessing event exposure weigh price history, accounting fundamentals, macroeconomic regime, and contemporaneous text. A benchmark over these four signals is hard to build because finance violates four assumptions of time-series evaluation: text must be gated by its publication date to prevent look-ahead, quarterly fundamentals are reported with a one- to ninety-day lag, filing text is partly redundant with the numerical statement fields it accompanies, and macroeconomic regimes leak across calendar splits. No public benchmark addresses all four signals jointly. MacroLens covers 4,416 U.S. small- and micro-cap equities over 2021-2026. Seven tasks share one point-in-time panel of prices, 46.8M XBRL accounting facts, 53 macroeconomic series, 295,860 SEC filings, and 215,882 news articles, plus a scenario layer of 1,130 macroeconomic events across 49 types automatically detected and rendered as natural language. Tasks span contextual forecasting, public and private valuation, statement generation from fundamentals and descriptions, scenario-conditioned returns, and real-estate valuation. We evaluate 19 methods across six families spanning naive heuristics through time-series foundation models, fine-tuned LLM-based time-series models, and zero-shot large language models (LLMs), plus a five-step feature-context ablation on two frontier LLMs and a gradient-boosted baseline. MacroLens is released at https://huggingface.co/datasets/DeepAuto-AI/MacroLens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce MacroLens, the first public multi-task benchmark jointly addressing price history, XBRL accounting facts, macroeconomic series, and text from SEC filings/news for contextual financial reasoning under macro scenarios. It covers 4,416 U.S. small- and micro-cap equities (2021-2026) via a shared point-in-time panel (46.8M XBRL facts, 53 macro series, 295,860 filings, 215,882 news items) plus 1,130 detected macro events, with seven tasks spanning forecasting, valuation, statement generation, and scenario-conditioned returns. The work evaluates 19 methods across six families and releases the dataset on Hugging Face.
Significance. If the point-in-time panel is correctly constructed, MacroLens would be a significant contribution by supplying the first public artifact that simultaneously handles all four finance-specific time-series violations (look-ahead on text, reporting lags, text-numeric redundancy, macro-regime leakage) within one panel. The scale, multi-task design, scenario layer, and broad baseline evaluation (naive heuristics through LLMs and time-series foundation models) plus public release are clear strengths that would enable more realistic model assessment in financial ML.
major comments (1)
- [Abstract] Abstract: The headline claim that MacroLens 'jointly addresses all four signals' and that 'no public benchmark' previously did so is load-bearing on the assertion that the shared point-in-time panel correctly implements the four gating rules (publication-date gating of every filing/news item, 1-90 day lags on each XBRL fact, removal of text redundant with numerical fields, and no macro-regime leakage across calendar splits). No pseudocode, SQL logic, algorithm description, or small-scale audit (e.g., example of the latest allowed filing for a given ticker-date) is supplied, despite every downstream task inheriting from this panel.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the centrality of the point-in-time panel construction. We agree that the manuscript must supply explicit implementation details to support the headline claims. We will revise accordingly.
- Referee: [Abstract] Abstract: The headline claim that MacroLens 'jointly addresses all four signals' and that 'no public benchmark' previously did so is load-bearing on the assertion that the shared point-in-time panel correctly implements the four gating rules (publication-date gating of every filing/news item, 1-90 day lags on each XBRL fact, removal of text redundant with numerical fields, and no macro-regime leakage across calendar splits). No pseudocode, SQL logic, algorithm description, or small-scale audit (e.g., example of the latest allowed filing for a given ticker-date) is supplied, despite every downstream task inheriting from this panel.
Authors: We accept the referee's observation. The current version of the manuscript describes the four gating rules at a high level but does not include the concrete implementation artifacts requested. In the revised manuscript we will add a new subsection (Section 3.2) that (i) provides pseudocode for the four gating procedures, (ii) supplies the exact SQL-style logic used to assemble the shared panel, and (iii) includes a worked audit example that lists, for a representative ticker-date pair, the latest permissible filing, the applied lag window on each XBRL fact, the removed redundant text fields, and the macro-regime split used. These additions will allow independent verification that every downstream task inherits from a correctly gated panel. revision: yes
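Of the four gating rules, the macro-regime split is the least specified in the abstract. One common way to limit leakage across a calendar split is to drop an embargo window before the cutoff; the sketch below illustrates that generic technique only. It is an assumption, not the authors' stated procedure, and the cutoff and window length are hypothetical.

```python
from datetime import date, timedelta


def assign_split(d, cutoff, embargo_days):
    """Label a date 'train', 'test', or 'embargo' relative to a calendar cutoff.

    Dates in the embargo window (the `embargo_days` before the cutoff) are
    dropped so that training samples do not sit directly against the test period.
    """
    if d < cutoff - timedelta(days=embargo_days):
        return "train"
    if d >= cutoff:
        return "test"
    return "embargo"


cutoff = date(2025, 1, 1)  # hypothetical split date
labels = [
    assign_split(date(2024, 11, 1), cutoff, 30),   # "train"
    assign_split(date(2024, 12, 15), cutoff, 30),  # "embargo"
    assign_split(date(2025, 1, 1), cutoff, 30),    # "test"
]
```

A fixed embargo does not by itself guarantee that train and test fall in different macroeconomic regimes, which is why the promised audit example in the revision would still be needed.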
Circularity Check
No circularity; dataset benchmark with no derivations or fitted predictions
The paper releases a multi-task benchmark dataset (MacroLens) covering equities, XBRL facts, macro series, filings, and news with asserted point-in-time properties. No mathematical derivations, equations, parameter fitting, predictions, or first-principles results appear in the abstract or described content. The contribution is the data artifact and task definitions themselves; no step reduces by construction to prior inputs, self-citations, or ansatzes. This is a standard empirical benchmark release and receives score 0.