MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios

Heejun Lee; Jay Heo; Jin Myung Kwak; Patara Trirat; Sung Ju Hwang

arxiv: 2606.24950 · v1 · pith:TK2VZITLnew · submitted 2026-06-23 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-26 00:57 UTC · model grok-4.3


keywords: financial benchmark · macroeconomic scenarios · contextual financial reasoning · multi-task benchmark · point-in-time panel · XBRL accounting · SEC filings · large language models


The pith

MacroLens is the first public benchmark to combine price history, accounting fundamentals, macroeconomic series and text for seven contextual financial tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Financial decision-making weighs price history, accounting fundamentals, macroeconomic regime and contemporaneous text, yet no prior benchmark handles all four signals together. Finance breaks standard time-series rules because text must be gated by publication date, fundamentals arrive with one- to ninety-day lags, filing text overlaps numerical fields, and macroeconomic regimes leak across calendar splits. MacroLens supplies one point-in-time panel covering 4,416 U.S. small- and micro-cap equities from 2021-2026 that contains prices, 46.8 million XBRL facts, 53 macro series, 295,860 SEC filings, 215,882 news articles and 1,130 automatically detected macro events. Seven tasks test contextual forecasting, valuation, statement generation, scenario-conditioned returns and real-estate valuation. Nineteen methods spanning heuristics, time-series models and large language models are evaluated on the shared panel together with feature-context ablations.

Core claim

MacroLens constructs a single point-in-time panel over 4,416 equities that correctly gates all text by publication date, applies the required lags to quarterly fundamentals, removes redundancy between filings and numerical fields, and prevents macroeconomic regime leakage, thereby enabling seven tasks that jointly use prices, 46.8M XBRL facts, 53 macro series, 295,860 SEC filings, 215,882 news articles and 1,130 macro events for contextual financial reasoning.

What carries the argument

The point-in-time panel that gates text by publication date, applies one- to ninety-day lags to fundamentals, removes filing redundancy and blocks macro regime leakage across splits.
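
The gating idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Fact` record, its field names, and the `latest_known` helper are hypothetical, and treating a fact as knowable on the same day it is published is an assumption. The lookup is keyed on the availability date rather than the fiscal period end, which is how a reporting lag is respected.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One reported value (hypothetical schema, not the MacroLens one)."""
    ticker: str
    period_end: date     # fiscal period the value describes
    available_on: date   # publication date, i.e. when the value became knowable
    value: float


def latest_known(facts: list, ticker: str, anchor: date) -> Optional[Fact]:
    """Return the most recently published fact for `ticker` knowable on `anchor`."""
    # Sort by availability, not by period end, so the reporting lag is respected.
    rows = sorted((f for f in facts if f.ticker == ticker),
                  key=lambda f: f.available_on)
    # bisect_right counts facts with available_on <= anchor (same-day assumed OK).
    idx = bisect_right([f.available_on for f in rows], anchor)
    return rows[idx - 1] if idx else None


facts = [
    Fact("ABC", date(2023, 12, 31), date(2024, 2, 20), 10.0),
    Fact("ABC", date(2024, 3, 31), date(2024, 5, 10), 12.0),
]

# On 2024-04-15 the Q1 period has ended but its value is not yet public,
# so the lookup falls back to the fact published on 2024-02-20.
known = latest_known(facts, "ABC", date(2024, 4, 15))
```

In this toy data the quarter ending 2024-03-31 is invisible until 2024-05-10, so an instance anchored in April only sees the earlier value.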

If this is right

Models can be evaluated on realistic multi-signal financial tasks that respect publication timing and reporting lags.
Ablation results quantify how much each of the four signals contributes to performance on forecasting and valuation.
The scenario layer permits direct testing of how models respond to specific macroeconomic events rendered as natural language.
The shared panel allows head-to-head comparison of heuristics, time-series foundation models and zero-shot LLMs on identical data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The automatic macro-event detection layer could be reused to generate scenario text for other asset classes or geographies.
Strong performance on MacroLens tasks may indicate better handling of look-ahead bias in live trading or investment systems.
Extending the panel construction rules to non-U.S. equities would test whether the same four-signal integration works outside the current market.

Load-bearing premise

The point-in-time panel construction correctly gates every text item by its actual publication date and applies the stated lags without leakage.

What would settle it

Any instance in the released dataset where a filing, news article or macro series appears before its real-world publication date would show the panel construction failed.
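
A check of this kind could be sketched as follows. The row layout and the key names (`item_id`, `anchor`, `published`) are hypothetical and do not reflect the released dataset's schema; the point is only that a single row where the publication date falls after the anchor date would count as a violation.

```python
from datetime import date


def find_leaks(rows):
    """Return rows whose item was used at an anchor date before it was public.

    Each row is a dict with hypothetical keys: 'item_id', 'anchor' (decision
    date of the instance) and 'published' (real-world release date).
    """
    return [r for r in rows if r["published"] > r["anchor"]]


sample = [
    {"item_id": "news-1", "anchor": date(2024, 3, 1), "published": date(2024, 2, 28)},
    {"item_id": "10Q-7", "anchor": date(2024, 3, 1), "published": date(2024, 3, 4)},
]

# Only the second row is a leak: it was published three days after the anchor.
leaks = find_leaks(sample)
```

An empty result over the whole panel would be consistent with the premise, while any non-empty result would refute it.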

Figures

Figures reproduced from arXiv: 2606.24950 by Heejun Lee, Jay Heo, Jin Myung Kwak, Patara Trirat, Sung Ju Hwang.

**Figure 1.** Left: Positioning of MacroLens against existing benchmarks. Right: A MacroLens instance at anchor date t, showing the four input types and the seven tasks derived from the same evidence. … inputs must be gated by their release date. Quarterly fundamentals are reported with a one- to ninety-day lag, so a feature observed at calendar time t may not have been knowable at decision time t. Filing text is domain-s…

**Figure 2.** MacroLens construction pipeline: universe definition, per-source ingestion, four build-time invariants, panel and scenario assembly, and per-task ground-truth construction. A MacroLens instance is the tuple $\langle i, t, g, x_{i,t-L:t,g}, z_i, s_t, u_{i,\le t} \rangle$ paired with a task-specific target $y_{i,t}$. Setting any optional input to $\emptyset$ yields a natural modality ablation. Forecasting tasks score on chronological splits; valuat…

**Figure 3.** Context ablation across the A→E feature ladder on the two zero-shot frontier LLMs (GPT-5.1, Gemini-3-Flash). One panel per task: T1 MSE at h = 252 (log scale), T2 / T5 MedAPE, T4 return MAE. Each curve traces a single LLM’s primary metric across the five nested feature settings. The results isolate which signal channels each task is sensitive to and quantify within-model monotonicity (whether adding contex…
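
The ablation-by-omission idea in the Figure 2 caption can be sketched as a small data structure. This is an illustration under stated assumptions, not the benchmark's actual schema: the field names simply reuse the caption's symbols, the comments on what each symbol holds are guesses, and the choice of which inputs are optional (`z`, `s`, `u`) is assumed rather than taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instance:
    """Hypothetical container mirroring the tuple in the Figure 2 caption."""
    i: str                              # entity identifier
    t: str                              # anchor date as an ISO string
    g: str                              # third index from the caption (meaning assumed)
    x: Sequence[float]                  # numeric history window
    z: Optional[dict] = None            # assumed optional input
    s: Optional[str] = None             # assumed optional input
    u: Optional[Sequence[str]] = None   # assumed optional input


OPTIONAL = ("z", "s", "u")  # assumption: only these inputs may be dropped


def ablate(inst: Instance, *drop: str) -> Instance:
    """Return a copy with the named optional inputs set to None (the empty input)."""
    unknown = set(drop) - set(OPTIONAL)
    if unknown:
        raise ValueError(f"not optional inputs: {sorted(unknown)}")
    return replace(inst, **{name: None for name in drop})


full = Instance("ABC", "2024-03-01", "daily", [1.0, 2.0],
                z={"sector": "tech"}, s="rate hike", u=["headline"])

# Drop every optional input to obtain a numeric-history-only variant.
history_only = ablate(full, "z", "s", "u")
```

Because the instance is immutable and `ablate` returns a copy, each ablated variant shares the same anchor date and target, so differences in score can be attributed to the removed input.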

Abstract

Financial decision-making is contextual: forecasting prices, valuing companies, and assessing event exposure weigh price history, accounting fundamentals, macroeconomic regime, and contemporaneous text. A benchmark over these four signals is hard to build because finance violates four assumptions of time-series evaluation: text must be gated by its publication date to prevent look-ahead, quarterly fundamentals are reported with a one- to ninety-day lag, filing text is partly redundant with the numerical statement fields it accompanies, and macroeconomic regimes leak across calendar splits. No public benchmark addresses all four signals jointly. MacroLens covers 4,416 U.S. small- and micro-cap equities over 2021-2026. Seven tasks share one point-in-time panel of prices, 46.8M XBRL accounting facts, 53 macroeconomic series, 295,860 SEC filings, and 215,882 news articles, plus a scenario layer of 1,130 macroeconomic events across 49 types automatically detected and rendered as natural language. Tasks span contextual forecasting, public and private valuation, statement generation from fundamentals and descriptions, scenario-conditioned returns, and real-estate valuation. We evaluate 19 methods across six families spanning naive heuristics through time-series foundation models, fine-tuned LLM-based time-series models, and zero-shot large language models (LLMs), plus a five-step feature-context ablation on two frontier LLMs and a gradient-boosted baseline. MacroLens is released at https://huggingface.co/datasets/DeepAuto-AI/MacroLens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MacroLens puts together a broad financial benchmark with real timing constraints, but the point-in-time panel is asserted without any code, audit, or example that would let a reader check the four gating rules.


The paper's main contribution is the release of MacroLens, a dataset covering 4,416 small- and micro-cap stocks from 2021-2026 that bundles prices, 46.8M XBRL facts, 53 macro series, filings, news, and 1,130 detected macro events into one panel for seven tasks. That integration across signals while trying to respect publication dates, reporting lags, redundancy removal, and split leakage is the part that stands out as new; prior benchmarks usually drop at least one of those constraints.

The work does a reasonable job laying out the seven tasks and running a wide set of baselines from heuristics to LLMs and time-series models, plus an ablation on context. The scenario layer and automatic event detection add a practical angle that could be useful for testing contextual reasoning.

The soft spot is exactly where the stress-test flags it: the abstract claims the panel correctly applies all four timing rules, yet neither the abstract nor the supplied description gives pseudocode, SQL, or even one concrete example of how a filing or XBRL fact was gated on a given date. Because every task inherits that panel, the central claim that the benchmark jointly handles the four violations rests on an unverified assertion. Without that evidence the results are difficult to trust.

This is for researchers building or evaluating financial time-series or LLM systems who need multi-signal data. It deserves a serious referee so the authors can supply the missing construction details and any validation they performed; if those check out the benchmark could be worth adopting.

Referee Report

1 major / 0 minor

Summary. The paper claims to introduce MacroLens, the first public multi-task benchmark jointly addressing price history, XBRL accounting facts, macroeconomic series, and text from SEC filings/news for contextual financial reasoning under macro scenarios. It covers 4,416 U.S. small- and micro-cap equities (2021-2026) via a shared point-in-time panel (46.8M XBRL facts, 53 macro series, 295,860 filings, 215,882 news items) plus 1,130 detected macro events, with seven tasks spanning forecasting, valuation, statement generation, and scenario-conditioned returns. The work evaluates 19 methods across six families and releases the dataset on Hugging Face.

Significance. If the point-in-time panel is correctly constructed, MacroLens would be a significant contribution by supplying the first public artifact that simultaneously handles all four finance-specific time-series violations (look-ahead on text, reporting lags, text-numeric redundancy, macro-regime leakage) within one panel. The scale, multi-task design, scenario layer, and broad baseline evaluation (naive heuristics through LLMs and time-series foundation models) plus public release are clear strengths that would enable more realistic model assessment in financial ML.

major comments (1)

[Abstract] The headline claims that MacroLens 'jointly addresses all four signals' and that 'no public benchmark' previously did so rest on the assertion that the shared point-in-time panel correctly implements the four gating rules (publication-date gating of every filing/news item, 1-90 day lags on each XBRL fact, removal of text redundant with numerical fields, and no macro-regime leakage across calendar splits). No pseudocode, SQL logic, algorithm description, or small-scale audit (e.g., an example of the latest allowed filing for a given ticker-date) is supplied, even though every downstream task inherits from this panel.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the centrality of the point-in-time panel construction. We agree that the manuscript must supply explicit implementation details to support the headline claims. We will revise accordingly.


Referee: [Abstract] The headline claims that MacroLens 'jointly addresses all four signals' and that 'no public benchmark' previously did so rest on the assertion that the shared point-in-time panel correctly implements the four gating rules (publication-date gating of every filing/news item, 1-90 day lags on each XBRL fact, removal of text redundant with numerical fields, and no macro-regime leakage across calendar splits). No pseudocode, SQL logic, algorithm description, or small-scale audit (e.g., an example of the latest allowed filing for a given ticker-date) is supplied, even though every downstream task inherits from this panel.

Authors: We accept the referee's observation. The current version of the manuscript describes the four gating rules at a high level but does not include the concrete implementation artifacts requested. In the revised manuscript we will add a new subsection (Section 3.2) that (i) provides pseudocode for the four gating procedures, (ii) supplies the exact SQL-style logic used to assemble the shared panel, and (iii) includes a worked audit example that lists, for a representative ticker-date pair, the latest permissible filing, the applied lag window on each XBRL fact, the removed redundant text fields, and the macro-regime split used. These additions will allow independent verification that every downstream task inherits from a correctly gated panel.

revision: yes
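
For orientation, the kind of SQL-style audit the referee asks for might look like the sketch below. It is not the authors' logic, which the manuscript does not yet show: the `filings` table, its columns, and the sample rows are hypothetical, and dates are stored as ISO strings so that string comparison matches chronological order.

```python
import sqlite3

# In-memory toy table; schema and rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (ticker TEXT, form TEXT, filed_on TEXT)")
conn.executemany(
    "INSERT INTO filings VALUES (?, ?, ?)",
    [
        ("ABC", "10-Q", "2023-11-09"),
        ("ABC", "10-K", "2024-03-20"),
        ("ABC", "10-Q", "2024-05-10"),
    ],
)


def latest_permissible(conn, ticker, anchor):
    """Return (form, filed_on) of the newest filing dated on or before `anchor`."""
    return conn.execute(
        "SELECT form, filed_on FROM filings "
        "WHERE ticker = ? AND filed_on <= ? "
        "ORDER BY filed_on DESC LIMIT 1",
        (ticker, anchor),
    ).fetchone()


# For the ticker-date pair (ABC, 2024-04-15) the May 10-Q must stay hidden.
audit_row = latest_permissible(conn, "ABC", "2024-04-15")
```

A worked audit of this shape, repeated for a handful of ticker-date pairs, would let a reader confirm by hand that no later filing is visible at the anchor date.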

Circularity Check

0 steps flagged

No circularity; dataset benchmark with no derivations or fitted predictions


The paper releases a multi-task benchmark dataset (MacroLens) covering equities, XBRL facts, macro series, filings, and news with asserted point-in-time properties. No mathematical derivations, equations, parameter fitting, predictions, or first-principles results appear in the abstract or described content. The contribution is the data artifact and the task definitions themselves; no step reduces by construction to prior inputs, self-citations, or ansatzes. This is a standard empirical benchmark release and receives a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a benchmark construction and dataset release with no free parameters, mathematical axioms, or invented entities; it relies on standard data-processing practices for time-gating and lag application.

pith-pipeline@v0.9.1-grok · 5815 in / 1253 out tokens · 30288 ms · 2026-06-26T00:57:59.674563+00:00 · methodology



[1] [1]

arXiv preprint arXiv:2508.10925 , year=

gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=

Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2501.08313 , year=

Minimax-01: Scaling foundation models with lightning attention , author=. arXiv preprint arXiv:2501.08313 , year=

Pith/arXiv arXiv

[3] [3]

arXiv preprint arXiv:2405.04434 , year=

Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model , author=. arXiv preprint arXiv:2405.04434 , year=

Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2601.01739 , year=

K-EXAONE Technical Report , author=. arXiv preprint arXiv:2601.01739 , year=

arXiv

[5] [5]

arXiv preprint arXiv:2508.06471 , year=

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , author=. arXiv preprint arXiv:2508.06471 , year=

Pith/arXiv arXiv

[6] [6]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv

[7] [7]

The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal

[8] [8]

2025 , eprint=

Gemma 3 Technical Report , author=. 2025 , eprint=

2025

[9] [9]

Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

Monash Time Series Forecasting Archive , author=. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

[10] [10]

International Journal of forecasting , volume=

The M4 Competition: Results, findings, conclusion and way forward , author=. International Journal of forecasting , volume=

[11] [11]

International Journal of Forecasting , volume=

M5 accuracy competition: Results, findings, and conclusions , author=. International Journal of Forecasting , volume=

[12] [12]

International Journal of Forecasting , volume=

Another look at measures of forecast accuracy , author=. International Journal of Forecasting , volume=. 2006 , publisher=

2006

[13] [13]

Journal of the American statistical Association , volume=

Strictly proper scoring rules, prediction, and estimation , author=. Journal of the American statistical Association , volume=

[14] [14]

International journal of forecasting , volume=

DeepAR: Probabilistic forecasting with autoregressive recurrent networks , author=. International journal of forecasting , volume=

[15] A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.

[16] N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. International Conference on Learning Representations.

[17] Are transformers effective for time series forecasting? Proceedings of the AAAI Conference on Artificial Intelligence.

[18] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. The Eleventh International Conference on Learning Representations.

[19] iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. The Twelfth International Conference on Learning Representations.

[20] Sundial: A Family of Highly Capable Time Series Foundation Models. Forty-second International Conference on Machine Learning.

[21] Chronos: Learning the Language of Time Series. Transactions on Machine Learning Research. 2024.

[22] Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821.

[23] Unified training of universal time series forecasting transformers. Forty-first International Conference on Machine Learning.

[24] Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts. Forty-second International Conference on Machine Learning.

[25] Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698.

[26] A decoder-only foundation model for time-series forecasting. Forty-first International Conference on Machine Learning.

[27] TimeGPT-1. arXiv preprint arXiv:2310.03589.

[28] Taha Aksu and Gerald Woo and Juncheng Liu and Xu Liu and Chenghao Liu and Silvio Savarese and Caiming Xiong and Doyen Sahoo.

[29] Li, Zhe and Qiu, Xiangfei and Chen, Peng and Wang, Yihang and Cheng, Hanyin and Shu, Yang and Hu, Jilin and Guo, Chenjuan and Zhou, Aoying and Jensen, Christian S. and Yang, Bin. 2025.

[30] Context is Key: A Benchmark for Forecasting with Essential Textual Information. Forty-second International Conference on Machine Learning.

[31] What If TSF: A Benchmark for Reframing Forecasting as Scenario-Guided Multimodal Forecasting. arXiv preprint arXiv:2601.08509.

[32] Intervention-aware forecasting: Breaking historical limits from a system perspective. arXiv preprint arXiv:2405.13522.

[33] Xu, Yumo and Cohen, Shay B. Stock Movement Prediction from Tweets and Historical Prices. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

[34] Dong, Zihan and Fan, Xinyu and Peng, Zhiyuan. 2024.

[35] Qianqian Xie and Weiguang Han and Xiao Zhang and Yanzhao Lai and Min Peng and Alejandro Lopez-Lira and Jimin Huang.

[36] FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944.

[37] FinBen: An Holistic Financial Benchmark for Large Language Models. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[38] BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564.

[39] FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv preprint arXiv:1908.10063.

[40] FinGPT: Open-Source Financial Large Language Models. arXiv preprint arXiv:2306.06031.

[41] Qiu, Xiangfei and Hu, Jilin and Zhou, Lekui and Wu, Xingjian and Du, Junyang and Zhang, Buang and Guo, Chenjuan and Zhou, Aoying and Jensen, Christian S. and Sheng, Zhenli and Yang, Bin. Proc. VLDB Endow. 2024.

[42] Multi-modal forecaster: Jointly predicting time series and textual data. arXiv preprint arXiv:2411.06735.

[43] Haoxin Liu and Shangqing Xu and Zhiyuan Zhao and Lingkai Kong and Harshavardhan Kamarthi and Aditya B. Sasanur and Megha Sharma and Jiaming Cui and Qingsong Wen and Chao Zhang and B. Aditya Prakash. Time-

[44] Koval, Ross and Andrews, Nicholas and Yan, Xifeng. Financial Forecasting from Textual and Tabular Time Series. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024.

[45] Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series. arXiv preprint arXiv:2509.19628.

[46] Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings. arXiv preprint arXiv:2602.07294.

[47] FinTSB: A comprehensive and practical benchmark for financial time series forecasting. arXiv preprint arXiv:2502.18834.

[48] Datasheets for datasets. Communications of the ACM.

[49] Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics.

[50] Mubashara Akhtar and Omar Benjelloun and Costanza Conforti and Luca Foschini and Joan Giner-Miguelez and Pieter Gijsbers and Sujata Goswami and Nitisha Jain and Michalis Karamousadakis and Michael Kuchnik and Satyapriya Krishna and Sylvain Lesage and Quentin Lhoest and Pierre Marcenac and Manil Maskey and Peter Mattson and Luis Oala and Hamidah Oderinwale... Croissant: A Metadata Format for

[51] GDELT: Global data on events, location, and tone, 1979--2012. ISA Annual Convention. 2013.

[52] FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics.

[53] FRED-QD: A quarterly database for macroeconomic research. 2020.

[54] Measuring economic policy uncertainty. The Quarterly Journal of Economics.

[55] Financial Statement Data Sets (XBRL) --- documentation and downloads.

[56] A deep fusion model for stock market prediction with news headlines and time series data. Neural Computing and Applications. 2024.

[57] Another look at measures of forecast accuracy. International Journal of Forecasting.

[58] Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007.

[59] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. The Thirteenth International Conference on Learning Representations.

[60] Yong Liu and Guo Qin and Xiangdong Huang and Jianmin Wang and Mingsheng Long. Timer-

[61] Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

[62] Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-augmented generation for knowledge-intensive NLP tasks.

[63] Long short-term memory. Neural Computation.

[64] A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the Econometric Society. 1989.

[65] Event studies in economics and finance. Journal of Economic Literature. 1997.

[66] Luo Donghao and Wang Xue. Modern

[67] Li, Haohang and Cao, Yupeng and Yu, Yangyang and Javaji, Shashidhar Reddy and Deng, Zhiyang and He, Yueru and Jiang, Yuechen and Zhu, Zining and Subbalakshmi, K.p. and Huang, Jimin and Qian, Lingfei and Peng, Xueqing and Suchow, Jordan W. and Xie, Qianqian. INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent. Proceedings ... 2025.

[68] Issa Sugiura and Takashi Ishida and Taro Makino and Chieko Tazuke and Takanori Nakagawa and Kosuke Nakago and David Ha.

[69] Wen Wu and Ziyang Zhang and Liwei Liu and Xuenan Xu and Jimin Zhuang and Ke Fan and Qitan Lv and Junlin Liu and Chen Zhang and Zheqi Yuan and Siyuan Hou and Tianyi Lin and Kai Chen and Bowen Zhou and Chao Zhang. Sci

[70] MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering. arXiv preprint arXiv:2503.16858.

[71] LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems.

[72] ChatTime: A unified multimodal time series foundation model bridging numerical and textual data. Proceedings of the AAAI Conference on Artificial Intelligence.

[73] Time-MQA: Time series multi-task question answering with context enhancement. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[74] Are transformers effective for time series forecasting? Proceedings of the AAAI Conference on Artificial Intelligence.

[75] Context-Aware Deep Time-Series Decomposition for Anomaly Detection in Businesses. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2023.

[76] Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics. 2008.