OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking

2026 · cs.AI · arXiv 2605.03762

abstract

Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action-oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting capability. OracleProto reconstructs resolved events into time-bounded forecasting samples by combining model-cutoff-aligned sample admission, tool-level temporal masking, content-level leakage detection, discrete answer normalization, and hierarchical scoring. Instantiated on a FutureX-Past-derived dataset with six contemporary LLMs, OracleProto distinguishes forecasting quality, sampling stability, and cost efficiency under controlled information boundaries, while reducing residual leakage to the $1\%$ level, an order of magnitude below tool-only temporal filtering. OracleProto turns LLM forecasting from one-off evaluation into an auditable, reusable, and trainable dataset-level capability, providing a unified interface for fair cross-model comparison and a controlled signal source for downstream SFT and RL. Code and data are available at https://github.com/MaYiding/OracleProto and https://huggingface.co/datasets/MaYiding/OracleProto.
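The first two stages named in the abstract, model-cutoff-aligned sample admission and tool-level temporal masking, can be illustrated with a minimal sketch. This is not the OracleProto implementation: the `Event` and `Document` types, their field names, the `admit` and `mask_results` helpers, and the specific rules (admit an event only if it resolves after the model's knowledge cutoff; hide tool results dated on or after the question date) are all assumptions made for illustration. Content-level leakage detection, answer normalization, and scoring are omitted.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """A resolved event recast as a forecasting sample (hypothetical schema)."""
    question: str
    asked_on: date      # simulated date on which the forecast is made
    resolved_on: date   # date on which the answer became known


@dataclass(frozen=True)
class Document:
    """A retrieval or search result returned by a tool (hypothetical schema)."""
    text: str
    published_on: date


def admit(event: Event, model_cutoff: date) -> bool:
    """Assumed admission rule: keep only events that resolve after the
    model's knowledge cutoff, so the answer cannot be in pretraining data."""
    return event.resolved_on > model_cutoff


def mask_results(docs: list[Document], event: Event) -> list[Document]:
    """Assumed masking rule: drop tool results published on or after the
    simulated question date."""
    return [d for d in docs if d.published_on < event.asked_on]


# Illustrative values only; none of these come from the paper.
cutoff = date(2025, 6, 1)
event = Event("Will X happen by 2025-09-30?", date(2025, 8, 1), date(2025, 9, 30))
docs = [
    Document("early report", date(2025, 7, 15)),
    Document("post-resolution recap", date(2025, 10, 2)),
]

admitted = admit(event, cutoff)
visible = mask_results(docs, event)
```

Date filtering of this kind only inspects metadata. A page published before the question date can still be edited later or carry a wrong timestamp, which is presumably why the abstract pairs tool-level masking with a separate content-level leakage check.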

representative citing papers

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

KTD-Fin benchmark with data masking and return attribution shows frontier LLM agents on CSI300 generate returns mainly from market and style exposure rather than persistent stock-selection alpha.
