NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

Anany Kotawala

arXiv: 2605.30393 · v1 · pith:HENXVC6Inew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · cs.CR

Pith reviewed 2026-06-29 08:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CR

keywords NumLeak · memorization · foundation models · Fama-French · evaluation contamination · public numeric benchmarks · LLM recall · date-conditioned probes

The pith

Frontier LLMs recall Fama-French market excess returns at Pearson r of 0.97-0.99 from pretraining data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that public numeric benchmarks such as Fama-French factors, unemployment rates, CPI, and temperature series appear in pretraining corpora, so models can recall their exact values when given dates. NumLeak detects this by running date-conditioned probes on production APIs and controlled white-box tests, revealing near-perfect correlations on answered cases and sharp drops in parse rate (that is, more refusals) on recent holdouts. The pattern matches a memorization channel: parse rates fall on unseen months while correlations stay high on what is answered, and logprob methods catch more leakage than open generation. If correct, evaluations that treat these public series as fresh labels are measuring recall instead of out-of-sample skill, and simple prompt defenses can block many extraction attacks at low cost to other uses.

Core claim

Top-tier frontier LLMs recall the Fama-French market excess return (Mkt-RF) at a 3-seed pooled Pearson r=0.97-0.99, while the within-25bps recall rate stays at or below 0.15 on the five sibling factors; comparable fidelity appears on U.S. unemployment, CPI inflation, and NOAA temperature. On a recent-release holdout, the parse rate collapses to 21-57% but r stays at approximately 0.99 on the months answered, which is the refuse-or-recall asymmetry a memorized channel predicts. The white-box experiment reproduces the dose-response, and logprob ranking detects memorization that open-ended generation misses, implying that closed-API black-box probes understate the channel. A Sonnet date-to-market-sentiment regression that correlates with true Mkt-RF at r=0.74 collapses to r=0.02 once the model's own recall is residualized out.

What carries the argument

NumLeak, a measurement framework that combines API-boundary probes on production models with white-box controlled validation on an open causal LM to isolate memorization of public numeric series.

If this is right

Date-conditioned evaluations on public economic and financial series primarily test recall of pretraining data.
Black-box API probes understate the memorization channel compared with logprob ranking inside open models.
A one-line system prompt blocks 99.8 percent of non-adaptive single-turn suffix attacks with near-zero utility cost on narrative queries.
Regressions that rely on model-generated outputs from these series lose nearly all correlation once the recalled values are removed.
The refuse-or-recall asymmetry on holdout dates provides a practical diagnostic for numeric leakage.
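
The refuse-or-recall diagnostic in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes each probe yields either a parsed numeric estimate or `None` (a refusal or unparseable reply), and the names `pearson_r` and `refuse_or_recall` are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def refuse_or_recall(
    estimates: Sequence[Optional[float]], truth: Sequence[float]
) -> Tuple[float, Optional[float]]:
    """Return (parse_rate, conditional_r).

    parse_rate    : share of probed months with a parsed numeric answer.
    conditional_r : Pearson r computed only on the answered months
                    (None when fewer than two months were answered).
    """
    answered = [(e, t) for e, t in zip(estimates, truth) if e is not None]
    parse_rate = len(answered) / len(truth)
    if len(answered) < 2:
        return parse_rate, None
    est = [e for e, _ in answered]
    tru = [t for _, t in answered]
    return parse_rate, pearson_r(est, tru)
```

A low parse rate paired with a high conditional r is the asymmetry the paper attributes to a memorized channel; a model that was inferring values would be expected to answer more often and less accurately.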

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Any widely published numeric time series used as evaluation targets is likely to show similar leakage once probed.
Private or synthetic numeric benchmarks become necessary to measure genuine generalization rather than recall.
The same refuse-or-recall pattern could be applied to other public data domains to diagnose contamination.
Downstream reasoning that draws on these recalled values may occur without explicit date prompts.

Load-bearing premise

The high correlations and refusal patterns arise from memorization of the specific public numbers in pretraining rather than from the models inferring the underlying economic relationships.

What would settle it

Models recalling randomized or non-public versions of the same numeric series at comparable accuracy and correlation would falsify the memorization account.

Figures

Figures reproduced from arXiv: 2605.30393 by Anany Kotawala.

**Figure 1.** The recall channel. Date-conditioned numeric queries can return memorized historical values, contaminating downstream LLM-finance signals. NumLeak diagnoses and mitigates the channel. We introduce NumLeak, a measurement framework with three parts. First, an identification protocol of four diagnostics (formally defined in §2) that characterizes the recall channel from what the model exposes through its A…

**Figure 2.** Mkt-RF value recall is calibrated. Opus and Sonnet align with the 45° line; GPT-5.4 is weaker. Scatter shows the single-seed Variant-A baseline (in-figure n and r are per-seed); Tab. 1 reports the 3-seed pooled values, consistent with these. Haiku 4.5 is excluded because its pooled r=0.57 reflects high seed-to-seed variance (per-seed range 0.24–0.74, App. E) and a single-seed scatter misrepresents the ensembl…

**Figure 4.** Privacy–utility tradeoff per defense. Left: worst-case adversarial parse rate (lower = more private); all three defenses sit at the floor. Right: mean utility per question category (0–4 rubric, 6 queries per category, panel-averaged); conceptual and qualitative-historical knowledge stays at baseline, the cost concentrates on adjacent-numeric (retrieval-only: −1.17 from the no-defense baseline of 4.0). Ful…

**Figure 5.** NumLeak probes pipeline. The input tuple feeds four diagnostic probes (§2); their joint signal anchors the recall measurement, controlled validation, and stress-tested mitigation reported in §3–§5.

**Figure 6.** Variant A parsed estimate vs Kenneth French truth for every (model, factor) cell. Dashed line: perfect recall (45°). Annotations: Pearson r and parsed-estimate count n per cell.

**Figure 7.** Within-25 bps recall rate per (model, factor), computed from each model’s main Variant-A sweep (single-seed-42, parsed-only denominator). Mkt-RF is the only column that recovers monthly values at rates meaningfully above chance, for every model. Haiku’s Mkt-RF cell (18% here) is single-seed; the honest 3-seed pooled value is 12% (see Tab. 1, Sec. E). Other factors stay at ≤ 15% for every cell.

**Figure 8.** Calibration scatter for every (model, probe) cell of the original four models in …

**Figure 9.** Capability-scaled recall across providers. Recall increases with within-provider model tier on Mkt-RF and S&P 500; DeepSeek provides an additional non-U.S. provider check.

**Figure 10.** Variant-A calibration on the two Fama-French factors with any partial recall (SMB, HML) across all four models. Opus shows the cleanest alignment (r=0.44 on SMB, r=0.58 on HML), with weaker but visible HML signal on Sonnet (r=0.48); other cells are noise. Mkt-RF (clean recall on all four) is shown in …

**Figure 11.** Chain-of-thought degrades Sonnet’s Mkt-RF recall. Left: Variant A (green) and Variant D (red) estimates plotted against Kenneth French truth on the months probed under both conditions. Right: per-month absolute error, Variant D (y-axis) versus Variant A (x-axis). Points above the dashed equality line are months where reasoning made the answer worse.

**Figure 12.** Date-conditional sentiment vs. truth Mkt-RF (left) and vs. the model’s own recall estimate (right). Sonnet n=77, Opus n=40. The two slopes per model are nearly identical (+0.066/+0.064 Sonnet, +0.076/+0.078 Opus), the visual identity discussed in §6. Permutation null on the slope. Permuting the (date, truth-Mkt-RF) pairing 10,000 times within each model gives a null 95% interval of [−0.020, +0.020] fo…
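
The permutation null mentioned in the Figure 12 caption can be illustrated with a short sketch. It rests on stated assumptions: the slope is taken from an ordinary least-squares fit of sentiment on truth Mkt-RF, the interval is the central 95% of the permuted slopes, and the helper names `ols_slope` and `permutation_null_interval` are hypothetical. The paper's exact procedure may differ.

```python
import random
from typing import List, Sequence, Tuple


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def permutation_null_interval(
    truth: Sequence[float],
    sentiment: Sequence[float],
    n_perm: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Central 95% interval of slopes after breaking the date pairing."""
    rng = random.Random(seed)
    shuffled: List[float] = list(truth)
    slopes = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign truth values to random dates
        slopes.append(ols_slope(shuffled, sentiment))
    slopes.sort()
    lo = slopes[int(0.025 * n_perm)]
    hi = slopes[int(0.975 * n_perm) - 1]
    return lo, hi
```

An observed slope that falls well outside the returned interval is unlikely to arise from a random date-to-value pairing, which is the comparison the caption describes.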

Original abstract

Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. We introduce NumLeak, a measurement framework that combines API-boundary probes on production models with a white-box controlled validation on an open causal LM. Top-tier frontier LLMs recall the Fama-French market excess return at 3-seed pooled Pearson r=0.97-0.99 while staying within 0.15 within-25bps on the five sibling factors; comparable fidelity appears on U.S. unemployment, CPI inflation, and NOAA temperature. On a recent-release holdout, parse rate collapses to 21-57% but r stays at approximately 0.99 on months answered, the refuse-or-recall asymmetry a memorized channel predicts. The white-box experiment reproduces the dose-response, and logprob ranking detects memorization that open-ended generation misses, implying closed-API black-box probes understate the channel. A Sonnet "date to market-sentiment" regression that correlates with true Mkt-RF at r=0.74 collapses to r=0.02 once the model's own recall is residualized out. A one-line system-prompt defense blocks 99.8% of a non-adaptive single-turn suffix attack set at near-zero utility cost on conceptual and historical-narrative queries.
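
The residualization step in the abstract can be read as follows; this is a sketch of one plausible interpretation, since the exact procedure is not given here. It assumes the sentiment score is regressed on the model's own recalled value by ordinary least squares and that the residual is then correlated with the true series. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residualize(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Residuals of y after an OLS fit on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    alpha = my - beta * mx
    return [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]


def recall_adjusted_r(
    sentiment: Sequence[float], recall: Sequence[float], truth: Sequence[float]
) -> float:
    """Correlation of truth with the part of sentiment not explained by recall."""
    return pearson_r(residualize(sentiment, recall), truth)
```

If the raw correlation between sentiment and truth is high but `recall_adjusted_r` is near zero, the sentiment signal carries little information beyond the recalled number, which is the pattern the abstract reports (r=0.74 falling to r=0.02).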

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NumLeak supplies a credible set of controls showing that frontier LLMs recall exact values from public numeric series like Fama-French rather than inferring them.

The paper's main contribution is a combined black-box and white-box framework that measures numeric leakage on public benchmarks. It reports very high recall correlations on Fama-French factors and similar series, then uses residualization, holdout parse-rate collapse, log-prob ranking, and white-box dose-response to argue the signal comes from pretraining memorization.

What stands out is how the controls line up. The date-to-sentiment regression falls from 0.74 to 0.02 once the model's own numeric recall is removed. On recent holdouts the model refuses most queries but keeps r near 0.99 on the ones it answers. Log-prob catches leakage that open generation misses, and the open causal LM reproduces the pattern. These pieces target the inference-versus-memorization alternative directly and make the central claim internally consistent.

The soft spots are mostly about missing details in the abstract: no error bars, exact query templates, or exclusion rules are shown here, so full reproducibility checks would need the methods. The defense prompt result is a side observation rather than a core claim. Nothing looks load-bearing or circular from the reported tests.

This is useful for anyone running numeric or time-series evaluations on LLMs, especially in finance or macro. A reader who cares about contamination would get concrete measurement ideas from it. It deserves peer review because the evidence stack is tighter than the abstract alone suggests and the controls are orthogonal rather than decorative.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the NumLeak framework to measure memorization of public numeric benchmarks (e.g., Fama-French factors, U.S. unemployment, CPI, NOAA temperatures) in frontier LLMs via API probes and white-box validation on open causal LMs. It reports pooled Pearson r=0.97-0.99 for Mkt-RF recall across 3 seeds, comparable fidelity on sibling factors within 0.15 within-25bps, collapse of a date-to-market-sentiment regression from r=0.74 to r=0.02 after residualizing model recall, parse-rate drop to 21-57% on recent holdouts with conditional r remaining ~0.99, reproduction of dose-response in white-box settings, and detection of memorization via log-prob ranking that open generation misses. It also shows a one-line system prompt blocks 99.8% of a non-adaptive suffix attack.

Significance. If the results hold, this work demonstrates that high performance on public numeric benchmarks in LLMs can reflect pretraining recall rather than independent inference, with direct implications for evaluation practices in finance, economics, and time-series forecasting. The orthogonal controls (residualization collapse, holdout asymmetry, log-prob vs. generation, white-box dose-response) provide internally consistent evidence favoring the memorization channel and address the memorization-vs-inference alternative. The low-cost defense demonstration adds practical utility. These elements strengthen the contribution beyond the abstract alone.

major comments (2)

[Abstract] Abstract: the pooled r=0.97-0.99 for Mkt-RF (and the r=0.74 to r=0.02 collapse) is reported without error bars, standard errors, number of observations, or exact seed-pooling procedure. This detail is load-bearing for evaluating the statistical robustness of the memorization claim.
[Abstract] Abstract / §3 (implied methods): exact query templates, response parsing rules, and exclusion criteria for the holdout set are not specified, which limits independent verification of the parse-rate collapse and conditional-r result even though the controls target the inference alternative.

minor comments (2)

[Abstract] Abstract: the phrase 'within 0.15 within-25bps on the five sibling factors' requires explicit definition of the tolerance metric and which factors are considered siblings.
[Abstract] Abstract: the holdout time period (e.g., specific months or years) for the 21-57% parse-rate result should be stated to allow assessment of recency.
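
One plausible reading of the within-25bps metric, pending the definition the referee requests, is the share of parsed estimates that land within 25 basis points of the true value. The sketch below assumes returns are expressed in percent (so 25 bps equals 0.25 percentage points) and uses the parsed-only denominator named in the Figure 7 caption; the function name and the tolerance parameter are hypothetical.

```python
from typing import Optional, Sequence


def within_tolerance_rate(
    estimates: Sequence[Optional[float]],
    truth: Sequence[float],
    tol_pct_points: float = 0.25,  # assumed: 25 bps, with values in percent
) -> float:
    """Fraction of parsed estimates within the tolerance of the true value.

    Unparsed replies (None) are excluded from the denominator.
    """
    parsed = [(e, t) for e, t in zip(estimates, truth) if e is not None]
    if not parsed:
        raise ValueError("no parsed estimates to score")
    hits = sum(1 for e, t in parsed if abs(e - t) <= tol_pct_points)
    return hits / len(parsed)
```

Under this reading, "within 0.15" on the sibling factors would mean that at most 15% of parsed answers fall inside the 25 bps band.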

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. The comments correctly identify areas where additional statistical detail and methodological transparency will strengthen the paper. We address each point below and will incorporate the requested information in the revised manuscript.

Referee: [Abstract] Abstract: the pooled r=0.97-0.99 for Mkt-RF (and the r=0.74 to r=0.02 collapse) is reported without error bars, standard errors, number of observations, or exact seed-pooling procedure. This detail is load-bearing for evaluating the statistical robustness of the memorization claim.

Authors: We agree that these statistical details are essential. The pooled Pearson r is obtained by concatenating observations across the three independent seeds and computing the correlation on the combined series. In the revision we will report the exact number of observations (3 × number of months), standard errors (via Fisher z-transformation), and 95% confidence intervals both for the Mkt-RF correlations and for the regression coefficients before and after residualization. These additions will appear in the abstract and be elaborated in the methods and results sections. revision: yes
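
The pooling and interval procedure described in this response can be sketched as below. It is a minimal illustration under stated assumptions: unparsed replies are dropped before pooling, the interval uses the standard Fisher z approximation with standard error 1/sqrt(n − 3), and the function names are hypothetical. Treating repeated probes of the same month as independent observations is itself an assumption, so the resulting interval may be narrower than warranted.

```python
import math
from typing import List, Optional, Sequence, Tuple

SeedRun = Tuple[Sequence[Optional[float]], Sequence[float]]  # (estimates, truth)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pooled_r_with_ci(
    seed_runs: Sequence[SeedRun], z_crit: float = 1.96
) -> Tuple[float, float, float, int]:
    """Return (r, ci_low, ci_high, n) on observations concatenated across seeds."""
    est: List[float] = []
    tru: List[float] = []
    for estimates, truth in seed_runs:
        for e, t in zip(estimates, truth):
            if e is not None:  # keep answered months only
                est.append(e)
                tru.append(t)
    n = len(est)
    if n < 4:
        raise ValueError("Fisher interval needs at least 4 observations")
    r = pearson_r(est, tru)
    z = math.atanh(r)            # Fisher z-transform (undefined at |r| = 1)
    se = 1.0 / math.sqrt(n - 3)  # approximate standard error of z
    return r, math.tanh(z - z_crit * se), math.tanh(z + z_crit * se), n
```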
Referee: [Abstract] Abstract / §3 (implied methods): exact query templates, response parsing rules, and exclusion criteria for the holdout set are not specified, which limits independent verification of the parse-rate collapse and conditional-r result even though the controls target the inference alternative.

Authors: We acknowledge that full reproducibility requires these specifications. The revised manuscript will expand Section 3 to include the exact prompt templates used for the API probes, the parsing rules (including regex patterns for numeric extraction and refusal detection), and the precise definition of the recent-release holdout set (dates after each model’s training cutoff). Example templates and a short parsing script will be added to the appendix. These clarifications will not change any reported numbers but will allow independent verification of the parse-rate and conditional-r results. revision: yes
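
A parsing script of the kind promised here might look like the following sketch. The patterns are illustrative assumptions, not the authors' rules: a reply counts as a refusal if it matches a small set of hypothetical phrases, and otherwise the first number followed by a percent sign is taken as the estimate, with a bare number accepted only when it is the entire reply.

```python
import re
from typing import Optional

# Hypothetical refusal phrases; a real rule set would need tuning per model.
REFUSAL_RE = re.compile(
    r"\b(?:i (?:cannot|can't|do not|don't) (?:know|have|provide)"
    r"|unable to|no (?:reliable )?data)\b",
    re.IGNORECASE,
)
# A signed decimal followed by a percent sign (accepts the Unicode minus).
PERCENT_RE = re.compile(r"([-+\u2212]?\d+(?:\.\d+)?)\s*%")
# A reply that is nothing but a signed decimal.
BARE_RE = re.compile(r"[-+\u2212]?\d+(?:\.\d+)?")


def parse_response(text: str) -> Optional[float]:
    """Return the parsed estimate in percent, or None for a refusal or no parse."""
    if REFUSAL_RE.search(text):
        return None
    match = PERCENT_RE.search(text)
    if match:
        token = match.group(1)
    else:
        stripped = text.strip()
        if not BARE_RE.fullmatch(stripped):
            return None  # e.g. a year mentioned in prose is not an estimate
        token = stripped
    return float(token.replace("\u2212", "-"))
```

Requiring the percent sign in free text keeps years such as 2020 from being read as returns, at the cost of missing answers that omit the unit; the parse rate reported by any probe depends on choices like this one.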

Circularity Check

0 steps flagged

No significant circularity

The paper's central measurements (Pearson r on numeric recall, residualization collapse from 0.74 to 0.02, holdout parse-rate drop with preserved conditional r, log-prob and white-box dose-response) are direct empirical observations and diagnostics. None of the reported statistics are fitted parameters renamed as predictions, nor do any load-bearing steps reduce by construction to prior self-citations or ansatzes; the residualization isolates the memorization channel without circularity in the r values themselves. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The measurement framework rests on the assumption that public benchmarks appear verbatim in pretraining corpora and that high-fidelity numeric recall is diagnostic of memorization rather than generalization.

axioms (1)

domain assumption: High Pearson correlation between model output and historical benchmark values indicates memorization from pretraining data.
Core interpretive step linking observed r values to the leakage claim.

invented entities (1)

NumLeak framework (no independent evidence)
purpose: Combined API and white-box probe for numeric leakage detection
Newly defined measurement procedure introduced in the paper.

pith-pipeline@v0.9.1-grok · 5769 in / 1215 out tokens · 22863 ms · 2026-06-29T08:59:50.919737+00:00 · methodology
