GITCO: Gated Inference-Time Context Optimization in TSFMs

Dhruv Kumar; Manya Pandey; Murari Mandal; Saurabh Deshpande

arxiv: 2606.05332 · v1 · pith:OGUVCWCYnew · submitted 2026-06-03 · 💻 cs.AI

Pith reviewed 2026-06-28 06:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords: time series foundation models, context optimization, inference time, patch-based models, context poisoning, forecast accuracy, GITCO

The pith

GITCO improves TSFM forecast accuracy by suppressing anomalous patches at inference time without model updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Patch-based time series foundation models suffer context poisoning when structurally anomalous patches capture disproportionate attention and degrade zero-shot forecasts. GITCO counters this with a lightweight Gate-Router-Critic system that identifies and suppresses harmful patches using only inference-time signals. On TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, the method delivers an average 1.95% MASE reduction while recovering 89.9% of the maximum possible gain from context changes. The work also defines context sensitivity profiles that map time series meta-features to expected accuracy gains under such interventions.

Core claim

A gated inference-time context optimization framework with Gate, Router, and Critic components can selectively suppress structurally anomalous patches in patch-based TSFMs, improving zero-shot forecast quality through input context optimization alone.

What carries the argument

The GITCO framework: a three-component inference-time system (Gate, Router, Critic) that detects and suppresses harmful patches without parameter updates or model internals access.

If this is right

Context optimization at inference time can recover most of the accuracy lost to context poisoning without any retraining.
Context sensitivity profiles characterize how TSFM accuracy responds to data meta-features under inference-time intervention.
The gains hold under K-fold cross-validation across 53 diverse datasets.
Improvements require no changes to model weights or architecture.
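
One way to picture a context sensitivity profile is as a fitted map from per-series meta-features to the accuracy gain observed under intervention. The sketch below is not the paper's procedure, which the abstract does not specify. It assumes two hypothetical meta-features (`spectral_entropy`, `outlier_ratio`), fully synthetic gains, and an ordinary least-squares fit with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: one value per series for two hypothetical meta-features.
n_series = 200
spectral_entropy = rng.uniform(0.0, 1.0, n_series)
outlier_ratio = rng.uniform(0.0, 0.2, n_series)

# Synthetic "observed" MASE gain (%) under intervention.
# The coefficients are invented for illustration, not taken from the paper.
true_coef = np.array([0.5, -1.0, 20.0])  # intercept, entropy, outlier ratio
X = np.column_stack([np.ones(n_series), spectral_entropy, outlier_ratio])
gain_pct = X @ true_coef + rng.normal(0.0, 0.01, n_series)

# Fit the profile: a linear map from meta-features to expected gain.
coef, *_ = np.linalg.lstsq(X, gain_pct, rcond=None)

def expected_gain(entropy: float, outliers: float) -> float:
    """Predicted MASE gain (%) for a series with the given meta-features."""
    return float(coef @ np.array([1.0, entropy, outliers]))

print(round(expected_gain(0.5, 0.1), 2))
```

A linear fit is only the simplest choice here; the abstract describes the profile as shaped jointly by model architecture and data structure, so a real profile may well be nonlinear.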

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inference-time suppression idea could be tested on non-time-series foundation models that use patch or token context.
Context sensitivity profiles might serve as a diagnostic tool for choosing which TSFM to deploy on a given dataset.
Adaptive versions of the Gate component could adjust suppression thresholds based on real-time data statistics.

Load-bearing premise

The Gate, Router, and Critic can reliably detect and suppress structurally anomalous patches using only inference-time signals, and this detection generalizes across datasets without model access or updates.

What would settle it

If GITCO produces no MASE improvement or negative results when tested on TimesFM 2.5 with new held-out data or different TSFMs, the claim that inference-time signals suffice for reliable patch suppression would be refuted.

Figures

Figures reproduced from arXiv: 2606.05332 by Dhruv Kumar, Manya Pandey, Murari Mandal, Saurabh Deshpande.

**Figure 1.** The GITCO Pipeline: A raw 512-step context window (16 patches of 32 steps) is passed to the Gate, which computes meta-features to decide whether intervention is warranted. If so, the Critic scores all 16 patches and identifies the most disruptive one (highlighted red); soft-denoising via SMA produces the GITCO-conditioned context.
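
The data flow in the Figure 1 caption can be sketched as follows. Only the window layout (512 steps, 16 patches of 32) and the moving-average smoothing step come from the caption. The Gate rule (excess kurtosis above a threshold) and the Critic score (deviation of patch standard deviation from the median) are hypothetical stand-ins, since the page does not disclose the actual criteria, and the Router is omitted because its role is not described here.

```python
import numpy as np

PATCH_LEN = 32   # from the Figure 1 caption
N_PATCHES = 16   # 512-step context / 32-step patches

def gate(context, kurtosis_threshold=1.0):
    """Hypothetical Gate: intervene only when excess kurtosis is high."""
    z = (context - context.mean()) / (context.std() + 1e-8)
    return float(np.mean(z ** 4) - 3.0) > kurtosis_threshold

def critic(context):
    """Hypothetical Critic: index of the patch whose std is furthest from the median."""
    stds = context.reshape(N_PATCHES, PATCH_LEN).std(axis=1)
    return int(np.argmax(np.abs(stds - np.median(stds))))

def sma_denoise(context, patch_idx, window=5):
    """Replace one patch with its simple moving average (odd window, edge-padded)."""
    out = context.copy()
    start = patch_idx * PATCH_LEN
    padded = np.pad(out[start:start + PATCH_LEN], window // 2, mode="edge")
    out[start:start + PATCH_LEN] = np.convolve(padded, np.ones(window) / window, mode="valid")
    return out

def gitco_like(context):
    """Gate, then Critic, then soft-denoise the flagged patch; otherwise pass through."""
    if context.shape != (N_PATCHES * PATCH_LEN,):
        raise ValueError("expected a 512-step context window")
    if not gate(context):
        return context
    return sma_denoise(context, critic(context))

# Toy usage: a clean sine wave with one noisy patch (index 9).
rng = np.random.default_rng(0)
t = np.arange(N_PATCHES * PATCH_LEN)
clean = np.sin(2 * np.pi * t / 64)
noisy = clean.copy()
noisy[9 * PATCH_LEN:10 * PATCH_LEN] += rng.normal(0.0, 3.0, PATCH_LEN)
conditioned = gitco_like(noisy)
```

The conditioned window would then be passed to the unchanged forecaster; nothing in this sketch touches model weights.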

Original abstract

Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight three-component framework: Gate, Router, and Critic that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average +1.95% MASE reduction on TimesFM 2.5 while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.
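
For readers unfamiliar with the metric, the sketch below computes MASE in its standard form (forecast MAE scaled by the in-sample MAE of a seasonal-naive forecast) and the relative reduction between a baseline and a treated forecast. The toy numbers are invented for illustration and are unrelated to the paper's results.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Forecast MAE divided by the in-sample MAE of the seasonal-naive forecast."""
    y_true, y_pred, y_train = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train))
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)

def pct_reduction(baseline, treated):
    """Relative MASE reduction in percent; positive means the treated forecast is better."""
    return 100.0 * (baseline - treated) / baseline

# Toy example: the naive one-step error on the history is exactly 1.
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_true = [6.0, 7.0]
base = mase(y_true, [5.0, 5.0], y_train)      # absolute errors 1 and 2
treated = mase(y_true, [5.5, 6.5], y_train)   # absolute errors 0.5 and 0.5
print(base, treated, round(pct_reduction(base, treated), 1))
```

Whether the reported 1.95% is an average of per-dataset reductions or a reduction of the average MASE is not stated in the abstract, and the two can differ.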

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GITCO adds a three-component inference-time patch filter for TSFMs with small reported gains, but the abstract leaves the actual mechanics and robustness unverified.

The core contribution is a lightweight Gate-Router-Critic setup that tries to spot and drop structurally bad patches from the input context of patch-based TSFMs like TimesFM, all without touching weights. They also define context sensitivity profiles as a way to map dataset meta-features to expected gains from this kind of intervention.

What stands out is the scale of the test: 53 GIFT-Eval datasets, K-fold cross-validation, and a concrete +1.95% average MASE drop that reaches 89.9% of some upper bound. That is more evaluation than most inference-time tweaks get, and the zero-update constraint is a practical selling point for deployed models.

The soft spots are mostly around verification. The abstract gives numbers but no equations, no pseudocode for how the Gate or Critic actually scores patches, and no error bars or per-dataset breakdowns. Without those, it is impossible to tell whether the heuristics generalize or simply exploit quirks in the GIFT-Eval collection. The K-fold protocol also raises the usual question of whether any per-fold tuning crept in, which would undercut the pure inference-time claim. The stress-test note about unverified heuristics is fair; until the full methods section shows the rules are architecture-agnostic and not post-hoc, the result stays provisional.

This is aimed at the TSFM subfield, especially groups already running TimesFM or similar patch models who want a quick robustness patch. It is not a foundational advance, but the empirical framing is honest enough that a serious referee should look at it to check reproducibility and whether the components actually do what the abstract claims.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes GITCO, a lightweight three-component (Gate, Router, Critic) framework for inference-time optimization of input context in patch-based Time Series Foundation Models (TSFMs) to mitigate context poisoning from structurally anomalous patches, without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, it reports an average +1.95% MASE reduction while capturing 89.9% of the improvement upper bound, and introduces context sensitivity profiles as a new characterizable property of TSFMs.

Significance. If the empirical results hold under the zero-update constraint and the components generalize using only inference-time signals, GITCO could provide a practical method for improving zero-shot forecasting accuracy in existing TSFMs. The context sensitivity profiles offer a potentially useful new analysis tool for relating meta-features to intervention gains.

major comments (3)

[Abstract] Abstract: the reported +1.95% MASE reduction and 89.9% capture of the upper bound are presented without error bars, statistical tests, or an explicit definition/computation of the upper bound, preventing verification of the central empirical claim.
[Evaluation] Evaluation protocol: K-fold cross-validation across the 53 datasets creates a risk of implicit per-fold adaptation in the Gate/Router/Critic components, which would undermine the claim of strictly zero parameter updates and inference-time-only operation.
Framework description: the Gate, Router, and Critic are said to identify anomalous patches using only inference-time signals and patch statistics, but no explicit heuristics, meta-feature rules, or decision criteria are provided, leaving open whether these rules are architecture-agnostic or overfit to GIFT-Eval artifacts.
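
The error bars requested in the first comment could be produced in several ways. A minimal sketch, assuming per-dataset MASE reductions are available, is a percentile bootstrap over datasets. The 53 values below are synthetic placeholders, not the paper's numbers, and `bootstrap_mean_ci` is a helper written for this illustration.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-dataset values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample datasets with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lo), float(hi)

# Synthetic per-dataset MASE reductions (%) for 53 datasets.
rng = np.random.default_rng(1)
reductions = rng.normal(loc=2.0, scale=3.0, size=53)

mean, lo, hi = bootstrap_mean_ci(reductions)
print(f"mean reduction {mean:.2f}% (95% CI {lo:.2f}% to {hi:.2f}%)")
```

An interval that excludes zero would support the headline claim more strongly than the point estimate alone.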

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and indicate where we will revise the paper.

Referee: [Abstract] Abstract: the reported +1.95% MASE reduction and 89.9% capture of the upper bound are presented without error bars, statistical tests, or an explicit definition/computation of the upper bound, preventing verification of the central empirical claim.

Authors: We agree that the abstract would benefit from greater explicitness for verification. The upper bound is defined in the manuscript as the MASE reduction obtained by an oracle that perfectly suppresses all anomalous patches. Error bars and statistical tests appear in the results. We will revise the abstract to include a concise definition of the upper bound. revision: yes
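
To make the capture figure concrete, the sketch below uses a simplified oracle that picks, for each window, whichever candidate intervention has the lowest true error. This is an assumption: it may not match the oracle in the manuscript, and every number in the error table is invented. The last lines show the arithmetic implied by the abstract if capture is the ratio of the average achieved gain to the average oracle gain.

```python
import numpy as np

# Rows are forecast windows; columns are candidate interventions.
# Column 0 means "no intervention". All values are invented.
errors = np.array([
    [1.00, 0.90, 1.10],
    [0.80, 0.85, 0.70],
    [1.20, 1.20, 1.25],
])
chosen = np.array([1, 1, 0])  # picks of a hypothetical inference-time selector

baseline = errors[:, 0].mean()                            # never intervene
oracle = errors.min(axis=1).mean()                        # best choice in hindsight
achieved = errors[np.arange(len(chosen)), chosen].mean()  # what the selector got

capture = (baseline - achieved) / (baseline - oracle)
print(f"captured {capture:.0%} of the oracle gain")

# Same ratio applied to the abstract's figures (an inference, not a reported number).
implied_oracle_gain_pct = 1.95 / 0.899
print(f"implied oracle gain: about {implied_oracle_gain_pct:.2f}%")
```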

Referee: [Evaluation] Evaluation protocol: K-fold cross-validation across the 53 datasets creates a risk of implicit per-fold adaptation in the Gate/Router/Critic components, which would undermine the claim of strictly zero parameter updates and inference-time-only operation.

Authors: The K-fold procedure evaluates forecasting performance only and does not adapt or update any parameters of the Gate, Router, or Critic. These components use fixed, inference-time signals exclusively. We will add an explicit statement in the evaluation section confirming that no per-fold adaptation of the components occurs. revision: yes
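
The page does not say what, if anything, is fit inside each fold. If some quantity such as a Gate threshold were tuned, one leakage-free arrangement would split at the dataset level, choose the threshold on the training folds only, and score it on the held-out fold. The sketch below illustrates that arrangement with synthetic data; the single meta-feature, the candidate grid, and both helper functions are hypothetical.

```python
import numpy as np

def kfold_indices(n_items, k, seed=0):
    """Shuffle item indices and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_items), k)

def pick_threshold(feature, gain, candidates):
    """Threshold maximising total gain when intervening only where feature > threshold."""
    totals = [gain[feature > t].sum() for t in candidates]
    return candidates[int(np.argmax(totals))]

# Synthetic setup: one meta-feature per dataset and the gain intervention would yield.
rng = np.random.default_rng(0)
n_datasets, k = 53, 5
feature = rng.uniform(0.0, 1.0, n_datasets)
gain = 4.0 * (feature - 0.5) + rng.normal(0.0, 0.5, n_datasets)
candidates = np.linspace(0.0, 1.0, 21)

folds = kfold_indices(n_datasets, k)
held_out_gain = np.zeros(n_datasets)
for test_idx in folds:
    train_mask = np.ones(n_datasets, dtype=bool)
    train_mask[test_idx] = False
    # The threshold never sees the held-out datasets.
    threshold = pick_threshold(feature[train_mask], gain[train_mask], candidates)
    intervene = feature[test_idx] > threshold
    held_out_gain[test_idx] = np.where(intervene, gain[test_idx], 0.0)

print(f"mean held-out gain: {held_out_gain.mean():.2f}")
```

Under this arrangement the forecaster's weights stay frozen, but the selection rule is still learned from data, which is the distinction the referee's comment turns on.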

Referee: Framework description: the Gate, Router, and Critic are said to identify anomalous patches using only inference-time signals and patch statistics, but no explicit heuristics, meta-feature rules, or decision criteria are provided, leaving open whether these rules are architecture-agnostic or overfit to GIFT-Eval artifacts.

Authors: We acknowledge that the manuscript currently gives only a high-level description of the components. We will expand the framework section to supply the explicit heuristics, meta-feature rules, and decision criteria, allowing readers to assess their generality and independence from GIFT-Eval specifics. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivation chain

The paper proposes GITCO as an inference-time method and reports measured MASE reductions on held-out K-fold splits across 53 datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text. The central claim is an observed performance delta under explicit zero-update constraints, which is externally falsifiable on the same benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes the existence of identifiable anomalous patches and the effectiveness of the three-component design.
