Learning to Query History: Nonstationary Classification via Learned Retrieval

Bilel Fehri; Bishal Thapaliya; Deepayan Chakrabarti; Jimmy Gammell; Riyasat Ohib; Yoon Jung

arxiv: 2604.07027 · v1 · submitted 2026-04-08 · 💻 cs.LG


Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3


keywords: nonstationary classification · learned retrieval · distribution shift · historical examples · time series prediction · discrete retrieval · score-based gradient estimator


The pith

A learned retrieval system lets classifiers condition on selected past examples to stay accurate as data distributions shift over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes nonstationary classification as a time-series task where the model predicts from the current input together with a long sequence of earlier labeled examples that reach past the original training window. It solves the scaling problem by training a discrete retrieval step that, for each new input, issues a query and samples only the most relevant historical cases instead of loading the entire archive. Because the retriever is trained end-to-end with the classifier using a score-based gradient estimator, the whole system can keep its growing history on disk rather than in VRAM. A reader would care because deployed models routinely encounter changing conditions that standard training never prepared them for, and this approach gives a concrete way to reuse past data without the usual memory or retraining costs.

Core claim

Nonstationary classification can be solved by conditioning the predictor on a retrieved subsequence of historical labeled examples that extends beyond the training cutoff; the retrieval itself is performed by an input-dependent discrete mechanism that is trained jointly with the classifier via a score-based gradient estimator, allowing the full historical corpus to reside on external storage while still delivering improved robustness to distribution shift.

What carries the argument

Learned discrete retrieval mechanism that produces input-dependent queries to sample relevant past labeled examples, trained end-to-end with a score-based gradient estimator.
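
A minimal sketch of what a score-based (score-function, REINFORCE-style) gradient estimator does for a discrete retrieval step. This is not the authors' implementation: the eight "stored items", the `logits` standing in for retrieval scores, and the per-item `losses` are all hypothetical. The sketch samples one item index from a softmax over the scores and checks that the sampled estimate approaches the exact gradient of the expected loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exact_grad(logits, losses):
    # Gradient of E_{i~p}[loss_i] with respect to the logits: p_j * (loss_j - E[loss]).
    p = softmax(logits)
    return p * (losses - p @ losses)

def score_function_grad(logits, losses, n_samples, rng):
    # Monte Carlo estimate: average of loss_i * grad log p_i over sampled indices i.
    p = softmax(logits)
    idx = rng.choice(len(p), size=n_samples, p=p)
    score = np.eye(len(p))[idx] - p          # grad of log p_i w.r.t. logits is onehot(i) - p
    return (losses[idx][:, None] * score).mean(axis=0)

rng = np.random.default_rng(0)
logits = rng.normal(size=8)      # hypothetical retrieval scores for 8 stored items
losses = rng.uniform(size=8)     # hypothetical classifier loss when conditioning on each item
exact = exact_grad(logits, losses)
est = score_function_grad(logits, losses, 200_000, rng)
print(np.abs(est - exact).max())
```

The point of the construction is that the sampling step itself is never differentiated; only the log-probability of the sampled index is, which is what allows the retrieved items to come from outside the computation graph.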

If this is right

Models can continue to classify accurately after the training distribution has shifted by drawing on earlier examples stored externally.
VRAM consumption grows predictably with the length of the historical sequence rather than requiring the entire archive to be resident.
The same architecture applies to both synthetic nonstationary benchmarks and real review data that exhibits temporal drift.
End-to-end training of the retriever removes the need for separate indexing or heuristic selection rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval pattern could be applied to other tasks that suffer from temporal drift, such as forecasting or recommendation systems.
Production pipelines might reduce retraining frequency by maintaining and querying a growing external history instead of periodic full-model updates.
If the score-based estimator proves stable, similar learned retrieval could be inserted into existing continual-learning pipelines with modest changes.

Load-bearing premise

The learned retrieval step will reliably surface useful historical examples for the current input without introducing new errors or prohibitive overhead.

What would settle it

An experiment in which accuracy under distribution shift is no higher with the retrieval mechanism than with a standard classifier trained only on the original data, or in which memory usage fails to scale predictably with history length.

Figures

Figures reproduced from arXiv: 2604.07027 by Bilel Fehri, Bishal Thapaliya, Deepayan Chakrabarti, Jimmy Gammell, Riyasat Ohib, Yoon Jung.

**Figure 2.** Synthetic settings. (left) Accuracy vs. time in a nonstationary binary classification setting with rotating decision boundary. Models train on t ∈ [0, 0.5] and test on t ∈ [0, 1]. Our method leverages historical context to mitigate performance degradation beyond the training distribution. (right) Training dynamics on a ‘needle in haystack’ task where the label is encoded in one historical item. The system …

**Figure 3.** Amazon Electronics Reviews. (left) Peak VRAM consumption during training vs. hyperparameters. Scaling is consistent with Eqn. 2. (right) Accuracy over time for models trained on pre-2014 data. Our method leverages historical context to mitigate performance degradation on out-of-distribution data beyond 2014. … our framework supports both, we find that in this setting label-only retrieval improves robustness…

**Figure 4.** Additional results in the ‘needle in haystack’ setting. (…

**Figure 5.** Diagram of a Perceiver block (Jaegle et al., 2021). Rather than attending directly to the …

**Figure 6.** Accuracy over time for models trained on pre-2014 data from the electronics category of …
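
The rotating-boundary setting described in the Figure 2 caption can be illustrated with a small generator. This is a sketch under stated assumptions, not the paper's benchmark: the Gaussian inputs, the rotation schedule (angle π·t) and the function names `sample_rotating` and `static_accuracy` are all hypothetical. It shows why a classifier frozen at the start of the stream degrades as t grows.

```python
import numpy as np

def sample_rotating(n, t, rng):
    """Draw n 2-D points at time t; the label is the side of a boundary rotated by pi * t."""
    x = rng.normal(size=(n, 2))
    theta = np.pi * t                          # assumed rotation schedule
    w = np.array([np.cos(theta), np.sin(theta)])
    y = (x @ w > 0).astype(int)
    return x, y

def static_accuracy(x, y):
    """Accuracy of a frozen classifier that uses the t = 0 boundary (normal vector (1, 0))."""
    return float(((x[:, 0] > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
acc_by_time = {
    t: static_accuracy(*sample_rotating(5000, t, rng))
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
}
for t, acc in acc_by_time.items():
    print(f"t={t:.2f}  frozen-classifier accuracy={acc:.3f}")
```

Under this schedule the frozen classifier's accuracy falls roughly linearly from 1 at t = 0 to about 0 at t = 1, which is the kind of degradation that recent labeled history could in principle correct.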

Original abstract

Nonstationarity is ubiquitous in practical classification settings, leading deployed models to perform poorly even when they generalize well to holdout sets available at training time. We address this by reframing nonstationary classification as time series prediction: rather than predicting from the current input alone, we condition the classifier on a sequence of historical labeled examples that extends beyond the training cutoff. To scale to large sequences, we introduce a learned discrete retrieval mechanism that samples relevant historical examples via input-dependent queries, trained end-to-end with the classifier using a score-based gradient estimator. This enables the full corpus of historical data to remain on an arbitrary filesystem during training and deployment. Experiments on synthetic benchmarks and Amazon Reviews '23 (electronics category) show improved robustness to distribution shift compared to standard classifiers, with VRAM scaling predictably as the length of the historical data sequence increases.
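
To make the "history stays on the filesystem" idea concrete, here is a minimal sketch using NumPy's `np.memmap`. The file layout (one flat float32 table with the label in the last column), the file name `history.bin`, and the helpers `write_history` and `fetch` are assumptions made for illustration; the abstract does not say how the authors store or read their corpus.

```python
import os
import tempfile
import numpy as np

def write_history(path, features, labels):
    """Store features and labels side by side in one flat float32 file."""
    table = np.concatenate([features, labels[:, None]], axis=1).astype(np.float32)
    table.tofile(path)
    return table.shape

def fetch(path, shape, indices):
    """Memory-map the file and copy only the requested rows into RAM."""
    store = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    rows = np.array(store[np.sort(indices)])   # rows come back in ascending index order
    return rows[:, :-1], rows[:, -1].astype(int)

rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 16)).astype(np.float32)
labels = rng.integers(0, 2, size=10_000)

path = os.path.join(tempfile.mkdtemp(), "history.bin")
shape = write_history(path, features, labels)

picked = np.array([9_000, 3, 512])             # stand-in for indices chosen by a retriever
x_hist, y_hist = fetch(path, shape, picked)
print(x_hist.shape, y_hist)
```

In a sketch like this, the memory that the model sees depends on the number of retrieved rows rather than on the size of the file, which is the property the abstract attributes to the method.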

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tries to fix nonstationary classification by learning discrete queries to pull relevant history, but the score-based training for those queries is the obvious weak link.


The main move here is to treat nonstationary classification as a retrieval problem: instead of predicting from the current input alone, the model conditions on a sequence of past labeled examples drawn from a large history via learned input-dependent queries. The queries are trained end-to-end with the classifier using a score-based gradient estimator, which lets the full corpus stay on disk rather than in VRAM. That part is practical and directly targets a common deployment headache where models degrade on shifted data but you still have the old records available. The synthetic benchmarks and Amazon Reviews electronics experiments are presented as evidence that this improves robustness over standard classifiers while keeping memory scaling predictable with history length. That framing and the filesystem trick are the clearest new pieces.

The soft spot is exactly the one the stress-test flags. Score-based estimators for discrete sampling carry high variance by default, and nothing in the abstract or setup description shows variance reduction, baseline subtraction, or even simple diagnostics that the retrieved items are more relevant than random draws. Without those, it is hard to know whether the gains come from the retrieval mechanism or from incidental effects like extra capacity or regularization.

The paper engages the nonstationarity literature honestly and sets up a reproducible-looking experimental direction, so it is worth a referee's time to check whether the training actually produces useful queries and whether the robustness holds under tighter controls. For someone working on continual learning or retrieval-augmented models this is worth reading; for most others it is a maybe. I would send it to review rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The paper reframes nonstationary classification as conditioning a model on a sequence of historical labeled examples retrieved from a large corpus stored on disk. It introduces a learned discrete retrieval mechanism that generates input-dependent queries, trained end-to-end with the classifier via a score-based gradient estimator. Experiments on synthetic benchmarks and the Amazon Reviews '23 electronics category report improved robustness to distribution shift relative to standard classifiers, with VRAM usage scaling predictably with history length.

Significance. If the retrieval mechanism reliably surfaces relevant history, the approach offers a scalable way to mitigate degradation in deployed classifiers under nonstationarity without requiring the entire history in memory. The filesystem-based design and explicit time-series reframing are practical strengths that could influence handling of streaming or shifting data in production systems.

major comments (1)

§3.2 (score-based gradient estimator for discrete retrieval): the central claim that input-dependent queries improve robustness requires that the REINFORCE-style estimator reliably learns useful retrieval policies. No variance reduction (baseline subtraction, control variates) or diagnostics (e.g., oracle overlap on synthetic data showing retrieved items exceed random relevance) are described; without these, reported gains may arise from longer context or regularization rather than the learned mechanism, undermining attribution to the proposed retrieval.
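
For readers unfamiliar with the variance concern, the following is a toy illustration of baseline subtraction in a score-function estimator. It is not taken from the paper: the 16 items, the `logits`, and the `losses` with a large common offset are hypothetical, and the baseline used here is the exact expected loss, which in practice would have to be approximated (for example by a running mean).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def per_sample_grads(logits, losses, n_samples, rng, baseline=0.0):
    """One score-function gradient sample per row: (loss_i - baseline) * grad log p_i."""
    p = softmax(logits)
    idx = rng.choice(len(p), size=n_samples, p=p)
    score = np.eye(len(p))[idx] - p
    return (losses[idx] - baseline)[:, None] * score

rng = np.random.default_rng(1)
logits = rng.normal(size=16)               # hypothetical retrieval scores
losses = 5.0 + rng.uniform(size=16)        # hypothetical losses with a large common offset
p = softmax(logits)

plain = per_sample_grads(logits, losses, 50_000, rng)
centered = per_sample_grads(logits, losses, 50_000, rng, baseline=p @ losses)

var_plain = plain.var(axis=0).sum()
var_centered = centered.var(axis=0).sum()
print(f"total variance without baseline: {var_plain:.3f}")
print(f"total variance with baseline:    {var_centered:.3f}")
```

Both versions estimate the same gradient, because the score has zero mean under the sampling distribution; only the spread of the individual samples changes. That is why the absence of any such control in the manuscript matters for judging whether the retriever can be trained reliably.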

minor comments (2)

Abstract and §4: the experimental section should report statistical significance, number of runs, and explicit comparison to strong baselines that also access history (e.g., simple concatenation or non-learned retrieval) to isolate the contribution of the learned queries.
Notation: the distinction between the query network parameters and the classifier parameters should be made explicit in the training objective to clarify what is optimized jointly.
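
One way the requested notation could look, offered as an illustration rather than as the paper's own objective. All symbols are assumptions: θ for the classifier parameters, φ for the query network parameters, \mathcal{H} for the stored history, S for the retrieved subset, q_φ for the retrieval distribution, f_θ for the classifier and ℓ for the per-example loss.

```latex
\begin{align}
\mathcal{L}(\theta,\phi)
  &= \mathbb{E}_{(x,y)}\,\mathbb{E}_{S\sim q_\phi(\cdot\mid x,\mathcal{H})}
     \big[\ell\big(f_\theta(x,S),\,y\big)\big],\\
\nabla_\theta \mathcal{L}
  &= \mathbb{E}_{(x,y)}\,\mathbb{E}_{S\sim q_\phi(\cdot\mid x,\mathcal{H})}
     \big[\nabla_\theta\,\ell\big(f_\theta(x,S),\,y\big)\big],\\
\nabla_\phi \mathcal{L}
  &= \mathbb{E}_{(x,y)}\,\mathbb{E}_{S\sim q_\phi(\cdot\mid x,\mathcal{H})}
     \big[\ell\big(f_\theta(x,S),\,y\big)\,
     \nabla_\phi \log q_\phi(S\mid x,\mathcal{H})\big].
\end{align}
```

Written this way, θ receives an ordinary pathwise gradient through the classifier, while φ receives only the score-function term, which makes clear which parameters are exposed to the estimator's variance.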

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript to strengthen the attribution of gains to the learned retrieval mechanism.


Referee: §3.2 (score-based gradient estimator for discrete retrieval): the central claim that input-dependent queries improve robustness requires that the REINFORCE-style estimator reliably learns useful retrieval policies. No variance reduction (baseline subtraction, control variates) or diagnostics (e.g., oracle overlap on synthetic data showing retrieved items exceed random relevance) are described; without these, reported gains may arise from longer context or regularization rather than the learned mechanism, undermining attribution to the proposed retrieval.

Authors: We agree that the score-based gradient estimator (a REINFORCE-style estimator for discrete sampling) lacks explicit variance reduction in the current manuscript, and no oracle-overlap diagnostics are reported. This is a fair concern: without such controls it is possible that gains stem from simply conditioning on longer history rather than from input-dependent query learning. In the revised version we will (i) add a simple baseline-subtracted estimator and report gradient variance during training, and (ii) include an oracle-relevance diagnostic on the synthetic benchmarks (where ground-truth relevant history is known) showing that the learned policy retrieves items with higher overlap than random or fixed retrieval. These additions will directly address attribution.

Revision: yes
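
The oracle-relevance diagnostic mentioned in the rebuttal could take roughly the following form. This is a hypothetical sketch: the function `topk_overlap`, the 1,000-item history, the ten ground-truth relevant items, and the artificially boosted `informed_scores` (a stand-in for a retriever that has learned something) are all invented for illustration and do not come from the paper.

```python
import numpy as np

def topk_overlap(scores, relevant, k):
    """Fraction of the k highest-scoring history items that are truly relevant."""
    top = np.argsort(scores)[-k:]
    return float(np.isin(top, relevant).mean())

rng = np.random.default_rng(0)
n_items, k = 1000, 10
relevant = rng.choice(n_items, size=10, replace=False)   # ground truth, known on synthetic data

random_scores = rng.normal(size=n_items)                 # untrained retriever: scores are noise
informed_scores = random_scores.copy()
informed_scores[relevant] += 10.0                        # stand-in for a retriever that has learned

overlap_random = topk_overlap(random_scores, relevant, k)
overlap_informed = topk_overlap(informed_scores, relevant, k)
print(f"overlap with random scores:   {overlap_random:.2f}")
print(f"overlap with informed scores: {overlap_informed:.2f}")
```

Reporting this overlap over the course of training, against the random-score reference, would show whether the learned queries actually move toward the relevant items.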

Circularity Check

0 steps flagged

No circularity detected; new mechanism presented without reduction to inputs by construction.


The abstract reframes nonstationary classification as conditioning the classifier on historical sequences and introduces a learned discrete retrieval mechanism trained end-to-end via score-based gradient estimation. No equations, derivations, or self-citations are shown that would make any claimed prediction or result equivalent to its inputs by definition or by fitting. The approach is positioned as a novel scaling solution for large historical corpora, with empirical validation on synthetic benchmarks and Amazon Reviews data providing independent support. This satisfies the criteria for a self-contained derivation without load-bearing self-references, fitted inputs renamed as predictions, or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that historical labeled data remains relevant and retrievable, plus the effectiveness of the score-based estimator for discrete sampling.

free parameters (1)

query network parameters
Learned end-to-end with the classifier; no specific values given.

axioms (1)

domain assumption: Historical labeled examples contain information useful for predicting future labels under distribution shift
Invoked by reframing the task as time series prediction conditioned on history.

invented entities (1)

learned discrete retrieval mechanism (no independent evidence)
purpose: To sample relevant historical examples via input-dependent queries
New component introduced to scale to large histories while keeping data on disk.

pith-pipeline@v0.9.0 · 5460 in / 1306 out tokens · 100004 ms · 2026-05-10T19:02:15.514814+00:00 · methodology

