pith. sign in

arxiv: 2605.31108 · v1 · pith:C3H6PVLSnew · submitted 2026-05-29 · 💻 cs.CV · cs.LG

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

Pith reviewed 2026-06-28 22:59 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords domain incremental learningtest-time trainingLoRA adaptersmasked autoencodervideo streamscatastrophic forgetting
0
0 comments X

The pith

At inference, online test-time training on a masked autoencoder head selects the matching LoRA adapter to recover domain knowledge after induced forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes allowing catastrophic forgetting in domain incremental learning by training domain-specific LoRA adapters. A main task head is paired with a self-supervised masked autoencoder head. Each adapter specializes to its domain, inducing forgetting on others. At inference on video streams, online test-time training on the MAE head selects the best-matching LoRA to recover domain knowledge. This is demonstrated on action recognition and semantic segmentation under gradual domain shifts.

Core claim

By inducing forgetting through specialized LoRA adapters during incremental training and recovering the right adapter at inference via test-time training on the MAE reconstruction task, the model adapts to non-stationary video streams where domains evolve gradually.

What carries the argument

Domain-specific LoRA adapters combined with a self-supervised MAE head for online adapter selection at test time.

Load-bearing premise

Test-time training on the MAE head produces a reliable signal to select the correct LoRA adapter during gradual domain shifts in video.

What would settle it

A case where the MAE reconstruction error fails to identify the matching LoRA for inputs from a shifted domain in video sequences.

Figures

Figures reproduced from arXiv: 2605.31108 by Jonathan Swinnen, Tinne Tuytelaars.

Figure 1
Figure 1. Figure 1: A general overview of our domain incremental learning method: At training￾time (left), we train a new LoRA for each new domain on both the main downstream task and an auxiliary MAE task. At test-time (right), we use a weighted combination of all previously trained LoRAs and optimize their weighting coefficients online using test-time training on the MAE task. sample belongs to, treating them all as IID sam… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of our model architecture for the case of video action recognition. The upper branch consists of an encoder E and a decoder D, which are trained as an MAE. The lower branch shares the same encoder but uses a different head M for the main downstream task. and we randomly initialize the parameters θM of the main task head M. We do not discard the MAE decoder, instead we keep it and continue fine… view at source ↗
Figure 3
Figure 3. Figure 3: Some example frames illustrating the task sequence of the NTU-RGB+D120 dataset (left) and the Places365 + Domains-Campus datasets (right). well suited for domain-incremental learning. The initial pre-training task, T0, comprises recordings within a single room. The subsequent tasks, T1 through T8, illustrated in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: visualizes the behavior of these TTT coefficients ck over time in the Domains-Campus test stream. We can clearly see that the coefficient correspond￾ing to each domain gains importance when entering its domain. There is some confusion in the ’basement’ domain, where at some point the ’upstairs’ and ’garage’ domain coefficients also contribute significantly. This happens due to some visual similarities betw… view at source ↗
Figure 5
Figure 5. Figure 5: Effect of key hyperparameters of the test-time training procedure. Each plot varies a single hyperparameter while keeping the others fixed to their default value (η = 0.05, λ = 0.03, τ = 0.2, m = 16, n = 16, w = 32). We plot the mean performance over 3 random training seeds and 3 inference seeds per training seed, for a total of 9 evaluations per config. The shaded band shows ±1 standard deviation. 12 [PI… view at source ↗
Figure 6
Figure 6. Figure 6: The modified VideoMAE-based encoder architecture. The encoder consists of 12 ViT blocks, where we use RoPE positional encoding for the time axis. The encoder outputs the features Fi of every block i = 0, 1, ..., 12. We use an attention mask for causal windowed attention, and KV-Caching is possible for streaming inference [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The modified cross-attention decoder architecture, inspired by [11]. We compute a reconstruction for each of the given mask tokens, which serve as queries for attention. Keys and values are computed from linear combinations of the encoder output features. As there is no self-attention between the mask tokens, each one is reconstructed in￾dependently, and we could decide to only reconstruct a subset of the … view at source ↗
Figure 8
Figure 8. Figure 8: The classifier head for action classification. The architecture is quite similar to the MAE decoder in [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Overview of the entire action classification architecture, combining the encoder, decoder and classifier. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Example causal sliding window attention mask which simulates the KV-cache at training time. At inference, we maintain separate KV-caches for the main task branch Fmain and for the MAE branch FMAE. To keep the KV-cache of the MAE branch up to date, we now also evaluate the MAE branch when no TTT backpropagation step is required. To correctly use the temporal attention, we also do not randomly sample frames… view at source ↗
Figure 12
Figure 12. Figure 12: Sparse frame sampling strategy used to speed up training the first 50 epochs of T0. cumulation). Parameters that are not mentioned in this table remain unchanged from Tab. 3. We only train on full length action clips. We do not perform sparse frame sampling. We again use a single Nvidia RTX 3090 GPU for training. Training all 8 tasks takes roughly 10 hours. For the training of the full finetuning baseline… view at source ↗
read the original abstract

In this work we introduce a novel approach to domain incremental learning, adapting models over time to evolving, non-stationary data. In contrast to other works, we do not attempt to avoid catastrophic forgetting, but rather allow it and exploit it. Our model combines a main task head with a self-supervised masked autoencoder (MAE) head. We then learn domain-specific LoRA adapters during incremental training. Each adapter specializes to its domain, naturally inducing forgetting on other domains in both heads. At inference, we perform online test-time training on the self-supervised MAE head to identify which LoRAs best matches the current input, so the model can `remember' the domain again. Our scheme is especially well-suited to real-world streaming data, such as video, where consecutive samples are highly correlated and domain shifts are gradual. We demonstrate our method on domain-incremental action recognition and semantic segmentation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a domain-incremental learning method that deliberately allows and exploits catastrophic forgetting rather than preventing it. A shared backbone is paired with a main task head and a self-supervised MAE reconstruction head; domain-specific LoRA adapters are learned during incremental training so that each adapter specializes to its domain while inducing forgetting elsewhere. At inference on video streams, online test-time training is performed on the MAE head to identify which LoRA best matches the current input, thereby 'remembering' the appropriate domain adapter. The approach is claimed to be particularly suited to gradual domain shifts in temporally correlated video data and is demonstrated on domain-incremental action recognition and semantic segmentation.

Significance. If the selection mechanism proves reliable, the work would offer a conceptually distinct alternative to conventional continual-learning strategies that focus on forgetting mitigation. By turning domain-specific forgetting into a feature that can be recovered via test-time adaptation, the method could simplify handling of non-stationary streaming data without requiring explicit domain labels or rehearsal buffers.

major comments (1)
  1. [Abstract] Abstract (final paragraph): the claim that the scheme is 'especially well-suited' to video streams with gradual domain shifts rests on the unverified assumption that TTT on the MAE head will produce an unambiguous argmin over reconstruction losses for the correct LoRA. When consecutive frames lie in an intermediate regime between two domains, multiple LoRAs may yield comparable losses, rendering selection noisy or unstable; this is load-bearing for the entire 'remembering' procedure and is not addressed by any analysis or experiment in the provided text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and for identifying a key assumption underlying our claims about suitability for video streams. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract (final paragraph): the claim that the scheme is 'especially well-suited' to video streams with gradual domain shifts rests on the unverified assumption that TTT on the MAE head will produce an unambiguous argmin over reconstruction losses for the correct LoRA. When consecutive frames lie in an intermediate regime between two domains, multiple LoRAs may yield comparable losses, rendering selection noisy or unstable; this is load-bearing for the entire 'remembering' procedure and is not addressed by any analysis or experiment in the provided text.

    Authors: We agree that the manuscript does not provide explicit analysis or experiments examining the behavior of the MAE reconstruction losses (and resulting argmin) when input frames lie in transitional or intermediate regimes between domains. Our existing experiments demonstrate overall task performance under gradual shifts, but they do not isolate or quantify selection stability in such boundary cases. We will revise the paper by adding a targeted analysis (new subsection and supplementary figures) that measures per-LoRA reconstruction losses on synthetic and real transitional sequences, reports the frequency of ambiguous selections, and evaluates the impact on downstream task accuracy. This will either substantiate or qualify the suitability claim. revision: yes

Circularity Check

0 steps flagged

No circularity in method description or inference procedure

full rationale

The paper describes a procedural approach combining a task head with an MAE head, training domain-specific LoRA adapters, and using online TTT on the MAE head at inference to select the matching adapter. No equations, derivations, or parameter-fitting steps are presented that reduce by construction to the method's own inputs or prior self-citations. The central claim relies on an empirical assumption about reconstruction loss providing a selection signal under gradual video shifts, but this is not a self-definitional or fitted-input reduction; it remains an externally testable hypothesis. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that video streams exhibit gradual shifts and high temporal correlation, allowing the MAE head to serve as a reliable domain identifier during test-time training. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Domain shifts in video streams are gradual and consecutive samples are highly correlated.
    Explicitly invoked in the abstract to justify why the scheme is especially well-suited to video.

pith-pipeline@v0.9.1-grok · 5683 in / 1389 out tokens · 23474 ms · 2026-06-28T22:59:54.005606+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 9 canonical work pages · 7 internal anchors

  1. [1]

    BEiT: BERT Pre-Training of Image Transformers

    Bao, H., Dong, L., Piao, S., Wei, F.: Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)

  2. [2]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

  3. [3]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

  4. [4]

    In- ternational Journal of Computer Vision132(1), 208–223 (2024)

    Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., Wang, J.: Context autoencoder for self-supervised representation learning. In- ternational Journal of Computer Vision132(1), 208–223 (2024)

  5. [5]

    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1280–1289 (2021), https://api.semanticscholar.org/CorpusID:244799297

  6. [6]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1290–1299 (2022)

  7. [7]

    arXiv preprint arXiv:2311.02428 (2023)

    Chitale, R., Vaidya, A., Kane, A., Ghotkar, A.: Task arithmetic with lora for continual learning. arXiv preprint arXiv:2311.02428 (2023)

  8. [8]

    IEEE transactions on pattern analysis and machine intelligence 44(7), 3366–3385 (2021)

    De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., Tuytelaars, T.: A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence 44(7), 3366–3385 (2021)

  9. [9]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., Ta- lattof, A., Yuan, A., Souti, B., Meredith, B., et al.: Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561 (2023)

  10. [10]

    Advances in neural information processing systems35, 35946–35958 (2022)

    Feichtenhofer, C., Li, Y., He, K., et al.: Masked autoencoders as spatiotempo- ral learners. Advances in neural information processing systems35, 35946–35958 (2022)

  11. [11]

    arXiv preprint arXiv:2401.14391 (2024)

    Fu, L., Lian, L., Wang, R., Shi, B., Wang, X., Yala, A., Darrell, T., Efros, A.A., Goldberg, K.: Rethinking patch dependence for masked autoencoders. arXiv preprint arXiv:2401.14391 (2024)

  12. [12]

    Advances in Neural Information Processing Systems35, 29374–29385 (2022)

    Gandelsman, Y., Sun, Y., Chen, X., Efros, A.: Test-time training with masked au- toencoders. Advances in Neural Information Processing Systems35, 29374–29385 (2022)

  13. [13]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Gao,Q.,Zhao,C.,Sun,Y.,Xi,T.,Zhang,G.,Ghanem,B.,Zhang,J.:Aunifiedcon- tinual learning framework with general parameter-efficient tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11483–11493 (2023) 14

  14. [14]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

  15. [15]

    ICLR1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

  16. [16]

    In: International Conference on Image and Graphics

    Hu, Y., Hou, J., Liu, X., Sun, X., Guo, W.: Video domain incremental learning for human action recognition in home environments. In: International Conference on Image and Graphics. pp. 316–327. Springer (2025)

  17. [17]

    In: European conference on computer vision

    Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: European conference on computer vision. pp. 646–661. Springer (2016)

  18. [18]

    The Kinetics Human Action Video Dataset

    Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  19. [19]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)

  20. [20]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liang, Y.S., Li, W.J.: Inflora: Interference-free low-rank adaptation for continual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23638–23647 (2024)

  21. [21]

    IEEE trans- actions on pattern analysis and machine intelligence42(10), 2684–2701 (2019)

    Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE trans- actions on pattern analysis and machine intelligence42(10), 2684–2701 (2019)

  22. [22]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer:Hierarchicalvisiontransformerusingshiftedwindows.In:Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  23. [23]

    McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks: Thesequentiallearningproblem.In:Psychologyoflearningandmotivation,vol.24, pp. 109–165. Elsevier (1989)

  24. [24]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Park, H., Park, H., Ko, J., Min, D.: Hybrid-tta: Continual test-time adaptation via dynamic domain shift detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2877–2886 (2025)

  25. [25]

    Proceedings of machine learning and systems5, 606–624 (2023)

    Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., Dean, J.: Efficiently scaling transformer inference. Proceedings of machine learning and systems5, 606–624 (2023)

  26. [26]

    Advances in neural information processing systems32(2019)

    Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., Wayne, G.: Experience replay for continual learning. Advances in neural information processing systems32(2019)

  27. [27]

    Ryali, C., Hu, Y.T., Bolya, D., Wei, C., Fan, H., Huang, P.Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., et al.: Hiera: A hierarchical vision transformerwithoutthebells-and-whistles.In:Internationalconferenceonmachine learning. pp. 29441–29454. PMLR (2023)

  28. [28]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1010–1019 (2016)

  29. [29]

    Neurocomputing568, 127063 (2024)

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing568, 127063 (2024)

  30. [30]

    In: International conference on machine learning

    Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., Hardt, M.: Test-time training with self-supervision for generalization under distribution shifts. In: International conference on machine learning. pp. 9229–9248. PMLR (2020) 15

  31. [31]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep- tion architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2818–2826 (2016)

  32. [32]

    In: Proceed- ings of the Asian Conference on Computer Vision

    Tian, J., Lyu, F.: Parameter-selective continual test-time adaptation. In: Proceed- ings of the Asian Conference on Computer Vision. pp. 1384–1400 (2024)

  33. [33]

    Advances in neural infor- mation processing systems35, 10078–10093 (2022)

    Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data- efficient learners for self-supervised video pre-training. Advances in neural infor- mation processing systems35, 10078–10093 (2022)

  34. [34]

    Three scenarios for continual learning

    Van de Ven, G.M., Tolias, A.S.: Three scenarios for continual learning. arXiv preprint arXiv:1904.07734 (2019)

  35. [35]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Wang, L., Xie, J., Zhang, X., Su, H., Zhu, J.: Hide-pet: continual learning via hierarchical decomposition of parameter-efficient tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  36. [36]

    IEEE transactions on pattern analysis and machine intelligence46(8), 5362–5383 (2024)

    Wang, L., Zhang, X., Su, H., Zhu, J.: A comprehensive survey of continual learn- ing: Theory, method and application. IEEE transactions on pattern analysis and machine intelligence46(8), 5362–5383 (2024)

  37. [37]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual test-time domain adapta- tion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7201–7211 (2022)

  38. [38]

    Journal of Machine Learning Research26(9), 1–29 (2025)

    Wang, R., Sun, Y., Tandon, A., Gandelsman, Y., Chen, X., Efros, A.A., Wang, X.: Test-time training on video streams. Journal of Machine Learning Research26(9), 1–29 (2025)

  39. [39]

    In: European conference on computer vision

    Wang, Z., Zhang, Z., Ebrahimi, S., Sun, R., Zhang, H., Lee, C.Y., Ren, X., Su, G., Perot, V., Dy, J., et al.: Dualprompt: Complementary prompting for rehearsal- free continual learning. In: European conference on computer vision. pp. 631–648. Springer (2022)

  40. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wang, Z., Zhang, Z., Lee, C.Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., Pfister, T.: Learning to prompt for continual learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 139–149 (2022)

  41. [41]

    Workshop on Distribution Shifts (DistShift), NeurIPS (2023)

    Wistuba, M., Sivaprasad, P.T., Balles, L., Zappella, G.: Continual learning with low rank adaptation. Workshop on Distribution Shifts (DistShift), NeurIPS (2023)

  42. [42]

    In: Inter- national Conference on Machine Learning (ICML) (2023)

    Zhao, H., Liu, Y., Alahi, A., Lin, T.: On pitfalls of test-time adaptation. In: Inter- national Conference on Machine Learning (ICML) (2023)

  43. [43]

    leave out

    Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017) 16 A Implementation Details A.1 Model architectures Video Action Recognition:For our action recognition model, the encoder and MAE decoderEandDare based on the VideoM...