Adaptive Compression-based Lifelong Learning

Devis Tuia; Matthew B. Blaschko; Maxim Berman; Shivangi Srivastava

arxiv: 1907.09695 · v1 · pith:2KNHSISInew · submitted 2019-07-23 · 💻 cs.CV

Shivangi Srivastava, Maxim Berman, Matthew B. Blaschko, Devis Tuia

Pith reviewed 2026-05-24 18:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: lifelong learning, catastrophic forgetting, network pruning, Bayesian optimization, adaptive compression, continual learning, deep neural networks, semantic segmentation

The pith

Bayesian optimization selects pruning rates adaptively for each new task in lifelong learning, using heavier compression on small or simple datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles catastrophic forgetting when a network is updated on new tasks without old training data. It replaces fixed pruning percentages with Bayesian optimization that chooses the compression rate from the new task's validation performance alone. The resulting schedule applies stronger pruning to small or simple datasets and gentler rates to large or complex ones, freeing parameters for later tasks while keeping earlier-task accuracy stable. Experiments on classification and semantic segmentation sequences show the approach works across varying dataset sizes and complexities.

Core claim

The method uses Bayesian optimization on the current task's data to choose a task-specific pruning fraction; this fraction is larger for small or simple datasets and smaller for large or complex ones, allowing the network to retain performance on all previously seen tasks without replay or explicit regularization from earlier stages.
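
The compression problem behind this claim appears further down this page as a quoted constrained minimization. Written out in LaTeX, it reads as below. The second line is only a hypothetical per-task reading: the symbols $p_t$, $f^{(p)}$, and $\hat{R}_{\mathrm{val},t}$ are assumptions introduced here for illustration and are not taken from the paper.

```latex
% As quoted from the paper: the smallest network whose risk stays
% within epsilon of the original network's risk.
\min_{\theta} \; \operatorname{size}(f_\theta)
\quad \text{s.t.} \quad R(f_\theta) \le R(f) + \epsilon

% Hypothetical per-task reading (assumed notation): for task t, pick the
% largest pruning fraction p whose validation risk stays within epsilon.
p_t^{\star} = \max \bigl\{\, p \in [0, 1] \;:\;
  \hat{R}_{\mathrm{val},t}\bigl(f^{(p)}\bigr) \le \hat{R}_{\mathrm{val},t}(f) + \epsilon \,\bigr\}
```

Under this reading, a small or simple dataset tolerates a larger $p_t^{\star}$ because the validation risk degrades more slowly as weights are removed.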

What carries the argument

Bayesian optimization that selects the pruning percentage for the network parameters using only the new task's training and validation sets.
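
To make the mechanism concrete, here is a minimal pure-Python sketch of Bayesian optimization over a single pruning fraction. Everything specific in it is an assumption: the Gaussian-process surrogate with an RBF kernel, the upper-confidence-bound acquisition, the search bounds `[0.0, 0.95]`, and above all `toy_score`, a made-up stand-in for "train, prune, and measure new-task validation performance". It is not the authors' implementation, and the page itself notes that the paper's objective and bounds are not specified.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel on scalars (assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_posterior(xs, ys, x, jitter=1e-4):
    """GP posterior mean and standard deviation at x given observations."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k = [rbf(xi, x) for xi in xs]
    m = sum(ys) / n  # constant prior mean
    alpha = solve(K, [y - m for y in ys])
    mu = m + sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve(K, k)
    var = max(1.0 - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
    return mu, math.sqrt(var)

def choose_pruning_fraction(score, lo=0.0, hi=0.95, n_iter=8, kappa=2.0):
    """Maximize score(p) over a 101-point grid on [lo, hi] with GP-UCB."""
    grid = [lo + (hi - lo) * i / 100 for i in range(101)]
    tried = {0, 50, 100}  # initial design: both ends and the midpoint
    xs = [grid[i] for i in sorted(tried)]
    ys = [score(x) for x in xs]
    for _ in range(n_iter):
        def ucb(i):
            mu, sd = gp_posterior(xs, ys, grid[i])
            return mu + kappa * sd
        nxt = max((i for i in range(101) if i not in tried), key=ucb)
        tried.add(nxt)
        xs.append(grid[nxt])
        ys.append(score(grid[nxt]))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

def toy_score(p):
    """Made-up objective: accuracy that degrades past p = 0.6, plus a
    small reward for pruning more. Not the paper's objective."""
    accuracy = 0.90 - 2.0 * max(0.0, p - 0.6) ** 2
    return accuracy + 0.05 * p

best_p, best_score = choose_pruning_fraction(toy_score)
```

In a real pipeline each call to the score function would involve training and pruning a network, which is why a sample-efficient optimizer is attractive here. Note that nothing in this objective looks at earlier tasks, which is exactly the premise the rest of this page questions.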

If this is right

Small networks suffice for early simple tasks, leaving more free parameters for later tasks.
Performance on previous tasks is maintained across sequences of datasets whose sizes and complexities differ.
The same adaptive schedule applies to both image classification and semantic segmentation.
No storage of old-task samples or replay buffers is required to achieve the reported stability.
Compression rate is determined automatically rather than set by hand for each new task.
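
The capacity argument in the first point can be illustrated with a short piece of arithmetic. The sketch assumes a PackNet-style scheme in which each task trains on all currently free weights and then releases a fraction `p` of them back to the free pool; the function name and the pruning fractions below are made up for illustration and are not results from the paper.

```python
def free_capacity(pruning_fractions):
    """Fraction of the network still unassigned after each task.

    Assumes each task uses all free weights and then releases a
    fraction p of them again (hypothetical bookkeeping).
    """
    free = 1.0
    history = []
    for p in pruning_fractions:
        free *= p
        history.append(free)
    return history

# Made-up schedules: a fixed 50% rate versus heavier pruning on the
# (assumed simpler) first two tasks and milder pruning on the third.
fixed = free_capacity([0.5, 0.5, 0.5])      # 0.5, 0.25, 0.125
adaptive = free_capacity([0.9, 0.8, 0.4])   # about 0.9, 0.72, 0.288
```

With these illustrative numbers, the adaptive schedule leaves more than twice as much free capacity after three tasks as the fixed one does.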

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be combined with replay or regularization methods to handle cases where validation on the new task alone is insufficient.
It may reduce memory footprint on edge devices that must learn many tasks in succession.
Testing the optimizer on non-image modalities would show whether the adaptation rule generalizes beyond vision tasks.
If the validation set for the new task is small, the chosen pruning rate may become unstable across random seeds.

Load-bearing premise

Bayesian optimization performed solely on the new task can pick a pruning rate that leaves performance on all earlier tasks intact without ever seeing their training samples.

What would settle it

On a sequence of tasks, measure whether accuracy on the first task after adaptive pruning falls below the accuracy obtained by a single fixed moderate pruning rate chosen in advance.
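
A minimal sketch of how that comparison could be scored is given below. The function names and the accuracy values are hypothetical placeholders chosen only to exercise the code; they are not measurements from the paper.

```python
def first_task_drop(acc_history):
    """Drop in first-task accuracy between the first and last measurement.

    acc_history[t] is first-task accuracy measured after task t + 1
    has been learned.
    """
    return acc_history[0] - acc_history[-1]

def adaptive_loses(adaptive_hist, fixed_hist, tol=0.0):
    """True if adaptive pruning ends below the fixed-rate baseline by more than tol."""
    return adaptive_hist[-1] < fixed_hist[-1] - tol

# Placeholder histories (NOT real results) for a three-task sequence.
adaptive_hist = [0.80, 0.80, 0.79]
fixed_hist = [0.78, 0.78, 0.78]

drop = first_task_drop(adaptive_hist)
falsified = adaptive_loses(adaptive_hist, fixed_hist)
```

Running the same check over several random seeds, with `tol` set to the observed seed-to-seed spread, would keep noise from being mistaken for forgetting.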

Figures

Figures reproduced from arXiv: 1907.09695 by Devis Tuia, Matthew B. Blaschko, Maxim Berman, Shivangi Srivastava.

**Figure 2.** Comparison of the different lifelong learning strategies in the DeepGlobe data for …

Original abstract

The problem of a deep learning model losing performance on a previously learned task when fine-tuned to a new one is a phenomenon known as catastrophic forgetting. There are two major ways to mitigate this problem: either preserving activations of the initial network during training with a new task; or restricting the new network activations to remain close to the initial ones. The latter approach falls under the denomination of lifelong learning, where the model is updated in a way that it performs well on both old and new tasks, without having access to the old task's training samples anymore. Recently, approaches like pruning networks for freeing network capacity during sequential learning of tasks have been gaining in popularity. Such approaches allow learning small networks while making redundant parameters available for the next tasks. The common problem encountered with these approaches is that the pruning percentage is hard-coded, irrespective of the number of samples, of the complexity of the learning task and of the number of classes in the dataset. We propose a method based on Bayesian optimization to perform adaptive compression/pruning of the network and show its effectiveness in lifelong learning. Our method learns to perform heavy pruning for small and/or simple datasets while using milder compression rates for large and/or complex data. Experiments on classification and semantic segmentation demonstrate the applicability of learning network compression, where we are able to effectively preserve performances along sequences of tasks of varying complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper replaces fixed pruning rates with Bayesian optimization on new-task data to adapt compression in lifelong learning, but the no-forgetting claim rests on the unverified assumption that the new-task optimum also preserves old-task performance.

The main new piece is treating the pruning percentage as a hyperparameter that Bayesian optimization tunes from the current task's training and validation data alone, rather than using a fixed rate across all tasks. This is a straightforward extension of existing pruning-based lifelong learning work, and it makes intuitive sense that small or simple datasets can tolerate heavier pruning while complex ones need milder compression. The experiments on classification and semantic segmentation sequences are the part that could be useful to people already working in this area, as they at least try to show the method across task types of varying difficulty.

The abstract claims the approach preserves performance on prior tasks, which is the central result. The soft spot is exactly where the stress-test note points: the optimization objective has no term for old-task retention, no replay buffer, and no access to previous samples. Nothing in the setup forces the chosen rate to be safe for earlier tasks, so any claim that it works for lifelong learning depends on the empirical results actually demonstrating retention. Without seeing error bars, ablations on the BO search space, or direct comparisons of old-task accuracy before and after each new task, that evidence is not yet visible. The method description treats BO as a black box, which is fine if the outcomes are reproducible, but it leaves the reader wondering how sensitive the final rates are to the search bounds.

This paper is for researchers already using pruning for continual learning who want a data-dependent way to set the rate. It is not a foundational shift, but the idea is clean enough that a serious referee could evaluate whether the experiments close the gap on the forgetting question. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes using Bayesian optimization (BO) to adaptively select per-task pruning/compression rates for neural networks in a lifelong learning sequence. The method claims to prune more aggressively on small/simple datasets and more conservatively on large/complex ones, thereby mitigating catastrophic forgetting while freeing capacity for future tasks, all without access to previous-task training samples. Effectiveness is asserted via experiments on classification and semantic segmentation tasks.

Significance. If the adaptive BO procedure reliably selects rates that preserve prior-task accuracy, the approach would provide a practical, data-driven alternative to fixed pruning percentages in continual learning. The absence of any replay buffer or explicit regularization term for old tasks would make the result particularly noteworthy if demonstrated.

major comments (2)

[Method description] Bayesian optimization for pruning rate: the objective is stated to be evaluated using only the new task's training and validation data. No term for prior-task retention, replay, or old-task validation appears in the BO search; therefore the central claim that the selected rate preserves performance across the entire sequence rests on an unverified assumption that the new-task optimum coincides with the multi-task optimum.
[Abstract] Experimental claims: the manuscript asserts that experiments on classification and semantic segmentation demonstrate effectiveness and that the method 'learns to perform heavy pruning for small and/or simple datasets,' yet supplies no quantitative accuracy numbers, error bars, ablation studies, baseline comparisons, or description of the BO objective function and search-space bounds. Without these data the load-bearing claim that adaptive rates preserve prior-task performance cannot be evaluated.

minor comments (1)

[Abstract] The abstract mentions experiments on classification and semantic segmentation but does not name the specific datasets or task sequences used; these should be stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. We address the major points below and will revise the manuscript accordingly to improve clarity and provide the requested details.

Referee: [Method description] Bayesian optimization for pruning rate: the objective is stated to be evaluated using only the new task's training and validation data. No term for prior-task retention, replay, or old-task validation appears in the BO search; therefore the central claim that the selected rate preserves performance across the entire sequence rests on an unverified assumption that the new-task optimum coincides with the multi-task optimum.

Authors: The manuscript describes the BO objective as being computed on the new task only, with no explicit replay or old-task term, as the approach relies on the pruning step itself to free capacity for future tasks while preserving prior performance through the lifelong learning setup. We agree that this leaves the multi-task preservation as an implicit assumption rather than directly optimized or validated in the search. To strengthen the paper, we will add experiments that measure retention on previous tasks after each adaptive pruning step and clarify the underlying assumption in the method section.
Revision: yes

Referee: [Abstract] Experimental claims: the manuscript asserts that experiments on classification and semantic segmentation demonstrate effectiveness and that the method 'learns to perform heavy pruning for small and/or simple datasets,' yet supplies no quantitative accuracy numbers, error bars, ablation studies, baseline comparisons, or description of the BO objective function and search-space bounds. Without these data the load-bearing claim that adaptive rates preserve prior-task performance cannot be evaluated.

Authors: We acknowledge that the provided manuscript version emphasizes the high-level approach and does not include quantitative accuracy numbers, error bars, ablation studies, baseline comparisons, or explicit details on the BO objective function and search-space bounds. This omission makes it difficult to fully assess the claims. We will revise the experimental section and abstract to incorporate these elements, including the requested quantitative results and descriptions, to support the effectiveness claims.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method relies on external optimizer and empirical results.

The paper describes a Bayesian optimization procedure that selects per-task pruning rates using only the new task's training and validation data. No equations, fitted parameters, or self-citations are presented that reduce the claimed lifelong-learning performance to a quantity defined from the same inputs by construction. The central claim rests on experimental outcomes rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that network pruning can be performed sequentially without old data while preserving prior-task accuracy, plus the modeling choice that Bayesian optimization will converge to useful rates from new-task statistics alone.

free parameters (1)

Bayesian optimization search space bounds
The ranges and acquisition function parameters for BO are not specified and must be chosen to make the adaptive pruning work.

axioms (1)

Domain assumption: pruning redundant parameters frees capacity for new tasks without destroying representations needed for old tasks when the pruning rate is chosen appropriately.
Invoked when the paper states that adaptive compression preserves performance along sequences of tasks.
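
One way this assumption can hold by construction is if weights assigned to earlier tasks are frozen, as in PackNet-style pruning (reference [20]). The sketch below illustrates that bookkeeping on a flat list of weights. It is an assumption about the mechanism, not something this page confirms for the paper, and the helper names, weights, and gradients are all made up.

```python
def masked_step(weights, owner, grads, lr=0.1):
    """Gradient step that only moves weights not owned by an earlier task."""
    return [w - lr * g if o is None else w
            for w, o, g in zip(weights, owner, grads)]

def prune_and_assign(weights, owner, task_id, fraction):
    """Zero the smallest-magnitude `fraction` of the free weights and
    assign the remaining free weights to `task_id`."""
    free = [i for i, o in enumerate(owner) if o is None]
    n_release = int(round(fraction * len(free)))
    by_magnitude = sorted(free, key=lambda i: abs(weights[i]))
    released = set(by_magnitude[:n_release])
    new_w, new_o = weights[:], owner[:]
    for i in free:
        if i in released:
            new_w[i] = 0.0       # stays free for later tasks
        else:
            new_o[i] = task_id   # frozen from now on
    return new_w, new_o

# Task 1 already owns the first two weights; the rest are free.
weights = [0.5, -0.2, 0.0, 0.0, 0.0, 0.0]
owner = [1, 1, None, None, None, None]
grads = [1.0, 1.0, -3.0, 0.5, -0.1, 2.0]  # made-up gradients for task 2

weights = masked_step(weights, owner, grads)
weights, owner = prune_and_assign(weights, owner, task_id=2, fraction=0.5)
```

In this scheme the first two weights never change, so a task-1 forward pass restricted to its own weights is unaffected by task 2 regardless of the pruning fraction chosen. The fraction then governs how much capacity is left for later tasks rather than how much earlier tasks are disturbed.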

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose a method based on Bayesian optimization to perform adaptive compression/pruning of the network... min_θ size(f_θ) s.t. R(f_θ) ≤ R(f) + ε"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the pruning percentage is hard-coded, irrespective of the number of samples, of the complexity of the learning task"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 3 internal anchors

[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
[3] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[4] Joseph Cichon and Wen-Biao Gan. Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature, 520(7546):180, 2015.
[5] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (C...
[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
[7] Peter I. Frazier. A tutorial on Bayesian optimization. CoRR, abs/1807.02811, 2018.
[8] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[10] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[11] Ryuhei Hamaguchi and Shuhei Hikosaka. Building detection from satellite imagery using ensemble of size-specific detectors. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 223–2234. IEEE, 2018.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NIPS Workshop, 2014.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International conference on learning representations, 2015.
[15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
[16] Dharshan Kumaran, Demis Hassabis, and James L McClelland. What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in cognitive sciences, 20(7):512–534, 2016.
[17] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4652–4662, 2017.
[18] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.
[19] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
[20] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[21] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
[22] James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
[23] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
[24] Fernando Nogueira. Bayesian optimization github repository, 2018. URL https://github.com/fmfn/BayesianOptimization
[25] Amal Rannen, Rahaf Aljundi, Matthew B Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1320–1328, 2017.
[26] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.
[27] Eduardo Romera, José M Alvarez, Luis M Bergasa, and Roberto Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, 2018.
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
[29] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4548–4557. PMLR, 2018.
[30] Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in neural information processing systems, pages 2310–2318, 2013.
[31] Chao Tian, Cong Li, and Jianping Shi. Dense fusion classmate network for land cover classification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 262–2624. IEEE, 2018.
[32] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[33] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2018.
[34] Lichen Zhou, Chuang Zhang, and Ming Wu. D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 182–186. IEEE, 2018.
