Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit

Chaozhi Zhang; Lin Liu; Xiaoqun Zhang

arxiv: 2409.02708 · v2 · pith:CYBYWRPVnew · submitted 2024-09-04 · 💻 cs.LG · stat.ME

Pith reviewed 2026-05-23 20:49 UTC · model grok-4.3

classification 💻 cs.LG stat.ME

keywords: few-shot learning · multi-task learning · meta learning · subspace pursuit · invariant features · low-rank models · linear regression

The pith

Meta Subspace Pursuit learns the invariant low-rank subspace shared across multiple linear tasks with provable guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when coefficients of linear models across tasks share an invariant low-rank component, the Meta Subspace Pursuit algorithm recovers this common subspace. This setup addresses data scarcity by extracting transferable structure from related tasks before fine-tuning on limited examples. The work supplies both algorithmic convergence results and statistical error bounds under the stated model. Experiments confirm the method improves over several competing approaches in few-shot multi-task regimes.
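The fine-tuning step mentioned above can be illustrated with a minimal sketch. It assumes an orthonormal basis `B_hat` of the shared subspace is already available and fits a new task by ordinary least squares in the reduced coordinates. The function name `fit_new_task`, the dimensions, and the use of the true basis as an oracle are hypothetical choices for illustration, not the paper's procedure.

```python
import numpy as np

def fit_new_task(B_hat, X, y):
    """Least squares restricted to an estimated shared subspace (illustrative).

    B_hat : (d, s) matrix with orthonormal columns spanning the estimated subspace
    X     : (m, d) covariates of the new task
    y     : (m,) responses of the new task
    Returns a d-dimensional coefficient estimate that lies in span(B_hat).
    """
    Z = X @ B_hat                                   # reduced covariates, shape (m, s)
    alpha_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return B_hat @ alpha_hat

rng = np.random.default_rng(0)
d, s, m = 50, 5, 10                                 # m < d: least squares in d dims is underdetermined
B_true, _ = np.linalg.qr(rng.standard_normal((d, s)))
theta = B_true @ rng.standard_normal(s)             # task coefficient inside the subspace
X = rng.standard_normal((m, d))
y = X @ theta                                       # noiseless responses, for illustration only

theta_hat = fit_new_task(B_true, X, y)              # oracle subspace (assumption)
err = np.linalg.norm(theta_hat - theta)
```

With the subspace known, only s numbers have to be estimated from the m examples, which is why m = 10 suffices here even though d = 50.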

Core claim

Under the assumption that model coefficients across tasks share an invariant low-rank component, the Meta Subspace Pursuit algorithm provably learns this invariant subspace, with both algorithmic and statistical guarantees established for the multi-task linear model setup.

What carries the argument

Meta Subspace Pursuit (Meta-SP), an iterative algorithm that identifies the common low-rank subspace shared by task coefficients.

If this is right

The algorithm converges to the true shared subspace as the numbers of tasks and samples grow.
Statistical rates bound the error in the recovered features and downstream task predictions.
The procedure yields better few-shot performance than model-agnostic alternatives on linear tasks satisfying the low-rank assumption.
The guarantees apply directly to the stylized multi-task linear regression setting considered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The subspace recovery step could be inserted as a preprocessing module before applying other meta-learning updates on non-linear models.
If the low-rank component is only approximately shared, the method might still provide useful initializations whose quality degrades gracefully with the approximation error.
The same pursuit idea could be tested on sequential task arrival, where each new task refines the current estimate of the invariant subspace.

Load-bearing premise

The coefficients of the linear models for different tasks share an invariant low-rank component.
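One hedged way to write this premise down uses the symbols that appear elsewhere on this page (d for the ambient dimension, s for the subspace dimension, T tasks, m samples per task, B∗ for the task-invariant matrix). The task-specific vectors α_t, the covariates x_{t,i} and the noise ε_{t,i} are notation assumed here for illustration and may differ from the paper's:

```latex
\begin{aligned}
y_{t,i} &= x_{t,i}^{\top}\theta_t + \varepsilon_{t,i}, && t = 1,\dots,T,\quad i = 1,\dots,m,\\
\theta_t &= B^{*}\alpha_t, && B^{*}\in\mathbb{R}^{d\times s},\quad (B^{*})^{\top}B^{*} = I_s,\quad \alpha_t\in\mathbb{R}^{s},\\
\Theta &= [\theta_1,\dots,\theta_T] = B^{*}[\alpha_1,\dots,\alpha_T] && \Longrightarrow\quad \operatorname{rank}(\Theta)\le s .
\end{aligned}
```

Under this reading, each task has only s unknowns once B∗ is known, rather than d, which is what makes the few-shot regime tractable.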

What would settle it

A controlled simulation in which the shared subspace is known in advance but Meta-SP fails to recover it at the claimed rate when sample sizes and task counts meet the paper's conditions.
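A minimal sketch of such a controlled simulation follows. It does not implement Meta-SP: as a stand-in estimator it uses a simple method-of-moments baseline (top-s eigenvectors of the average of y² x xᵀ), so it only illustrates the shape of the test harness. The Gaussian design, the parameter values, and the function name `simulate` are assumptions made for illustration.

```python
import numpy as np

def simulate(T, m, d=10, s=2, sigma=0.5, seed=0):
    """Draw T linear tasks sharing a known subspace and report the recovery error."""
    rng = np.random.default_rng(seed)
    # Known ground truth: first s columns of the Q factor of a d x d Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    B_true = Q[:, :s]
    alpha = rng.standard_normal((T, s))              # task-specific coordinates
    theta = alpha @ B_true.T                         # (T, d) task coefficients
    X = rng.standard_normal((T, m, d))               # Gaussian covariates
    y = np.einsum("tmd,td->tm", X, theta) + sigma * rng.standard_normal((T, m))

    # Stand-in estimator (NOT Meta-SP): method of moments on pooled samples.
    Xf, yf = X.reshape(-1, d), y.reshape(-1)
    M = (Xf * (yf ** 2)[:, None]).T @ Xf / yf.size
    _, eigvecs = np.linalg.eigh(M)                   # eigenvalues in ascending order
    B_hat = eigvecs[:, -s:]                          # top-s eigenvectors

    # Sine of the largest principal angle: part of B_true outside span(B_hat).
    resid = B_true - B_hat @ (B_hat.T @ B_true)
    return float(np.linalg.norm(resid, 2))

dist_few = simulate(T=20, m=5)        # few tasks
dist_many = simulate(T=20000, m=5)    # many tasks, same per-task sample size
```

Replacing the stand-in estimator with an implementation of Meta-SP and sweeping (m, T) over the ranges covered by the paper's conditions would turn this harness into the falsification test described above.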

Figures

Figures reproduced from arXiv: 2409.02708 by Chaozhi Zhang, Lin Liu, Xiaoqun Zhang.

**Figure 2.** Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 5 and σ = 1. Accompanying text: Notably, ANIL, designed to model-agnostically identify the meta-representation, was consistently outperformed by the other methods under the squared distances induced by the matrix Frobenius norm.

**Figure 3.** Evolution of Dist1 (left) and Dist2 (right) with the sample size m for s = 5, T = 800 and σ = 1.

**Figure 4.** Evolution of Dist1 (left) and Dist2 (right) with variance of noise σ for s = 5, T = 400 and m = 25.

Accompanying table (rows: m; columns: methods):

| m | Meta-SP (Ours) | AltMin | AltMinGD | BM | NUC | ANIL |
|---|---|---|---|---|---|---|
| 5 | 12000 | / | / | / | 19000 | 35000 |
| 6 | 8000 | / | / | / | 12000 | 27000 |
| 8 | 4500 | / | 58000 | 7400 | 7200 | 15000 |
| 10 | 2800 | 5100 | 17500 | 3400 | 4500 | 10000 |
| 25 | 750 | 760 | 1600 | 760 | 950 | 2600 |
| 50 | 320 | 320 | 450 | 320 | 400 | 900 |
| 75 | 220 | 220 | 270 | 220 | 240 | 480 |
| 100 | 160 | 160 | 200 | 160 | 180 | 320 |

**Figure 5.** The empirical minimum amount of (m, T) required for the sine angle distance between the estimated and the true task-invariant subspaces to be ≤ 0.1. The horizontal axis represents the value of m, while the vertical axis represents the value of T.

**Figure 6.** Evolution of Dist1 (left) and Dist2 (right) with iterations (6(a)) and time (6(b), unit: second) for s = 5, m = 25, T = 400 and σ = 1 in one example. Accompanying text: (3) Take the logarithm of the CO and NO2 values and standardize each dimension in every task. (4) Take the logarithm of PM2.5 values. We assume that the coefficients of all tasks lie in an r-dimensional space.

**Figure 7.** Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 10 and σ = 1.

**Figure 8.** Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 25 and σ = 0.

**Figure 9.** Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 25, m = 40 and σ = 1.

**Figure 10.** Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 25, m = 25 and σ = 1. Accompanying text (B.2 Experimental Details): In all the experiments, we generate the true task-invariant matrix B∗ by first QR-factorizing a d×d matrix with elements sampled i.i.d. from the standard normal N(0, 1), and then retrieving the first s columns.

**Figure 11.** Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, T = 1600 and σ = 1.

**Figure 12.** Evolution of Dist1 (left) and Dist2 (right) with variance of noise σ for s = 5, T = 1600 and m = 10.

**Figure 13.** Evolution of Dist1 (left) and Dist2 (right) with variance of noise σ for s = 5, T = 6400 and m = 5.

**Figure 14.** Evolution of Dist1 and Dist2 with iterations and computational time (unit: second) for s = 5, m = 25, T = 400 and σ = 1 in three examples.

**Figure 15.** Evolution of Dist1 and Dist2 over iterations and computational time (unit: second) for s = 5, m = 10, T = 3200 and σ = 1 in three examples.
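The Figure 5 caption thresholds a sine angle distance between the estimated and the true subspaces at 0.1. A minimal sketch of one standard way to compute such a distance is given here, as the sine of the largest principal angle between two column spaces of equal dimension. Whether this matches the paper's exact definition is an assumption, and the function name `sin_theta_distance` is hypothetical.

```python
import numpy as np

def sin_theta_distance(B_hat, B_true):
    """Sine of the largest principal angle between two column spaces.

    Both inputs are (d, s) matrices with the same s and full column rank.
    """
    Q1, _ = np.linalg.qr(B_hat)                      # orthonormal basis of span(B_hat)
    Q2, _ = np.linalg.qr(B_true)                     # orthonormal basis of span(B_true)
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cos_min = np.clip(cosines.min(), 0.0, 1.0)       # guard against rounding above 1
    return float(np.sqrt(1.0 - cos_min ** 2))

e = np.eye(3)
same = sin_theta_distance(e[:, :2], e[:, [1, 0]])    # same plane, basis vectors swapped
orth = sin_theta_distance(e[:, :1], e[:, 1:2])       # two orthogonal lines
tilt = sin_theta_distance(e[:, :1], (e[:, :1] + e[:, 1:2]) / np.sqrt(2))  # lines 45 degrees apart
```

The distance depends only on the spans, not on the particular bases, so it is a suitable yardstick for comparing subspace estimators. Computing the sine from the cosine loses precision for very small angles, so a residual-based formula may be preferable when distances near zero matter.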

Original abstract

Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Meta-SP adds a subspace pursuit algorithm with claimed recovery guarantees for linear multi-task models under a shared low-rank invariant component, but the contribution stays inside that specific assumption.

The paper introduces Meta Subspace Pursuit for few-shot multi-task linear regression where task coefficients share an invariant low-rank part. It claims both algorithmic convergence and statistical recovery of that subspace, plus experiments beating ANIL and other baselines in the controlled setting. That is the core new piece: a named method with explicit guarantees tied to the low-rank structure rather than another heuristic meta-learner.

The work sits cleanly in the line of papers that exploit shared structure across tasks, and giving both algorithmic and statistical bounds is better than the usual empirical-only meta-learning write-up. The experiments are at least run against relevant competitors.

The soft spot is the assumption itself. Everything rests on the tasks sharing a low-rank invariant component; if that does not hold even approximately, the guarantees and the method have no footing, and the paper does not test sensitivity to mild violations. Because the models are strictly linear, it is also unclear how much carries over to the nonlinear regimes that dominate current practice. The numerical results are described only at a high level, so it is hard to judge whether the data generation enforces the assumption or probes realistic departures from it.

This is for people working on theoretical multi-task and meta-learning with linear models and explicit structural assumptions. A reader in that niche can extract the algorithm and the proof strategy. It deserves a serious referee because the claims are concrete enough to check, even if the scope is limited.

Referee Report

2 major / 2 minor

Summary. The paper proposes Meta Subspace Pursuit (Meta-SP) for few-shot multi-task linear regression where task coefficient vectors share an invariant low-rank component. Under this structural assumption, the method is claimed to recover the shared subspace with both algorithmic convergence guarantees and statistical recovery bounds; extensive experiments compare it favorably to baselines including ANIL on synthetic and real data.

Significance. If the recovery guarantees hold, the work supplies a concrete, theoretically supported algorithm for exploiting shared low-rank structure across tasks in the few-shot regime, which is a common modeling choice in multi-task and meta-learning literature. The combination of subspace pursuit with meta-learning steps and the reported empirical gains over model-agnostic baselines constitute a modest but useful contribution.

major comments (2)

[§4.2, Theorem 2] The statistical recovery bound for the subspace distance appears to require the number of tasks T to grow with the ambient dimension d; this dependence is not stated explicitly in the theorem statement and may limit applicability in the few-shot multi-task setting where T is moderate.
[§3.3, Eq. (8)] The alternating minimization step for the low-rank component is analyzed under exact subspace knowledge, yet the algorithm description interleaves subspace estimation with coefficient updates; the proof sketch does not quantify the error propagation from the initial subspace estimate to the final coefficients.

minor comments (2)

[§2] Notation for the shared subspace U is introduced in §2 but reused without redefinition in the algorithm pseudocode; a single consistent definition would improve readability.
[Figure 3] Figure 3 caption does not specify the number of random seeds or error bars; adding this information would strengthen the experimental claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address the two major comments point by point below.

Referee: [§4.2, Theorem 2] the statistical recovery bound for the subspace distance appears to require the number of tasks T to grow with the ambient dimension d; this dependence is not stated explicitly in the theorem statement and may limit applicability in the few-shot multi-task setting where T is moderate.

Authors: We appreciate this observation. The statistical bound in Theorem 2 does require a scaling condition on T relative to d (specifically, T scaling at least linearly with d up to logarithmic factors to obtain the stated high-probability rate). We agree that this dependence should be stated explicitly rather than left implicit. In the revision we will update the statement of Theorem 2 to include the explicit condition on T and add a short paragraph discussing its implications for moderate T in the few-shot regime.

Revision: yes

Referee: [§3.3, Eq. (8)] the alternating minimization step for the low-rank component is analyzed under exact subspace knowledge, yet the algorithm description interleaves subspace estimation with coefficient updates; the proof sketch does not quantify the error propagation from the initial subspace estimate to the final coefficients.

Authors: This comment correctly identifies a presentational gap. Section 3.3 analyzes the alternating minimization for the coefficients under a fixed (exact) subspace as an intermediate building block. The overall guarantee in Section 4 combines this with the subspace estimation result, but the current proof sketch does not explicitly bound how the initial subspace error propagates through the interleaved updates. We will expand the proof to include a quantitative error-propagation argument showing that the total error is controlled by the sum of the subspace estimation error and the per-iteration optimization error, under the stated assumptions.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces Meta-SP under an explicit low-rank invariant component assumption for multi-task linear models and derives algorithmic plus statistical guarantees from that stylized setup. No quoted steps reduce a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the central claims rest on the problem formulation rather than re-labeling inputs as outputs. This is the normal case of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption of shared low-rank invariant features; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Coefficients across tasks share an invariant low-rank component.
This is the key structural assumption stated in the abstract for the multi-task models.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Near-optimal and Efficient First-Order Algorithm for Multi-Task Learning with Shared Linear Representation
cs.LG · 2026-05 · unverdicted · novelty 7.0

A new first-order algorithm for multi-task learning with shared linear representation achieves near-optimal error rates in constant iterations, improving existing methods by a factor of k.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Multitask learning via shared features: Algorithms and hardness

Konstantina Bairaktari, Guy Blanc, Li-Yang Tan, Jonathan Ullman, and Lydia Zakynthinou. Multitask learning via shared features: Algorithms and hardness. In The Thirty Sixth Annual Conference on Learning Theory , pages 747--772. PMLR, 2023

work page 2023
[2]

Iterative hard thresholding for compressed sensing

Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis , 27(3):265--274, 2009

work page 2009
[3]

Trace norm regularization for multi-task learning with scarce data

Etienne Boursier, Mikhail Konobeev, and Nicolas Flammarion. Trace norm regularization for multi-task learning with scarce data. In Conference on Learning Theory , pages 1303--1327. PMLR, 2022

work page 2022
[4]

Concentration inequalities

St \'e phane Boucheron, G \'a bor Lugosi, and Olivier Bousquet. Concentration inequalities. In Summer School on Machine Learning , pages 208--240. Springer, 2003

work page 2003
[5]

A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming , 95(2):329--357, 2003

work page 2003
[6]

A systematic review on data scarcity problem in deep learning: solution and applications

Aayushi Bansal, Rewa Sharma, and Mamta Kathuria. A systematic review on data scarcity problem in deep learning: solution and applications. ACM Computing Surveys (CSUR) , 54(10s):1--29, 2022

work page 2022
[7]

Statistics for high-dimensional data: methods, theory and applications

Peter B \"u hlmann and Sara van de Geer. Statistics for high-dimensional data: methods, theory and applications . Springer Science & Business Media, 2011

work page 2011
[8]

A singular value thresholding algorithm for matrix completion

Jian-Feng Cai, Emmanuel J Cand \`e s, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization , 20(4):1956--1982, 2010

work page 1956
[9]

An iterative hard thresholding estimator for low rank matrix recovery with explicit limiting distribution

Alexandra Carpentier and Arlene KH Kim. An iterative hard thresholding estimator for low rank matrix recovery with explicit limiting distribution. Statistica Sinica , 28(3):1371--1393, 2018

work page 2018
[10]

Polynomial time guarantees for the B urer- M onteiro method

Diego Cifuentes and Ankur Moitra. Polynomial time guarantees for the B urer- M onteiro method. In Proceedings of the 36th International Conference on Neural Information Processing Systems , pages 23923--23935, 2022

work page 2022
[11]

MAML and ANIL provably learn representations

Liam Collins, Aryan Mokhtari, Sewoong Oh, and Sanjay Shakkottai. MAML and ANIL provably learn representations. In International Conference on Machine Learning , pages 4238--4310. PMLR, 2022

work page 2022
[12]

_1 -magic: Recovery of sparse signals via convex programming

Emmanuel Candes and Justin Romberg. _1 -magic: Recovery of sparse signals via convex programming. Technical report, California Institute of Technology, 2005

work page 2005
[13]

Decoding by linear programming

Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory , 51(12):4203--4215, 2005

work page 2005
[14]

Few-shot learning via learning the representation, provably

Simon Shaolei Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations , 2020

work page 2020
[15]

Universality of approximate message passing with semirandom matrices

Rishabh Dudeja, Yue M Lu, and Subhabrata Sen. Universality of approximate message passing with semirandom matrices. The Annals of Probability , 51(5):1616--1683, 2023

work page 2023
[16]

Subspace pursuit for compressive sensing signal reconstruction

Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory , 55(5):2230--2249, 2009

work page 2009
[17]

High dimensional robust M -estimation: Asymptotic variance via approximate message passing

David Donoho and Andrea Montanari. High dimensional robust M -estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields , 166(3):935--969, 2016

work page 2016
[18]

Adaptive and robust multi-task learning

Yaqi Duan and Kaizheng Wang. Adaptive and robust multi-task learning. The Annals of Statistics , 51(5):2015--2039, 2023

work page 2015
[19]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning , pages 1126--1135. PMLR, 2017

work page 2017
[20]

Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems

M \'a rio AT Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing , 1(4):586--597, 2007

work page 2007
[21]

Hard thresholding pursuit: an algorithm for compressive sensing

Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis , 49(6):2543--2563, 2011

work page 2011
[22]

Probabilistic model-agnostic meta-learning

Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems , pages 9537--9548, 2018

work page 2018
[23]

Convergence of fixed-point continuation algorithms for matrix rank minimization

Donald Goldfarb and Shiqian Ma. Convergence of fixed-point continuation algorithms for matrix rank minimization. Foundations of Computational Mathematics , 11(2):183--210, 2011

work page 2011
[24]

Overcoming data scarcity with transfer learning

Maxwell L Hutchinson, Erin Antono, Brenna M Gibbons, Sean Paradiso, Julia Ling, and Bryce Meredig. Overcoming data scarcity with transfer learning. arXiv preprint arXiv:1711.05099 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Universality of regularized regression estimators in high dimensions

Qiyang Han and Yandi Shen. Universality of regularized regression estimators in high dimensions. The Annals of Statistics , 51(4):1799--1823, 2023

work page 2023
[26]

Fixed-point continuation for _1 -minimization: Methodology and convergence

Elaine T Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for _1 -minimization: Methodology and convergence. SIAM Journal on Optimization , 19(3):1107--1130, 2008

work page 2008
[27]

Meta-learning with generalized ridge regression: High-dimensional asymptotics, optimality and hyper-covariance estimation

Yanhao Jin, Krishnakumar Balasubramanian, and Debashis Paul. Meta-learning with generalized ridge regression: High-dimensional asymptotics, optimality and hyper-covariance estimation. arXiv preprint arXiv:2403.19720 , 2024

work page arXiv 2024
[28]

Guaranteed rank minimization via singular value projection

Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed rank minimization via singular value projection. Advances in Neural Information Processing Systems , 23, 2010

work page 2010
[29]

Meta-learning for mixed linear regression

Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh. Meta-learning for mixed linear regression. In International Conference on Machine Learning , pages 5394--5404. PMLR, 2020

work page 2020
[30]

Efficient and guaranteed rank minimization by atomic decomposition

Kiryung Lee and Yoram Bresler. Efficient and guaranteed rank minimization by atomic decomposition. In 2009 IEEE International Symposium on Information Theory , pages 314--318. IEEE, 2009

work page 2009
[31]

Admira: Atomic decomposition for minimum rank approximation

Kiryung Lee and Yoram Bresler. Admira: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory , 56(9):4402--4416, 2010

work page 2010
[32]

Transformers as algorithms: Generalization and stability in in-context learning

Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning , pages 19565--19594. PMLR, 2023

work page 2023
[33]

Improved bounds for multi-task learning with trace norm regularization

Weiwei Liu. Improved bounds for multi-task learning with trace norm regularization. In The Thirty Sixth Annual Conference on Learning Theory , pages 700--714. PMLR, 2023

work page 2023
[34]

On the limited memory BFGS method for large scale optimization

Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming , 45(1-3):503--528, 1989

work page 1989
[35]

Understanding optimal feature transfer via a fine-grained bias-variance analysis

Yufan Li, Subhabrata Sen, and Ben Adlam. Understanding optimal feature transfer via a fine-grained bias-variance analysis. arXiv preprint arXiv:2404.12481 , 2024

work page arXiv 2024
[36]

Low-rank semidefinite programming: Theory and applications

Alex Lemon, Anthony Man-Cho So, and Yinyu Ye. Low-rank semidefinite programming: Theory and applications. Foundations and Trends in Optimization , 2(1-2):1--156, 2016

work page 2016
[37]

Interior-point method for nuclear norm approximation with application to system identification

Zhang Liu and Lieven Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications , 31(3):1235--1256, 2010

work page 2010
[38]

Multi-target QSAR modelling in the analysis and design of HIV - HCV co-inhibitors: an in-silico study

Qi Liu, Han Zhou, Lin Liu, Xi Chen, Ruixin Zhu, and Zhiwei Cao. Multi-target QSAR modelling in the analysis and design of HIV - HCV co-inhibitors: an in-silico study. BMC Bioinformatics , 12(1):1--20, 2011

work page 2011
[39]

Sparse principal component analysis and iterative thresholding

Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics , 41(2):772--801, 2013

work page 2013
[40]

Coherence analysis of iterative thresholding algorithms

Arian Maleki. Coherence analysis of iterative thresholding algorithms. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pages 236--243. IEEE, 2009

work page 2009
[41]

Fixed point and B regman iterative methods for matrix rank minimization

Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and B regman iterative methods for matrix rank minimization. Mathematical Programming , 128(1-2):321--353, 2011

work page 2011
[42]

The benefit of multitask representation learning

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research , 17(81):1--32, 2016

work page 2016
[43]

Sparse approximate solutions to linear systems

Balas K Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing , 24(2):227--234, 1995

work page 1995
[44]

Co S a MP : Iterative signal recovery from incomplete and inaccurate samples

Deanna Needell and Joel A Tropp. Co S a MP : Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis , 26(3):301--321, 2009

work page 2009
[45]

Invariant models for causal transfer learning

Mateo Rojas-Carulla, Bernhard Sch \"o lkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research , 19(36):1--34, 2018

work page 2018
[46]

Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization

Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review , 52(3):471--501, 2010

work page 2010
[47]

Rapid learning or feature reuse? T owards understanding the effectiveness of MAML

Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? T owards understanding the effectiveness of MAML . In International Conference on Learning Representations , 2020

work page 2020
[48]

Matrix perturbation theory

Gilbert W Stewart and Ji-Guang Sun. Matrix perturbation theory . Academic Press, 1990

work page 1990
[49]

A framework to characterize performance of LASSO algorithms

Mihailo Stojnic. A framework to characterize performance of lasso algorithms. arXiv preprint arXiv:1303.7291 , 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[50]

Precise error analysis of regularized M -estimators in high dimensions

Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized M -estimators in high dimensions. IEEE Transactions on Information Theory , 64(8):5592--5628, 2018

work page 2018
[51]

Learning from similar linear representations: Adaptivity, minimaxity, and robustness

Ye Tian, Yuqi Gu, and Yang Feng. Learning from similar linear representations: Adaptivity, minimaxity, and robustness. arXiv preprint arXiv:2303.17765 , 2023

work page arXiv 2023
[52]

Provable meta-learning of linear representations

Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning , pages 10434--10443. PMLR, 2021

work page 2021
[53]

Statistically and computationally efficient linear meta-representation learning

Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Statistically and computationally efficient linear meta-representation learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems , pages 18487--18500, 2021

work page 2021
[54]

An introduction to matrix concentration inequalities

Joel A Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning , 8(1-2):1--230, 2015

work page 2015
[55]

On the conditions used to prove oracle results for the L asso

Sara A van de Geer and Peter B \"u hlmann. On the conditions used to prove oracle results for the L asso. Electronic Journal of Statistics , 3:1360--1392, 2009

work page 2009
[56]

An automatic inequality prover and instance optimal identity testing

Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing , 46(1):429--455, 2017

work page 2017
[57]

First-order ANIL provably learns representations despite overparametrisation

O g uz Y \"u ksel, Etienne Boursier, and Nicolas Flammarion. First-order ANIL provably learns representations despite overparametrisation. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning , 2023

[58]

Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing

Wotao Yin, Stanley Osher, Donald Goldfarb, and Jerome Darbon. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143--168, 2008

[59]

A survey on multi-task learning

Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586--5609, 2021

