arxiv: 2409.02708 · v2 · pith:CYBYWRPVnew · submitted 2024-09-04 · 💻 cs.LG · stat.ME

Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit

Pith reviewed 2026-05-23 20:49 UTC · model grok-4.3

classification 💻 cs.LG stat.ME
keywords few-shot learning · multi-task learning · meta learning · subspace pursuit · invariant features · low-rank models · linear regression

The pith

Meta Subspace Pursuit learns the invariant low-rank subspace shared across multiple linear tasks with provable guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when coefficients of linear models across tasks share an invariant low-rank component, the Meta Subspace Pursuit algorithm recovers this common subspace. This setup addresses data scarcity by extracting transferable structure from related tasks before fine-tuning on limited examples. The work supplies both algorithmic convergence results and statistical error bounds under the stated model. Experiments confirm the method improves over several competing approaches in few-shot multi-task regimes.

Core claim

Under the assumption that model coefficients across tasks share an invariant low-rank component, the Meta Subspace Pursuit algorithm provably learns this invariant subspace, with both algorithmic and statistical guarantees established for the multi-task linear model setup.

What carries the argument

Meta Subspace Pursuit (Meta-SP), an iterative algorithm that identifies the common low-rank subspace shared by task coefficients.
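
Meta-SP's update rules are not reproduced on this page. As a hedged illustration of the kind of primitive that low-rank pursuit methods commonly rely on, the sketch below stacks task coefficient vectors into a d × T matrix and reads off a rank-s column space with a truncated SVD. It is a generic building block, not the authors' algorithm; the function name `top_s_subspace`, the problem sizes, and the Gaussian task weights are all assumptions made for illustration.

```python
import numpy as np

def top_s_subspace(theta_hat: np.ndarray, s: int) -> np.ndarray:
    """Return an orthonormal d x s basis for the best rank-s column space."""
    # Thin SVD; the leading s left singular vectors span the rank-s fit.
    u, _, _ = np.linalg.svd(theta_hat, full_matrices=False)
    return u[:, :s]

rng = np.random.default_rng(0)
d, s, T = 30, 3, 200                       # illustrative sizes, not the paper's
B_true, _ = np.linalg.qr(rng.standard_normal((d, s)))   # orthonormal d x s basis
alphas = rng.standard_normal((s, T))       # task-specific weights (assumed Gaussian)
theta = B_true @ alphas                    # d x T coefficient matrix, rank <= s
theta_noisy = theta + 0.01 * rng.standard_normal((d, T))

B_hat = top_s_subspace(theta_noisy, s)
# Part of the estimated basis lying outside the true subspace (spectral norm).
residual = np.linalg.norm(B_hat - B_true @ (B_true.T @ B_hat), 2)
print(f"subspace residual: {residual:.4f}")
```

In practice the coefficient matrix is not observed directly and has to be inferred from a handful of samples per task, which is the harder problem the paper addresses.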

If this is right

  • The algorithm converges to the true shared subspace as the number of tasks and samples grow.
  • Statistical rates bound the error in the recovered features and downstream task predictions.
  • The procedure yields better few-shot performance than model-agnostic alternatives on linear tasks satisfying the low-rank assumption.
  • The guarantees apply directly to the stylized multi-task linear regression setting considered.
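
To make the few-shot payoff concrete, here is a minimal sketch of the downstream step once a subspace basis is available: a new task is fit by least squares in the s projected coordinates rather than in all d. The helper `few_shot_fit`, the sizes, and the noise-free Gaussian design are assumptions for illustration, and the true basis stands in for an estimate.

```python
import numpy as np

def few_shot_fit(X, y, B_hat):
    """Least squares restricted to span(B_hat); returns a d-dimensional vector."""
    Z = X @ B_hat                                   # m x s projected features
    alpha_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return B_hat @ alpha_hat

rng = np.random.default_rng(1)
d, s, m = 50, 3, 8                                  # m is far smaller than d
B, _ = np.linalg.qr(rng.standard_normal((d, s)))    # stand-in for a learned basis
theta_new = B @ rng.standard_normal(s)              # new task lies in the subspace
X = rng.standard_normal((m, d))
y = X @ theta_new                                   # noise-free for clarity

theta_sub = few_shot_fit(X, y, B)
theta_full, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm fit in d dims

err_sub = np.linalg.norm(theta_sub - theta_new)
err_full = np.linalg.norm(theta_full - theta_new)
print(f"subspace fit error: {err_sub:.2e}, unrestricted fit error: {err_full:.2e}")
```

With only m = 8 samples the unrestricted problem is underdetermined, while the restricted one has s = 3 unknowns, which is why the quality of the recovered subspace matters for downstream prediction.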

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subspace recovery step could be inserted as a preprocessing module before applying other meta-learning updates on non-linear models.
  • If the low-rank component is only approximately shared, the method might still provide useful initializations whose quality degrades gracefully with the approximation error.
  • The same pursuit idea could be tested on sequential task arrival, where each new task refines the current estimate of the invariant subspace.

Load-bearing premise

The coefficients of the linear models for different tasks share an invariant low-rank component.
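
One common way to write this premise down is shown below as a hedged sketch. The symbols d, s, T, m, σ and B* follow the figure captions; the task weights α_t, the covariates x_{t,i} and the noise terms ε_{t,i} are notation assumed here, and the paper's exact formulation may differ.

```latex
% Assumed notation: \alpha_t, x_{t,i}, \varepsilon_{t,i} are not taken from the page.
\[
\begin{aligned}
  y_{t,i} &= x_{t,i}^{\top}\theta_t + \varepsilon_{t,i},
    \qquad t = 1,\dots,T,\quad i = 1,\dots,m, \\
  \theta_t &= B^{*}\alpha_t,
    \qquad B^{*} \in \mathbb{R}^{d \times s},\quad
    (B^{*})^{\top} B^{*} = I_s,\quad
    \alpha_t \in \mathbb{R}^{s},\quad s \ll d,
\end{aligned}
\]
% with \varepsilon_{t,i} a noise term of scale \sigma.
```

Under this reading, every task coefficient vector lies in the column space of B*, so the stacked d × T coefficient matrix has rank at most s.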

What would settle it

A controlled simulation in which the shared subspace is known in advance but Meta-SP fails to recover it at the claimed rate when sample sizes and task counts meet the paper's conditions.
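
A harness for such a test might look like the sketch below. It draws the true basis from the QR factorization of a d × d standard normal matrix and keeps the first s columns, as the paper's experimental details describe; everything else is an assumption: the Gaussian covariates and task weights, the sizes, the spectral-norm definition of the sine distance, and above all the estimator, which is a simple moment-based stand-in and not Meta-SP.

```python
import numpy as np

def sine_distance(B_true, B_hat):
    """Sine of the largest principal angle between two orthonormal bases."""
    return np.linalg.norm(B_hat - B_true @ (B_true.T @ B_hat), 2)

def simulate(d, s, T, m, sigma, rng):
    """Draw T tasks with m samples each from a shared-subspace linear model."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    B_true = Q[:, :s]                               # first s columns, d x s
    alphas = rng.standard_normal((T, s))            # assumed Gaussian task weights
    thetas = alphas @ B_true.T                      # T x d coefficient vectors
    X = rng.standard_normal((T, m, d))              # assumed Gaussian covariates
    noise = sigma * rng.standard_normal((T, m))
    y = np.einsum("tmd,td->tm", X, thetas) + noise
    return B_true, X, y

def moment_estimate(X, y, s):
    """Stand-in estimator: top-s eigenvectors of the average of y^2 * x x^T."""
    d = X.shape[-1]
    Xf, yf = X.reshape(-1, d), y.reshape(-1)
    M = (Xf * (yf ** 2)[:, None]).T @ Xf / len(yf)
    _, vecs = np.linalg.eigh(M)                     # eigenvalues in ascending order
    return vecs[:, -s:]

rng = np.random.default_rng(0)
d, s, T, m, sigma = 20, 3, 4000, 10, 1.0            # hypothetical sizes
B_true, X, y = simulate(d, s, T, m, sigma, rng)
B_hat = moment_estimate(X, y, s)
B_rand, _ = np.linalg.qr(rng.standard_normal((d, s)))   # uninformed baseline

dist_est = sine_distance(B_true, B_hat)
dist_rand = sine_distance(B_true, B_rand)
print(f"estimate: {dist_est:.3f}, random baseline: {dist_rand:.3f}")
```

Swapping `moment_estimate` for an implementation of Meta-SP and sweeping T and m would turn this into the falsification test described above.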

Figures

Figures reproduced from arXiv: 2409.02708 by Chaozhi Zhang, Lin Liu, Xiaoqun Zhang.

Figure 1. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 25 and σ = 1.
Figure 2. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 5 and σ = 1.
Figure 3. Evolution of Dist1 (left) and Dist2 (right) with the sample size m for s = 5, T = 800 and σ = 1.
Figure 4. Evolution of Dist1 (left) and Dist2 (right) with variance of noise σ for s = 5, T = 400 and m = 25.

  m     Meta-SP (Ours)   AltMin   AltMinGD   BM     NUC     ANIL
  5     12000            /        /          /      19000   35000
  6     8000             /        /          /      12000   27000
  8     4500             /        58000      7400   7200    15000
  10    2800             5100     17500      3400   4500    10000
  25    750              760      1600       760    950     2600
  50    320              320      450        320    400     900
  75    220              220      270        220    240     480
  100   160              160      200        160    180     320

Figure 5. The empirical minimum amount of (m, T) required for the sine angle distance between the estimated and the true task-invariant subspaces to be ≤ 0.1. The horizontal axis represents the value of m, while the vertical axis represents the value of T.
Figure 6. Evolution of Dist1 (left) and Dist2 (right) with iterations (6(a)) and time (6(b), unit: second) for s = 5, m = 25, T = 400 and σ = 1 in one example.
Figure 7. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 10 and σ = 1.
Figure 8. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, m = 25 and σ = 0.
Figure 9. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 25, m = 40 and σ = 1.
Figure 10. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 25, m = 25 and σ = 1.
Figure 11. Evolution of Dist1 (left) and Dist2 (right) with the number of tasks T for s = 5, T = 1600 and σ = 1.
Figure 12. Evolution of Dist1 (left) and Dist2 (right) with variance of noise σ for s = 5, T = 1600 and m = 10.
Figure 13. Evolution of Dist1 (left) and Dist2 (right) with variance of noise σ for s = 5, T = 6400 and m = 5.
Figure 14. Evolution of Dist1 and Dist2 with iterations and computational time (unit: second) for s = 5, m = 25, T = 400 and σ = 1 in three examples.
Figure 15. Evolution of Dist1 and Dist2 over iterations and computational time (unit: second) for s = 5, m = 10, T = 3200 and σ = 1 in three examples.

Original abstract

Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Meta Subspace Pursuit (Meta-SP) for few-shot multi-task linear regression where task coefficient vectors share an invariant low-rank component. Under this structural assumption, the method is claimed to recover the shared subspace with both algorithmic convergence guarantees and statistical recovery bounds; extensive experiments compare it favorably to baselines including ANIL on synthetic and real data.

Significance. If the recovery guarantees hold, the work supplies a concrete, theoretically supported algorithm for exploiting shared low-rank structure across tasks in the few-shot regime, which is a common modeling choice in multi-task and meta-learning literature. The combination of subspace pursuit with meta-learning steps and the reported empirical gains over model-agnostic baselines constitute a modest but useful contribution.

major comments (2)
  1. [§4.2, Theorem 2] The statistical recovery bound for the subspace distance appears to require the number of tasks T to grow with the ambient dimension d; this dependence is not stated explicitly in the theorem statement and may limit applicability in the few-shot multi-task setting where T is moderate.
  2. [§3.3, Eq. (8)] The alternating minimization step for the low-rank component is analyzed under exact subspace knowledge, yet the algorithm description interleaves subspace estimation with coefficient updates; the proof sketch does not quantify the error propagation from the initial subspace estimate to the final coefficients.
minor comments (2)
  1. [§2] Notation for the shared subspace U is introduced in §2 but reused without redefinition in the algorithm pseudocode; a single consistent definition would improve readability.
  2. [Figure 3] The caption does not state the number of random seeds or whether error bars are shown; adding this information would strengthen the experimental claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address the two major comments point by point below.

  1. Referee: [§4.2, Theorem 2] the statistical recovery bound for the subspace distance appears to require the number of tasks T to grow with the ambient dimension d; this dependence is not stated explicitly in the theorem statement and may limit applicability in the few-shot multi-task setting where T is moderate.

    Authors: We appreciate this observation. The statistical bound in Theorem 2 does require a scaling condition on T relative to d (specifically, T scaling at least linearly with d up to logarithmic factors to obtain the stated high-probability rate). We agree that this dependence should be stated explicitly rather than left implicit. In the revision we will update the statement of Theorem 2 to include the explicit condition on T and add a short paragraph discussing its implications for moderate T in the few-shot regime.

    Revision: yes

  2. Referee: [§3.3, Eq. (8)] the alternating minimization step for the low-rank component is analyzed under exact subspace knowledge, yet the algorithm description interleaves subspace estimation with coefficient updates; the proof sketch does not quantify the error propagation from the initial subspace estimate to the final coefficients.

    Authors: This comment correctly identifies a presentational gap. Section 3.3 analyzes the alternating minimization for the coefficients under a fixed (exact) subspace as an intermediate building block. The overall guarantee in Section 4 combines this with the subspace estimation result, but the current proof sketch does not explicitly bound how the initial subspace error propagates through the interleaved updates. We will expand the proof to include a quantitative error-propagation argument showing that the total error is controlled by the sum of the subspace estimation error and the per-iteration optimization error, under the stated assumptions.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces Meta-SP under an explicit low-rank invariant component assumption for multi-task linear models and derives algorithmic plus statistical guarantees from that stylized setup. No quoted steps reduce a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the central claims rest on the problem formulation rather than re-labeling inputs as outputs. This is the normal case of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption of shared low-rank invariant features; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Coefficients across tasks share an invariant low-rank component.
    This is the key structural assumption stated in the abstract for the multi-task models.

pith-pipeline@v0.9.0 · 5727 in / 1057 out tokens · 30174 ms · 2026-05-23T20:49:36.096898+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Near-optimal and Efficient First-Order Algorithm for Multi-Task Learning with Shared Linear Representation

    cs.LG 2026-05 unverdicted novelty 7.0

    A new first-order algorithm for multi-task learning with shared linear representation achieves near-optimal error rates in constant iterations, improving existing methods by a factor of k.
