Two-block vs. Multi-block ADMM: An empirical evaluation of convergence

Andre Goncalves; Arindam Banerjee; XiaoLi Liu

arxiv: 1907.04524 · v1 · pith:DCHRH2L4new · submitted 2019-07-10 · 📊 stat.ML · cs.LG· math.OC

Two-block vs. Multi-block ADMM: An empirical evaluation of convergence

Andre Goncalves , XiaoLi Liu , Arindam Banerjee This is my paper

Pith reviewed 2026-05-24 23:50 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC

keywords ADMMmulti-block ADMMtwo-block ADMMmulti-task learningconvergenceempirical evaluationoptimization performance

0 comments

The pith

Multi-block ADMM outperforms two-block ADMM on equivalent problems in multi-task learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the standard two-block ADMM is always comparable to sequential multi-block ADMM when both solve the same convex optimization problem. Experiments on multi-task learning tasks show that multi-block ADMM reaches better optimization values for every dual step size tested, which then produces stronger prediction results on all datasets. This outcome holds uniformly rather than in isolated cases. A sympathetic reader cares because ADMM is a common tool for large-scale data problems and the default two-block choice may not be optimal.

Core claim

Through a comprehensive set of experiments on multi-task learning problems, multi-block ADMM applied to an equivalent formulation consistently outperformed two-block ADMM on optimization performance for the entire range of dual step sizes, and this translated directly into improved prediction performance across all datasets.

What carries the argument

Direct empirical comparison of convergence behavior for two-block versus sequential multi-block variable updates on equivalent problem formulations.

If this is right

Multi-block ADMM yields better optimization performance than two-block ADMM across all tested conditions.
The optimization gains from multi-block ADMM produce measurable improvements in prediction accuracy.
The performance edge of multi-block ADMM appears for every dual step size examined.
Practitioners can gain by reformulating problems to enable multi-block ADMM rather than defaulting to two-block.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The number of blocks may affect practical convergence speed even when the underlying mathematical problem stays the same.
The same head-to-head comparison could be run on convex problems outside multi-task learning to check generality.
The empirical pattern may prompt closer theoretical study of how block partitioning influences iteration behavior.

Load-bearing premise

The multi-block version solves a problem that is mathematically equivalent to the one solved by the two-block version.

What would settle it

An experiment on any of the tested datasets or dual step sizes where two-block ADMM reaches equal or better optimization values than multi-block ADMM on the equivalent problem.

Figures

Figures reproduced from arXiv: 1907.04524 by Andre Goncalves, Arindam Banerjee, XiaoLi Liu.

**Figure 2.** Figure 2: Comparison of two-block and multi-block ADMM for TS-MTL formulation on ADNI-2. Validation nMSE [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Average curves of primal residuals over 10 executions for all three cognitive scores with longer training time [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of two-block and multi-block ADMM for TS-MTL formulation on Parkinson’s disease dataset. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Average curves of objective function, primal residuals, and validation nMSE over 10 executions on the [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of two-Block and multi-Block ADMM for TS-MTL formulation on Air Quality data. Multi-block [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Average curves of objective value, primal residuals, and validation nMSE over 10 executions for Air Quality [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Alternating Direction Method of Multipliers (ADMM) has become a widely used optimization method for convex problems, particularly in the context of data mining in which large optimization problems are often encountered. ADMM has several desirable properties, including the ability to decompose large problems into smaller tractable sub-problems and ease of parallelization, that are essential in these scenarios. The most common form of ADMM is the two-block, in which two sets of primal variables are updated alternatingly. Recent years have seen advances in multi-block ADMM, which update more than two blocks of primal variables sequentially. In this paper, we study the empirical question: {\em Is two-block ADMM always comparable with sequential multi-block ADMM solving an equivalent problem?} In the context of optimization problems arising in multi-task learning, through a comprehensive set of experiments we surprisingly show that multi-block ADMM consistently outperformed two-block ADMM on optimization performance, and as a consequence on prediction performance, across all datasets and for the entire range of dual step sizes. Our results have an important practical implication: rather than simply using the popular two-block ADMM, one may considerably benefit from experimenting with multi-block ADMM applied to an equivalent problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Multi-block ADMM beats two-block on MTL tasks in the reported experiments, but the paper needs to show the two formulations actually solve identical problems.

read the letter

The main point here is straightforward: the authors ran a bunch of experiments on multi-task learning problems and found that sequential multi-block ADMM gave better optimization and prediction results than the usual two-block version, and this held across every dataset and every dual step size they tried. That is the new empirical observation they are highlighting, and it is presented as contrary to what one might expect from the literature on ADMM convergence. The experiments appear to be reasonably broad for this kind of study, covering multiple datasets and the full range of step sizes, which gives the comparison some weight. The practical takeaway they draw is also clear: practitioners might want to try the multi-block splitting even when a two-block version is available. That part is useful and directly actionable. The soft spot is the equivalence claim. The abstract states they are comparing the methods on an equivalent problem, but the stress-test note is right to flag that this needs explicit confirmation. Without reported checks that both versions reach the same optimal objective value within tolerance or satisfy the constraints to the same degree, it remains possible that the multi-block reformulation introduces small differences in the feasible set or auxiliary variables. If that is the case, the performance gap could be partly an artifact of solving slightly different programs rather than a property of the ADMM variant itself. The paper would be stronger if it included a short verification section on this point. The work is a clean empirical comparison with no circularity or invented entities. It is aimed at people who actually implement ADMM for structured convex problems in machine learning. I would bring it to a reading group as maybe, mainly to talk through the experimental controls and the equivalence verification. I would not cite it in my own work unless that gap is closed. It deserves peer review because the question is practical, the experiments are comprehensive enough to be worth referee scrutiny, and the result could affect default choices in the subfield even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The paper empirically compares two-block and sequential multi-block ADMM on convex optimization problems from multi-task learning. It claims that, when applied to equivalent problem formulations, multi-block ADMM consistently outperforms two-block ADMM in both optimization metrics and downstream prediction performance across all tested datasets and the full range of dual step sizes, with the practical recommendation that practitioners should experiment with multi-block variants rather than defaulting to two-block ADMM.

Significance. If the claimed equivalence of the reformulations holds and the performance gap is robust, the result would have clear practical value for large-scale convex optimization in data mining: it would indicate that multi-block splittings can yield measurable gains not predicted by existing convergence theory, encouraging broader empirical exploration of block variants beyond the standard two-block form.

major comments (2)

[Abstract] Abstract and experimental sections: the central claim requires that multi-block and two-block ADMM solve identical convex programs (same objective and feasible set). No verification is reported that both variants attain the same optimal objective value (within solver tolerance) or satisfy the original constraints to the same degree across the datasets; without this check, observed differences in objective or prediction error could arise from inequivalent formulations rather than the ADMM splitting strategy itself.
[Experiments] Experimental design (throughout results sections): the assertion of consistent outperformance 'across all datasets and for the entire range of dual step sizes' is presented without reported statistical significance tests, variance estimates across random seeds, or explicit controls for implementation details (e.g., stopping tolerances, initialization, or auxiliary-variable handling). These omissions make it impossible to assess whether the reported gap is load-bearing or could be explained by confounding factors.

minor comments (2)

[Problem formulation] Notation for the multi-block splitting and the mapping back to the original variables should be clarified with an explicit statement of how the auxiliary variables are eliminated to recover the original problem.
[Figures] Figure captions and axis labels for convergence plots would benefit from explicit mention of the tolerance used and whether the plotted quantity is the original or augmented Lagrangian.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below and indicate the changes we will make.

read point-by-point responses

Referee: [Abstract] Abstract and experimental sections: the central claim requires that multi-block and two-block ADMM solve identical convex programs (same objective and feasible set). No verification is reported that both variants attain the same optimal objective value (within solver tolerance) or satisfy the original constraints to the same degree across the datasets; without this check, observed differences in objective or prediction error could arise from inequivalent formulations rather than the ADMM splitting strategy itself.

Authors: We agree that explicit verification strengthens the central claim. Although the problem reformulations are mathematically equivalent by construction, we will add a new subsection reporting final objective values and constraint violation norms achieved by both variants on every dataset to confirm they reach equivalent solutions within tolerance. revision: yes
Referee: [Experiments] Experimental design (throughout results sections): the assertion of consistent outperformance 'across all datasets and for the entire range of dual step sizes' is presented without reported statistical significance tests, variance estimates across random seeds, or explicit controls for implementation details (e.g., stopping tolerances, initialization, or auxiliary-variable handling). These omissions make it impossible to assess whether the reported gap is load-bearing or could be explained by confounding factors.

Authors: We will revise the experimental section to explicitly document the shared implementation controls (identical stopping tolerances, initialization, and auxiliary-variable handling) used for both variants. The uniform outperformance across every dataset and step size already provides evidence against confounding; we view additional random-seed variance reporting as desirable but secondary, and will include it if new runs can be performed without altering the manuscript scope. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions.

full rationale

This is an empirical evaluation paper whose central claim rests on experimental results comparing two-block and multi-block ADMM across datasets. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The equivalence of problem formulations is asserted as a premise for fair comparison but is not derived within the paper via any of the enumerated circular patterns. The study is self-contained against external benchmarks (reproducible experiments on public datasets) and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no free parameters, axioms, or invented entities are introduced or relied upon beyond standard optimization background.

pith-pipeline@v0.9.0 · 5754 in / 978 out tokens · 21195 ms · 2026-05-24T23:50:05.145281+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

[1]

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundation and Trends in Machine Learning, pages 1–122, 2011

work page 2011
[2]

Wang and A

H. Wang and A. Banerjee. Bregman Alternating Direction Method of Multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 2816–2824, 2014

work page 2014
[3]

Ye and X

G. Ye and X. Xie. Split Bregman method for large scale fused Lasso. Computational Statistics & Data Analysis, 55(4):1552–1569, 2011

work page 2011
[4]

Wahlberg, S

B. Wahlberg, S. Boyd, M. Annergren, and Y . Wang. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88, 2012

work page 2012
[5]

Yang and Y

J. Yang and Y . Zhang. Alternating direction algorithms for L1-problems in compressive sensing.SIAM Journal on Scientiﬁc Computing, 33(1):250–278, 2011

work page 2011
[6]

W. Deng, M. Lai, Z. Peng, and W. Yin. Parallel Multi-Block ADMM with O(1/ k) Convergence. Journal of Scientiﬁc Computing, 71(2):712–736, May 2017

work page 2017
[7]

B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313–340, 2012

work page 2012
[8]

Hu and J.T

E. Hu and J.T. Kwok. Efﬁcient Kernel Learning from Side Information Using ADMM. InIJCAI, pages 1408–1414, 2013

work page 2013
[9]

P. Das, N. Johnson, and A. Banerjee. Online lazy updates for portfolio selection with transaction costs. In AAAI, 2013

work page 2013
[10]

H. Wang, A. Banerjee, and Z. Luo. Parallel direction method of multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 181–189, 2014

work page 2014
[11]

C. Chen, B. He, Y . Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016

work page 2016
[12]

Y . Wang, W. Yin, and J. Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[13]

F. Wang, W. Cao, and Z. Xu. Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[14]

T. Lin, S. Ma, and S. Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015

work page 2015
[15]

X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[16]

Klee and G.J

V . Klee and G.J. Minty. How good is the simplex algorithm. Technical report, Washington University Seattle.. Dept of Mathematics, 1970

work page 1970
[17]

P Kingma and J

D. P Kingma and J. Ba. Adam: Amethod for stochastic optimization. In International Conference on Learning Representations, 2014

work page 2014
[18]

A Spielman and S

D. A Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004

work page 2004
[19]

S.S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. arXiv:1810.02054 [cs, math, stat], 2018. 13 A PREPRINT

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

S.S. Du, J.D. Lee, H. Li, L. Wang, and X. Zhai. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 [cs, math, stat], 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

D. Zou, Y . Cao, D. Zhou, and Q. Gu. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. arXiv:1811.08888 [cs, math, stat], November 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data. arXiv:1710.10174 [cs], 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

M. Hong, T. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z. Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[24]

Hong and Z

M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1):165–199, 2017

work page 2017
[25]

T. Lin, S. Ma, and S. Zhang. Iteration complexity analysis of multi-block admm for a family of convex minimization without strong convexity. Journal of Scientiﬁc Computing, 69(1):52–81, 2016

work page 2016
[26]

Parikh, S

N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014

work page 2014
[27]

M. Hong, Z. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016

work page 2016
[28]

Gonçalves, Dazhe Zhao, and Arindam Banerjee

Xiaoli Liu, Peng Cao, André R. Gonçalves, Dazhe Zhao, and Arindam Banerjee. Modeling alzheimer’s disease progression with fused laplacian sparse group lasso. ACM Trans. Knowl. Discov. Data, 12(6):65:1–65:35, 2018

work page 2018
[29]

Wasserman

L. Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, 2006

work page 2006
[30]

L. Yuan, J. Liu, and J. Ye. Efﬁcient Methods for Overlapping Group Lasso.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2104–2116, 2013

work page 2013
[31]

Y . L. Yu. Better approximation and faster algorithm using the proximal average. InAdvances in Neural Information Processing Systems (NIPS), pages 458–466, 2013

work page 2013
[32]

Y . L. Yu. On decomposing the proximal map. InAdvances in Neural Information Processing Systems (NIPS), pages 91–99, 2013

work page 2013
[33]

Argyriou, T

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008

work page 2008
[34]

J. Zhou, J. Liu, V . A. Narayan, J. Ye, and Alzheimer’s Disease Neuroimaging Initiative. Modeling disease progression via multi-task learning. NeuroImage, 78:233–248, 2013

work page 2013
[35]

J. Wan, Z. Zhang, B. D. Rao, S. Fang, J. Yan, A. J. Saykin, L. Shen, and Alzheimer’s Disease Neuroimaging Ini- tiative. Identifying the Neuroanatomical Basis of Cognitive Impairment in Alzheimer’s Disease by Correlationand Nonlinearity-Aware Sparse Bayesian Learning. IEEE Transactions on Medical Imaging, 33(7):1475–1487, 2014

work page 2014
[36]

H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen. High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer’s Disease Progression Prediction. In Advances in Neural Information Processing Systems (NIPS), 2012

work page 2012
[37]

S. Khoo, D. Petillo, U. Kang, J. H. Resau, B. Berryhill, J. Linder, L. Forsgren, L.A. Neuman, and A. Tan. Plasma- based circulating microrna biomarkers for parkinson’s disease. Journal of Parkinson’s disease, 2(4):321–331, 2012

work page 2012
[38]

Parnetti, A

L. Parnetti, A. Castrioto, D. Chiasserini, E. Persichetti, N. Tambasco, O. El-Agnaf, and P. Calabresi. Cerebrospinal ﬂuid biomarkers in parkinson disease. Nature Reviews Neurology, 9(3):131, 2013. 14

work page 2013

[1] [1]

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundation and Trends in Machine Learning, pages 1–122, 2011

work page 2011

[2] [2]

Wang and A

H. Wang and A. Banerjee. Bregman Alternating Direction Method of Multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 2816–2824, 2014

work page 2014

[3] [3]

Ye and X

G. Ye and X. Xie. Split Bregman method for large scale fused Lasso. Computational Statistics & Data Analysis, 55(4):1552–1569, 2011

work page 2011

[4] [4]

Wahlberg, S

B. Wahlberg, S. Boyd, M. Annergren, and Y . Wang. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88, 2012

work page 2012

[5] [5]

Yang and Y

J. Yang and Y . Zhang. Alternating direction algorithms for L1-problems in compressive sensing.SIAM Journal on Scientiﬁc Computing, 33(1):250–278, 2011

work page 2011

[6] [6]

W. Deng, M. Lai, Z. Peng, and W. Yin. Parallel Multi-Block ADMM with O(1/ k) Convergence. Journal of Scientiﬁc Computing, 71(2):712–736, May 2017

work page 2017

[7] [7]

B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313–340, 2012

work page 2012

[8] [8]

Hu and J.T

E. Hu and J.T. Kwok. Efﬁcient Kernel Learning from Side Information Using ADMM. InIJCAI, pages 1408–1414, 2013

work page 2013

[9] [9]

P. Das, N. Johnson, and A. Banerjee. Online lazy updates for portfolio selection with transaction costs. In AAAI, 2013

work page 2013

[10] [10]

H. Wang, A. Banerjee, and Z. Luo. Parallel direction method of multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 181–189, 2014

work page 2014

[11] [11]

C. Chen, B. He, Y . Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016

work page 2016

[12] [12]

Y . Wang, W. Yin, and J. Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [13]

F. Wang, W. Cao, and Z. Xu. Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[14] [14]

T. Lin, S. Ma, and S. Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015

work page 2015

[15] [15]

X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[16] [16]

Klee and G.J

V . Klee and G.J. Minty. How good is the simplex algorithm. Technical report, Washington University Seattle.. Dept of Mathematics, 1970

work page 1970

[17] [17]

P Kingma and J

D. P Kingma and J. Ba. Adam: Amethod for stochastic optimization. In International Conference on Learning Representations, 2014

work page 2014

[18] [18]

A Spielman and S

D. A Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004

work page 2004

[19] [19]

S.S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. arXiv:1810.02054 [cs, math, stat], 2018. 13 A PREPRINT

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

S.S. Du, J.D. Lee, H. Li, L. Wang, and X. Zhai. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 [cs, math, stat], 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[21] [21]

D. Zou, Y . Cao, D. Zhou, and Q. Gu. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. arXiv:1811.08888 [cs, math, stat], November 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[22] [22]

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data. arXiv:1710.10174 [cs], 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[23] [23]

M. Hong, T. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z. Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[24] [24]

Hong and Z

M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1):165–199, 2017

work page 2017

[25] [25]

T. Lin, S. Ma, and S. Zhang. Iteration complexity analysis of multi-block admm for a family of convex minimization without strong convexity. Journal of Scientiﬁc Computing, 69(1):52–81, 2016

work page 2016

[26] [26]

Parikh, S

N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014

work page 2014

[27] [27]

M. Hong, Z. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016

work page 2016

[28] [28]

Gonçalves, Dazhe Zhao, and Arindam Banerjee

Xiaoli Liu, Peng Cao, André R. Gonçalves, Dazhe Zhao, and Arindam Banerjee. Modeling alzheimer’s disease progression with fused laplacian sparse group lasso. ACM Trans. Knowl. Discov. Data, 12(6):65:1–65:35, 2018

work page 2018

[29] [29]

Wasserman

L. Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, 2006

work page 2006

[30] [30]

L. Yuan, J. Liu, and J. Ye. Efﬁcient Methods for Overlapping Group Lasso.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2104–2116, 2013

work page 2013

[31] [31]

Y . L. Yu. Better approximation and faster algorithm using the proximal average. InAdvances in Neural Information Processing Systems (NIPS), pages 458–466, 2013

work page 2013

[32] [32]

Y . L. Yu. On decomposing the proximal map. InAdvances in Neural Information Processing Systems (NIPS), pages 91–99, 2013

work page 2013

[33] [33]

Argyriou, T

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008

work page 2008

[34] [34]

J. Zhou, J. Liu, V . A. Narayan, J. Ye, and Alzheimer’s Disease Neuroimaging Initiative. Modeling disease progression via multi-task learning. NeuroImage, 78:233–248, 2013

work page 2013

[35] [35]

J. Wan, Z. Zhang, B. D. Rao, S. Fang, J. Yan, A. J. Saykin, L. Shen, and Alzheimer’s Disease Neuroimaging Ini- tiative. Identifying the Neuroanatomical Basis of Cognitive Impairment in Alzheimer’s Disease by Correlationand Nonlinearity-Aware Sparse Bayesian Learning. IEEE Transactions on Medical Imaging, 33(7):1475–1487, 2014

work page 2014

[36] [36]

H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen. High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer’s Disease Progression Prediction. In Advances in Neural Information Processing Systems (NIPS), 2012

work page 2012

[37] [37]

S. Khoo, D. Petillo, U. Kang, J. H. Resau, B. Berryhill, J. Linder, L. Forsgren, L.A. Neuman, and A. Tan. Plasma- based circulating microrna biomarkers for parkinson’s disease. Journal of Parkinson’s disease, 2(4):321–331, 2012

work page 2012

[38] [38]

Parnetti, A

L. Parnetti, A. Castrioto, D. Chiasserini, E. Persichetti, N. Tambasco, O. El-Agnaf, and P. Calabresi. Cerebrospinal ﬂuid biomarkers in parkinson disease. Nature Reviews Neurology, 9(3):131, 2013. 14

work page 2013