pith. sign in

arxiv: 1907.04524 · v1 · pith:DCHRH2L4new · submitted 2019-07-10 · 📊 stat.ML · cs.LG· math.OC

Two-block vs. Multi-block ADMM: An empirical evaluation of convergence

Pith reviewed 2026-05-24 23:50 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC
keywords ADMMmulti-block ADMMtwo-block ADMMmulti-task learningconvergenceempirical evaluationoptimization performance
0
0 comments X

The pith

Multi-block ADMM outperforms two-block ADMM on equivalent problems in multi-task learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the standard two-block ADMM is always comparable to sequential multi-block ADMM when both solve the same convex optimization problem. Experiments on multi-task learning tasks show that multi-block ADMM reaches better optimization values for every dual step size tested, which then produces stronger prediction results on all datasets. This outcome holds uniformly rather than in isolated cases. A sympathetic reader cares because ADMM is a common tool for large-scale data problems and the default two-block choice may not be optimal.

Core claim

Through a comprehensive set of experiments on multi-task learning problems, multi-block ADMM applied to an equivalent formulation consistently outperformed two-block ADMM on optimization performance for the entire range of dual step sizes, and this translated directly into improved prediction performance across all datasets.

What carries the argument

Direct empirical comparison of convergence behavior for two-block versus sequential multi-block variable updates on equivalent problem formulations.

If this is right

  • Multi-block ADMM yields better optimization performance than two-block ADMM across all tested conditions.
  • The optimization gains from multi-block ADMM produce measurable improvements in prediction accuracy.
  • The performance edge of multi-block ADMM appears for every dual step size examined.
  • Practitioners can gain by reformulating problems to enable multi-block ADMM rather than defaulting to two-block.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The number of blocks may affect practical convergence speed even when the underlying mathematical problem stays the same.
  • The same head-to-head comparison could be run on convex problems outside multi-task learning to check generality.
  • The empirical pattern may prompt closer theoretical study of how block partitioning influences iteration behavior.

Load-bearing premise

The multi-block version solves a problem that is mathematically equivalent to the one solved by the two-block version.

What would settle it

An experiment on any of the tested datasets or dual step sizes where two-block ADMM reaches equal or better optimization values than multi-block ADMM on the equivalent problem.

Figures

Figures reproduced from arXiv: 1907.04524 by Andre Goncalves, Arindam Banerjee, XiaoLi Liu.

Figure 1
Figure 1. Figure 1: Primal residual, validation nMSE performance, and primal objective computed on ADAS cognitive scores [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of two-block and multi-block ADMM for TS-MTL formulation on ADNI-2. Validation nMSE [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average curves of primal residuals over 10 executions for all three cognitive scores with longer training time [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of two-block and multi-block ADMM for TS-MTL formulation on Parkinson’s disease dataset. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average curves of objective function, primal residuals, and validation nMSE over 10 executions on the [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of two-Block and multi-Block ADMM for TS-MTL formulation on Air Quality data. Multi-block [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Average curves of objective value, primal residuals, and validation nMSE over 10 executions for Air Quality [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Alternating Direction Method of Multipliers (ADMM) has become a widely used optimization method for convex problems, particularly in the context of data mining in which large optimization problems are often encountered. ADMM has several desirable properties, including the ability to decompose large problems into smaller tractable sub-problems and ease of parallelization, that are essential in these scenarios. The most common form of ADMM is the two-block, in which two sets of primal variables are updated alternatingly. Recent years have seen advances in multi-block ADMM, which update more than two blocks of primal variables sequentially. In this paper, we study the empirical question: {\em Is two-block ADMM always comparable with sequential multi-block ADMM solving an equivalent problem?} In the context of optimization problems arising in multi-task learning, through a comprehensive set of experiments we surprisingly show that multi-block ADMM consistently outperformed two-block ADMM on optimization performance, and as a consequence on prediction performance, across all datasets and for the entire range of dual step sizes. Our results have an important practical implication: rather than simply using the popular two-block ADMM, one may considerably benefit from experimenting with multi-block ADMM applied to an equivalent problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper empirically compares two-block and sequential multi-block ADMM on convex optimization problems from multi-task learning. It claims that, when applied to equivalent problem formulations, multi-block ADMM consistently outperforms two-block ADMM in both optimization metrics and downstream prediction performance across all tested datasets and the full range of dual step sizes, with the practical recommendation that practitioners should experiment with multi-block variants rather than defaulting to two-block ADMM.

Significance. If the claimed equivalence of the reformulations holds and the performance gap is robust, the result would have clear practical value for large-scale convex optimization in data mining: it would indicate that multi-block splittings can yield measurable gains not predicted by existing convergence theory, encouraging broader empirical exploration of block variants beyond the standard two-block form.

major comments (2)
  1. [Abstract] Abstract and experimental sections: the central claim requires that multi-block and two-block ADMM solve identical convex programs (same objective and feasible set). No verification is reported that both variants attain the same optimal objective value (within solver tolerance) or satisfy the original constraints to the same degree across the datasets; without this check, observed differences in objective or prediction error could arise from inequivalent formulations rather than the ADMM splitting strategy itself.
  2. [Experiments] Experimental design (throughout results sections): the assertion of consistent outperformance 'across all datasets and for the entire range of dual step sizes' is presented without reported statistical significance tests, variance estimates across random seeds, or explicit controls for implementation details (e.g., stopping tolerances, initialization, or auxiliary-variable handling). These omissions make it impossible to assess whether the reported gap is load-bearing or could be explained by confounding factors.
minor comments (2)
  1. [Problem formulation] Notation for the multi-block splitting and the mapping back to the original variables should be clarified with an explicit statement of how the auxiliary variables are eliminated to recover the original problem.
  2. [Figures] Figure captions and axis labels for convergence plots would benefit from explicit mention of the tolerance used and whether the plotted quantity is the original or augmented Lagrangian.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below and indicate the changes we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental sections: the central claim requires that multi-block and two-block ADMM solve identical convex programs (same objective and feasible set). No verification is reported that both variants attain the same optimal objective value (within solver tolerance) or satisfy the original constraints to the same degree across the datasets; without this check, observed differences in objective or prediction error could arise from inequivalent formulations rather than the ADMM splitting strategy itself.

    Authors: We agree that explicit verification strengthens the central claim. Although the problem reformulations are mathematically equivalent by construction, we will add a new subsection reporting final objective values and constraint violation norms achieved by both variants on every dataset to confirm they reach equivalent solutions within tolerance. revision: yes

  2. Referee: [Experiments] Experimental design (throughout results sections): the assertion of consistent outperformance 'across all datasets and for the entire range of dual step sizes' is presented without reported statistical significance tests, variance estimates across random seeds, or explicit controls for implementation details (e.g., stopping tolerances, initialization, or auxiliary-variable handling). These omissions make it impossible to assess whether the reported gap is load-bearing or could be explained by confounding factors.

    Authors: We will revise the experimental section to explicitly document the shared implementation controls (identical stopping tolerances, initialization, and auxiliary-variable handling) used for both variants. The uniform outperformance across every dataset and step size already provides evidence against confounding; we view additional random-seed variance reporting as desirable but secondary, and will include it if new runs can be performed without altering the manuscript scope. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions.

full rationale

This is an empirical evaluation paper whose central claim rests on experimental results comparing two-block and multi-block ADMM across datasets. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The equivalence of problem formulations is asserted as a premise for fair comparison but is not derived within the paper via any of the enumerated circular patterns. The study is self-contained against external benchmarks (reproducible experiments on public datasets) and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no free parameters, axioms, or invented entities are introduced or relied upon beyond standard optimization background.

pith-pipeline@v0.9.0 · 5754 in / 978 out tokens · 21195 ms · 2026-05-24T23:50:05.145281+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

  1. [1]

    S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundation and Trends in Machine Learning, pages 1–122, 2011

  2. [2]

    Wang and A

    H. Wang and A. Banerjee. Bregman Alternating Direction Method of Multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 2816–2824, 2014

  3. [3]

    Ye and X

    G. Ye and X. Xie. Split Bregman method for large scale fused Lasso. Computational Statistics & Data Analysis, 55(4):1552–1569, 2011

  4. [4]

    Wahlberg, S

    B. Wahlberg, S. Boyd, M. Annergren, and Y . Wang. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88, 2012

  5. [5]

    Yang and Y

    J. Yang and Y . Zhang. Alternating direction algorithms for L1-problems in compressive sensing.SIAM Journal on Scientific Computing, 33(1):250–278, 2011

  6. [6]

    W. Deng, M. Lai, Z. Peng, and W. Yin. Parallel Multi-Block ADMM with O(1/ k) Convergence. Journal of Scientific Computing, 71(2):712–736, May 2017

  7. [7]

    B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313–340, 2012

  8. [8]

    Hu and J.T

    E. Hu and J.T. Kwok. Efficient Kernel Learning from Side Information Using ADMM. InIJCAI, pages 1408–1414, 2013

  9. [9]

    P. Das, N. Johnson, and A. Banerjee. Online lazy updates for portfolio selection with transaction costs. In AAAI, 2013

  10. [10]

    H. Wang, A. Banerjee, and Z. Luo. Parallel direction method of multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 181–189, 2014

  11. [11]

    C. Chen, B. He, Y . Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016

  12. [12]

    Y . Wang, W. Yin, and J. Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015

  13. [13]

    F. Wang, W. Cao, and Z. Xu. Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015

  14. [14]

    T. Lin, S. Ma, and S. Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015

  15. [15]

    X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013

  16. [16]

    Klee and G.J

    V . Klee and G.J. Minty. How good is the simplex algorithm. Technical report, Washington University Seattle.. Dept of Mathematics, 1970

  17. [17]

    P Kingma and J

    D. P Kingma and J. Ba. Adam: Amethod for stochastic optimization. In International Conference on Learning Representations, 2014

  18. [18]

    A Spielman and S

    D. A Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004

  19. [19]

    S.S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. arXiv:1810.02054 [cs, math, stat], 2018. 13 A PREPRINT

  20. [20]

    S.S. Du, J.D. Lee, H. Li, L. Wang, and X. Zhai. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 [cs, math, stat], 2018

  21. [21]

    D. Zou, Y . Cao, D. Zhou, and Q. Gu. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. arXiv:1811.08888 [cs, math, stat], November 2018

  22. [22]

    SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

    A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data. arXiv:1710.10174 [cs], 2017

  23. [23]

    M. Hong, T. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z. Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014

  24. [24]

    Hong and Z

    M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1):165–199, 2017

  25. [25]

    T. Lin, S. Ma, and S. Zhang. Iteration complexity analysis of multi-block admm for a family of convex minimization without strong convexity. Journal of Scientific Computing, 69(1):52–81, 2016

  26. [26]

    Parikh, S

    N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014

  27. [27]

    M. Hong, Z. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016

  28. [28]

    Gonçalves, Dazhe Zhao, and Arindam Banerjee

    Xiaoli Liu, Peng Cao, André R. Gonçalves, Dazhe Zhao, and Arindam Banerjee. Modeling alzheimer’s disease progression with fused laplacian sparse group lasso. ACM Trans. Knowl. Discov. Data, 12(6):65:1–65:35, 2018

  29. [29]

    Wasserman

    L. Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, 2006

  30. [30]

    L. Yuan, J. Liu, and J. Ye. Efficient Methods for Overlapping Group Lasso.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2104–2116, 2013

  31. [31]

    Y . L. Yu. Better approximation and faster algorithm using the proximal average. InAdvances in Neural Information Processing Systems (NIPS), pages 458–466, 2013

  32. [32]

    Y . L. Yu. On decomposing the proximal map. InAdvances in Neural Information Processing Systems (NIPS), pages 91–99, 2013

  33. [33]

    Argyriou, T

    A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008

  34. [34]

    J. Zhou, J. Liu, V . A. Narayan, J. Ye, and Alzheimer’s Disease Neuroimaging Initiative. Modeling disease progression via multi-task learning. NeuroImage, 78:233–248, 2013

  35. [35]

    J. Wan, Z. Zhang, B. D. Rao, S. Fang, J. Yan, A. J. Saykin, L. Shen, and Alzheimer’s Disease Neuroimaging Ini- tiative. Identifying the Neuroanatomical Basis of Cognitive Impairment in Alzheimer’s Disease by Correlationand Nonlinearity-Aware Sparse Bayesian Learning. IEEE Transactions on Medical Imaging, 33(7):1475–1487, 2014

  36. [36]

    H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen. High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer’s Disease Progression Prediction. In Advances in Neural Information Processing Systems (NIPS), 2012

  37. [37]

    S. Khoo, D. Petillo, U. Kang, J. H. Resau, B. Berryhill, J. Linder, L. Forsgren, L.A. Neuman, and A. Tan. Plasma- based circulating microrna biomarkers for parkinson’s disease. Journal of Parkinson’s disease, 2(4):321–331, 2012

  38. [38]

    Parnetti, A

    L. Parnetti, A. Castrioto, D. Chiasserini, E. Persichetti, N. Tambasco, O. El-Agnaf, and P. Calabresi. Cerebrospinal fluid biomarkers in parkinson disease. Nature Reviews Neurology, 9(3):131, 2013. 14