Two-block vs. Multi-block ADMM: An empirical evaluation of convergence
Pith reviewed 2026-05-24 23:50 UTC · model grok-4.3
The pith
Multi-block ADMM outperforms two-block ADMM on equivalent problems in multi-task learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through a comprehensive set of experiments on multi-task learning problems, multi-block ADMM applied to an equivalent formulation consistently outperformed two-block ADMM on optimization performance for the entire range of dual step sizes, and this translated directly into improved prediction performance across all datasets.
What carries the argument
Direct empirical comparison of convergence behavior for two-block versus sequential multi-block variable updates on equivalent problem formulations.
If this is right
- Multi-block ADMM yields better optimization performance than two-block ADMM across all tested conditions.
- The optimization gains from multi-block ADMM produce measurable improvements in prediction accuracy.
- The performance edge of multi-block ADMM appears for every dual step size examined.
- Practitioners can gain by reformulating problems to enable multi-block ADMM rather than defaulting to two-block.
Where Pith is reading between the lines
- The number of blocks may affect practical convergence speed even when the underlying mathematical problem stays the same.
- The same head-to-head comparison could be run on convex problems outside multi-task learning to check generality.
- The empirical pattern may prompt closer theoretical study of how block partitioning influences iteration behavior.
Load-bearing premise
The multi-block version solves a problem that is mathematically equivalent to the one solved by the two-block version.
What would settle it
An experiment on any of the tested datasets or dual step sizes where two-block ADMM reaches equal or better optimization values than multi-block ADMM on the equivalent problem.
Figures
read the original abstract
Alternating Direction Method of Multipliers (ADMM) has become a widely used optimization method for convex problems, particularly in the context of data mining in which large optimization problems are often encountered. ADMM has several desirable properties, including the ability to decompose large problems into smaller tractable sub-problems and ease of parallelization, that are essential in these scenarios. The most common form of ADMM is the two-block, in which two sets of primal variables are updated alternatingly. Recent years have seen advances in multi-block ADMM, which update more than two blocks of primal variables sequentially. In this paper, we study the empirical question: {\em Is two-block ADMM always comparable with sequential multi-block ADMM solving an equivalent problem?} In the context of optimization problems arising in multi-task learning, through a comprehensive set of experiments we surprisingly show that multi-block ADMM consistently outperformed two-block ADMM on optimization performance, and as a consequence on prediction performance, across all datasets and for the entire range of dual step sizes. Our results have an important practical implication: rather than simply using the popular two-block ADMM, one may considerably benefit from experimenting with multi-block ADMM applied to an equivalent problem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically compares two-block and sequential multi-block ADMM on convex optimization problems from multi-task learning. It claims that, when applied to equivalent problem formulations, multi-block ADMM consistently outperforms two-block ADMM in both optimization metrics and downstream prediction performance across all tested datasets and the full range of dual step sizes, with the practical recommendation that practitioners should experiment with multi-block variants rather than defaulting to two-block ADMM.
Significance. If the claimed equivalence of the reformulations holds and the performance gap is robust, the result would have clear practical value for large-scale convex optimization in data mining: it would indicate that multi-block splittings can yield measurable gains not predicted by existing convergence theory, encouraging broader empirical exploration of block variants beyond the standard two-block form.
major comments (2)
- [Abstract] Abstract and experimental sections: the central claim requires that multi-block and two-block ADMM solve identical convex programs (same objective and feasible set). No verification is reported that both variants attain the same optimal objective value (within solver tolerance) or satisfy the original constraints to the same degree across the datasets; without this check, observed differences in objective or prediction error could arise from inequivalent formulations rather than the ADMM splitting strategy itself.
- [Experiments] Experimental design (throughout results sections): the assertion of consistent outperformance 'across all datasets and for the entire range of dual step sizes' is presented without reported statistical significance tests, variance estimates across random seeds, or explicit controls for implementation details (e.g., stopping tolerances, initialization, or auxiliary-variable handling). These omissions make it impossible to assess whether the reported gap is load-bearing or could be explained by confounding factors.
minor comments (2)
- [Problem formulation] Notation for the multi-block splitting and the mapping back to the original variables should be clarified with an explicit statement of how the auxiliary variables are eliminated to recover the original problem.
- [Figures] Figure captions and axis labels for convergence plots would benefit from explicit mention of the tolerance used and whether the plotted quantity is the original or augmented Lagrangian.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major comment below and indicate the changes we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental sections: the central claim requires that multi-block and two-block ADMM solve identical convex programs (same objective and feasible set). No verification is reported that both variants attain the same optimal objective value (within solver tolerance) or satisfy the original constraints to the same degree across the datasets; without this check, observed differences in objective or prediction error could arise from inequivalent formulations rather than the ADMM splitting strategy itself.
Authors: We agree that explicit verification strengthens the central claim. Although the problem reformulations are mathematically equivalent by construction, we will add a new subsection reporting final objective values and constraint violation norms achieved by both variants on every dataset to confirm they reach equivalent solutions within tolerance. revision: yes
-
Referee: [Experiments] Experimental design (throughout results sections): the assertion of consistent outperformance 'across all datasets and for the entire range of dual step sizes' is presented without reported statistical significance tests, variance estimates across random seeds, or explicit controls for implementation details (e.g., stopping tolerances, initialization, or auxiliary-variable handling). These omissions make it impossible to assess whether the reported gap is load-bearing or could be explained by confounding factors.
Authors: We will revise the experimental section to explicitly document the shared implementation controls (identical stopping tolerances, initialization, and auxiliary-variable handling) used for both variants. The uniform outperformance across every dataset and step size already provides evidence against confounding; we view additional random-seed variance reporting as desirable but secondary, and will include it if new runs can be performed without altering the manuscript scope. revision: partial
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential predictions.
full rationale
This is an empirical evaluation paper whose central claim rests on experimental results comparing two-block and multi-block ADMM across datasets. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The equivalence of problem formulations is asserted as a premise for fair comparison but is not derived within the paper via any of the enumerated circular patterns. The study is self-contained against external benchmarks (reproducible experiments on public datasets) and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundation and Trends in Machine Learning, pages 1–122, 2011
work page 2011
-
[2]
H. Wang and A. Banerjee. Bregman Alternating Direction Method of Multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 2816–2824, 2014
work page 2014
- [3]
-
[4]
B. Wahlberg, S. Boyd, M. Annergren, and Y . Wang. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88, 2012
work page 2012
-
[5]
J. Yang and Y . Zhang. Alternating direction algorithms for L1-problems in compressive sensing.SIAM Journal on Scientific Computing, 33(1):250–278, 2011
work page 2011
-
[6]
W. Deng, M. Lai, Z. Peng, and W. Yin. Parallel Multi-Block ADMM with O(1/ k) Convergence. Journal of Scientific Computing, 71(2):712–736, May 2017
work page 2017
-
[7]
B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313–340, 2012
work page 2012
-
[8]
E. Hu and J.T. Kwok. Efficient Kernel Learning from Side Information Using ADMM. InIJCAI, pages 1408–1414, 2013
work page 2013
-
[9]
P. Das, N. Johnson, and A. Banerjee. Online lazy updates for portfolio selection with transaction costs. In AAAI, 2013
work page 2013
-
[10]
H. Wang, A. Banerjee, and Z. Luo. Parallel direction method of multipliers. In Advances in Neural Information Processing Systems (NIPS), pages 181–189, 2014
work page 2014
-
[11]
C. Chen, B. He, Y . Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016
work page 2016
-
[12]
Y . Wang, W. Yin, and J. Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
F. Wang, W. Cao, and Z. Xu. Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[14]
T. Lin, S. Ma, and S. Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015
work page 2015
-
[15]
X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[16]
V . Klee and G.J. Minty. How good is the simplex algorithm. Technical report, Washington University Seattle.. Dept of Mathematics, 1970
work page 1970
-
[17]
D. P Kingma and J. Ba. Adam: Amethod for stochastic optimization. In International Conference on Learning Representations, 2014
work page 2014
-
[18]
D. A Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004
work page 2004
-
[19]
S.S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. arXiv:1810.02054 [cs, math, stat], 2018. 13 A PREPRINT
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[20]
S.S. Du, J.D. Lee, H. Li, L. Wang, and X. Zhai. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 [cs, math, stat], 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[21]
D. Zou, Y . Cao, D. Zhou, and Q. Gu. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. arXiv:1811.08888 [cs, math, stat], November 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[22]
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data. arXiv:1710.10174 [cs], 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[23]
M. Hong, T. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z. Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[24]
M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1):165–199, 2017
work page 2017
-
[25]
T. Lin, S. Ma, and S. Zhang. Iteration complexity analysis of multi-block admm for a family of convex minimization without strong convexity. Journal of Scientific Computing, 69(1):52–81, 2016
work page 2016
- [26]
-
[27]
M. Hong, Z. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016
work page 2016
-
[28]
Gonçalves, Dazhe Zhao, and Arindam Banerjee
Xiaoli Liu, Peng Cao, André R. Gonçalves, Dazhe Zhao, and Arindam Banerjee. Modeling alzheimer’s disease progression with fused laplacian sparse group lasso. ACM Trans. Knowl. Discov. Data, 12(6):65:1–65:35, 2018
work page 2018
- [29]
-
[30]
L. Yuan, J. Liu, and J. Ye. Efficient Methods for Overlapping Group Lasso.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2104–2116, 2013
work page 2013
-
[31]
Y . L. Yu. Better approximation and faster algorithm using the proximal average. InAdvances in Neural Information Processing Systems (NIPS), pages 458–466, 2013
work page 2013
-
[32]
Y . L. Yu. On decomposing the proximal map. InAdvances in Neural Information Processing Systems (NIPS), pages 91–99, 2013
work page 2013
-
[33]
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008
work page 2008
-
[34]
J. Zhou, J. Liu, V . A. Narayan, J. Ye, and Alzheimer’s Disease Neuroimaging Initiative. Modeling disease progression via multi-task learning. NeuroImage, 78:233–248, 2013
work page 2013
-
[35]
J. Wan, Z. Zhang, B. D. Rao, S. Fang, J. Yan, A. J. Saykin, L. Shen, and Alzheimer’s Disease Neuroimaging Ini- tiative. Identifying the Neuroanatomical Basis of Cognitive Impairment in Alzheimer’s Disease by Correlationand Nonlinearity-Aware Sparse Bayesian Learning. IEEE Transactions on Medical Imaging, 33(7):1475–1487, 2014
work page 2014
-
[36]
H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen. High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer’s Disease Progression Prediction. In Advances in Neural Information Processing Systems (NIPS), 2012
work page 2012
-
[37]
S. Khoo, D. Petillo, U. Kang, J. H. Resau, B. Berryhill, J. Linder, L. Forsgren, L.A. Neuman, and A. Tan. Plasma- based circulating microrna biomarkers for parkinson’s disease. Journal of Parkinson’s disease, 2(4):321–331, 2012
work page 2012
-
[38]
L. Parnetti, A. Castrioto, D. Chiasserini, E. Persichetti, N. Tambasco, O. El-Agnaf, and P. Calabresi. Cerebrospinal fluid biomarkers in parkinson disease. Nature Reviews Neurology, 9(3):131, 2013. 14
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.