When I originally wrote about the Alternating Direction Method of
Multipliers algorithm, the community's understanding of its
convergence properties was limited, to say the least. While it has long been
known (see Boyd's excellent article, Appendix A) that ADMM *will*
converge, it is only recently that the community has begun to establish *how
fast* it converges (e.g. Hong, Deng, Feng, He).

In this article, we'll explore one way to establish an $O(1 / T)$ rate of convergence for ADMM.

# How does it work?

Let's begin by introducing the optimization problem ADMM solves,

$$
\begin{align*}
  \min_{x, z} \quad & f(x) + g(z) \\
  \text{s.t.}  \quad & Ax + Bz = c
\end{align*}
$$

This problem is characterized by 2 primal variables, $x$ and $z$, whose objective terms $f(x)$ and $g(z)$ interact only through the linear constraint $Ax + Bz = c$.

The ADMM algorithm then finds the "saddle point" of the Augmented Lagrangian for the corresponding problem,

$$
  L_{\rho}( x, z, y )
  = f(x) + g(z)
  + y^T ( Ax + Bz - c )
  + \frac{\rho}{2} \| Ax + Bz - c \|_2^2
$$

Note that we say *Augmented* Lagrangian, as the typical Lagrangian does not
include the final quadratic term. It's easy to see, however, that the quadratic
does not affect the problem's optimal solution, as the constraint
$Ax + Bz = c$ guarantees that the quadratic term is exactly zero at any feasible point.

The ADMM algorithm iteratively minimizes $L_{\rho}$ over $x$ with $z$ and $y$ held fixed, then over $z$ with $x$ and $y$ held fixed, and finally takes a dual ascent step in $y$,

**Input** Step size $\rho$, initial primal iterates $x^{(0)}$ and $z^{(0)}$, initial dual iterate $y^{(0)}$

- For $t = 0, 1, \ldots$
    - (Step A) Let $x^{(t+1)} = \underset{x}{\text{argmin}} \quad L_{\rho}( x , z^{(t)}, y^{(t)} )$
    - (Step B) Let $z^{(t+1)} = \underset{z}{\text{argmin}} \quad L_{\rho}( x^{(t+1)}, z , y^{(t)} )$
    - (Step C) Let $y^{(t+1)} = y^{(t)} + \rho ( Ax^{(t+1)} + Bz^{(t+1)} - c )$
Intuitively, the extra quadratic term prevents each iteration of the algorithm from stepping "too far" from the last iteration, an idea that's also at the core of Proximal Gradient Descent.
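To make the three updates concrete, here is a minimal sketch of the loop in Python; the `argmin_x` and `argmin_z` arguments are placeholders for problem-specific solvers of Steps A and B, not functions from any particular library,

```
import numpy as np

def admm(argmin_x, argmin_z, A, B, c, rho=1.0, iters=100):
    """Generic ADMM template: the caller supplies oracles that minimize the
    Augmented Lagrangian over x (with z, y fixed) and over z (with x, y fixed)."""
    x = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    y = np.zeros(c.shape[0])
    for _ in range(iters):
        x = argmin_x(z, y)                  # Step A: minimize L_rho over x
        z = argmin_z(x, y)                  # Step B: minimize L_rho over z
        y = y + rho * (A @ x + B @ z - c)   # Step C: dual ascent step on y
    return x, z, y
```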

In the remainder of the article, we'll often use the following notation for conciseness,

# Why does it work?

Unlike other convergence proofs presented on this website, we won't directly
show that the objective converges to its minimum as $t \rightarrow \infty$; instead, we'll show that the objective evaluated at the *average* of the iterates approaches its optimal value at a rate of $O(1 / T)$.

For the following proof, we'll replace

**Assumptions**

The assumptions on ADMM are almost as light as we can imagine. This is
largely due to the fact that we needn't use gradients or subgradients of $f$ or $g$ anywhere in the algorithm itself. In fact, we assume only that,

- $f(x) + g(z)$ is convex.
- There exists a solution $[ x^{*}; z^{*} ]$ that minimizes $f(x) + g(z)$ while respecting the constraint $Ax + Bz = c$.

**Proof Outline**

The proof presented hereafter is a particularly simple if unintuitive one. Theoretically, the only tools necessary are the linear lower bound definition of a convex function, the subgradient condition for optimality in an unconstrained optimization problem, and Jensen's Inequality. Steps 1 and 2 below rely purely on the first 2 of these tools. Step 3 merely massages a preceding equation into a simpler form via completing squares. Step 4 closes by exploiting a telescoping sum and Jensen's Inequality to obtain the desired result,

As
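For reference, here are those three tools stated explicitly for a convex function $f$ with subdifferential $\partial f$; nothing beyond these standard facts is needed,

$$
\begin{align*}
  & f(y) \geq f(x) + s^T ( y - x )
    \quad \text{for any } s \in \partial f(x)
    && \text{(linear lower bound)} \\
  & 0 \in \partial f( x^{*} )
    \iff x^{*} \in \underset{x}{\text{argmin}} \, f(x)
    && \text{(subgradient optimality)} \\
  & f \left( \frac{1}{T} \sum_{t=1}^{T} x^{(t)} \right)
    \leq \frac{1}{T} \sum_{t=1}^{T} f( x^{(t)} )
    && \text{(Jensen's Inequality)}
\end{align*}
$$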

**Step 1** Optimality conditions for Step A. In this portion of the proof,
we'll use the fact that $x^{(t+1)}$ is, by definition, a minimizer of $L_{\rho}( x , z^{(t)}, y^{(t)} )$ over $x$.

We begin by recognizing that

As

Substituting in our subgradient for

Now recall Step C of the algorithm: $y^{(t+1)} = y^{(t)} + \rho ( Ax^{(t+1)} + Bz^{(t+1)} - c )$.

We finish by moving everything *not* multiplied by

**Step 2** Optimality conditions for Step B. Similar to Step 1, we'll use the
fact that $z^{(t+1)}$ is, by definition, a minimizer of $L_{\rho}( x^{(t+1)}, z , y^{(t)} )$ over $z$.

As

Substituting in the previously derived subgradient and moving all terms to the left side, we obtain,

Substituting in Step C's definition for

**Step 3** We now sum Equation

We begin by summing equations

Next, we use the definitions of

Then, moving the last term on the left side of the inequality over and
observing that Step C implies $Ax^{(t+1)} + Bz^{(t+1)} - c = \frac{1}{\rho} ( y^{(t+1)} - y^{(t)} )$, we obtain,

We will now tackle the two components on the right hand side of the
inequality in isolation. Our goal is to rewrite these inner products in terms
of sums of

We'll start with

We'll do the same for

Finally, let's plug equations

Recall that

We can substitute that into the right hand side of the preceding equation to cancel out a couple terms,

Finally dropping the portion of the equation that's always non-positive
(doing so doesn't affect the validity of the inequality), we obtain a concise
inequality in terms of sums of

**Step 4** Averaging across iterations. We're now in the home stretch. In this
step, we'll sum the previous equation across

We begin by summing the previous equation across iterations,

For convenience, we'll choose

Finally, recall that for a convex function

The same is true for each of

The right hand side decreases as $O(1 / T)$, giving us the desired rate of convergence.

# When should I use it?

Similar to the proximal methods presented on this website, ADMM is only
efficient if we can perform each of its steps efficiently. Solving
2 optimization problems at each iteration may be very fast or very slow,
depending on whether a closed form solution exists for the $x$ and $z$ updates (Steps A and B).
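
As a concrete illustration of the "fast" case, consider the Lasso: take $f(x) = \frac{1}{2} \| Dx - b \|_2^2$ and $g(z) = \lambda \| z \|_1$ with the constraint $x - z = 0$, so that Step A is a ridge-style linear solve and Step B is soft thresholding. The sketch below uses this splitting and is only an illustration, not the `optim` package used later,

```
import numpy as np

def lasso_admm(D, b, lam, rho=1.0, iters=200):
    """Lasso via ADMM: f(x) = 0.5 ||Dx - b||^2, g(z) = lam ||z||_1, x - z = 0."""
    n = D.shape[1]
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    # Step A solves the same linear system every iteration; build it once.
    M = D.T @ D + rho * np.eye(n)
    Dtb = D.T @ b
    for _ in range(iters):
        # Step A: (D'D + rho I) x = D'b - y + rho z  (a ridge-like solve)
        x = np.linalg.solve(M, Dtb - y + rho * z)
        # Step B: soft-threshold x + y / rho at level lam / rho
        v = x + y / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # Step C: dual ascent on the constraint x - z = 0
        y = y + rho * (x - z)
    return z
```

Precomputing (or factoring) $D^T D + \rho I$ once is what keeps each subsequent iteration cheap.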

ADMM has been particularly useful in supervised machine learning, where
the objective naturally splits into a data-fitting loss $f$ and a regularizer $g$.
All in all, ADMM is *not* a quick method, but it is a scalable one. ADMM is
best suited when data is too large to fit on a single machine, or when the two halves of the objective are far easier to minimize separately than together.
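
For instance, in the "too large for one machine" setting, the global consensus formulation from Boyd's article gives each machine its own copy $x_i$ of the weights, constrained to agree with a shared variable $z$. Below is a rough sketch for least squares split across blocks of rows; as before, this is illustrative rather than code from the `optim` package,

```
import numpy as np

def consensus_lsq(blocks, rho=1.0, iters=100):
    """Consensus ADMM for least squares. `blocks` is a list of (A_i, b_i)
    pairs, e.g. one pair per machine; the constraint x_i = z ties the local
    copies together."""
    n = blocks[0][0].shape[1]
    z = np.zeros(n)
    xs = [np.zeros(n) for _ in blocks]
    ys = [np.zeros(n) for _ in blocks]
    for _ in range(iters):
        # Local x-updates: each block solves its own small ridge-like system.
        for i, (A, b) in enumerate(blocks):
            xs[i] = np.linalg.solve(A.T @ A + rho * np.eye(n),
                                    A.T @ b - ys[i] + rho * z)
        # Global z-update: average the dual-corrected local copies.
        z = np.mean([x + y / rho for x, y in zip(xs, ys)], axis=0)
        # Dual updates: one ascent step per block on the constraint x_i = z.
        for i in range(len(blocks)):
            ys[i] = ys[i] + rho * (xs[i] - z)
    return z
```

Only $z$ and the dual variables need to be communicated; the $x_i$ updates can run in parallel, one per machine.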

# Extensions

**Accelerated** As ADMM is so closely related to Proximal Gradient-based
methods, one might ask if there exists an accelerated variant with a better
convergence rate. The answer is a resounding yes, as shown by Goldstein et
al., though care must be taken for non-strongly convex
objectives. In their article, Goldstein et al. show that a convergence rate of $O(1 / T^2)$ can be achieved when the objective is strongly convex.

**Online** In online learning, one is interested in solving a series of
supervised machine learning instances in sequence with minimal error. At each
iteration, the algorithm is presented with an input, makes a prediction, and only then observes the true output and the loss it has incurred.

In this setting, Wang has shown that an online variant of ADMM
can achieve regret competitive with the best possible (fixed) solution chosen in hindsight.

**Stochastic** In a stochastic setting, one is interested in minimizing the
*average* value of a loss function over samples drawn from an unknown distribution.

**Multi Component** Traditional ADMM considers an objective with only
2 components, $f(x)$ and $g(z)$; a natural extension asks whether the same updates still converge when the objective is split into 3 or more blocks of variables.

# References

**ADMM** While ADMM has existed for decades, it has only recently been brought
to light by Boyd's article describing its applications in statistical
machine learning. It is from this work that I initially learned of ADMM.

**Proof of Convergence** The proof of convergence presented here is a verbose
expansion of that presented in Wang's paper on Online ADMM.

# Reference Implementation

Using the `optim` Python package, we can generate the animation above,

```
"""
Example usage of ADMM solver.
A gif is generated showing the iterates as they converge.
"""
from matplotlib import animation
from optim.admm import *
from optim.tests.test_admm import quadratic1
import itertools as it
import numpy as np
import pylab as pl
import sys
if len(sys.argv) != 2:
sys.stderr.write("Usage: %s OUTPUT\n" % (sys.argv[0],))
sys.exit(1)
else:
output = sys.argv[1]
prob, state = quadratic1()
admm = ADMM(rho=0.1)
iterates = list(it.islice(admm.solve(prob, state), 0, 50))
pl.figure()
_ = np.linspace(-2, 5)
xs = np.asarray([s.x for s in iterates])
zs = np.asarray([s.z for s in iterates])
xs2 = [prob.primal(State(x=v,z=v,y=0)) for v in xs]
zs2 = [prob.primal(State(x=v,z=v,y=0)) for v in zs]
def animate(i):
print 'iteration:', i
pl.cla()
pl.title("Iteration #%d" % i)
pl.plot (_, [prob.f(v) + prob.g(v) for v in _], 'k-' , label='f(x)+g(z)')
pl.plot (_, [prob.f(v) for v in _], 'g--', label='f(x)' )
pl.plot (_, [ prob.g(v) for v in _], 'b--', label= 'g(z)')
pl.scatter(xs[i], xs2[i], c='g', label='x')
pl.scatter(zs[i], zs2[i], c='b', label='z')
pl.xlim(min(_), max(_))
pl.legend()
anim = animation.FuncAnimation(pl.gcf(), animate, frames=len(iterates))
anim.save(output, writer='imagemagick', fps=4)
```
