In the previous post, we considered Stochastic Gradient Descent, a popular method for optimizing "separable" functions (that is, functions that are purely sums of other functions) in a large, distributed environment. However, Stochastic Gradient Descent is not the only algorithm out there.
So why consider anything else? First of all, with Stochastic Gradient Descent we have to choose step sizes, and convergence can be quite sensitive to that choice. Secondly, Stochastic Gradient Descent is naturally sequential: each update depends on the previous iterate, so parallelizing it takes real care (the HOGWILD! paper in the references is devoted to exactly this problem).
Alternating Direction Method of Multipliers (ADMM) is an entirely different method of distributed optimization that is far better suited to MapReduce, and it requires only a single parameter, the penalty $\rho$, rather than a step-size schedule. However, using it requires quite a bit more mathematical preparation.
The basic idea is that if we have an optimization problem specified as follows,
$$ \begin{align} \text{minimize} \quad & f(x) + g(z) \\ \text{subject to} \quad & A x + B z = c \end{align} $$
then we can derive the Lagrangian and add a quadratic penalty for violating the constraint, yielding the augmented Lagrangian
$$ L_{\rho}(x, z, y) = f(x) + g(z) + y^T (A x + B z - c) + \frac{\rho}{2} \| A x + B z - c \|_2^2 $$
Finally, we apply the following algorithm.
Input: initial primal and dual iterates $x_0$, $z_0$, $y_0$.

Optimize over the first primal variable,
$$ x_{t+1} = \text{argmin}_x L_{\rho}(x,z_t, y_t) $$ 
Optimize over the second primal variable,
$$ z_{t+1} = \text{argmin}_z L_{\rho}(x_{t+1}, z, y_t) $$ 
Take a gradient ascent step for the dual variable,
$$ y_{t+1} = y_t + \rho (A x_{t+1} + B z_{t+1} - c) $$
Notice the choice of step size for updating the dual variable: it is the same $\rho$ that appears in the quadratic penalty, so there is no separate step-size schedule to tune.
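To make the three updates concrete, here is a minimal sketch in Python on a toy instance of my own choosing (not from the discussion above): $f(x) = \tfrac{1}{2}x^2$, $g(z) = \tfrac{1}{2}z^2$, with constraint $x + z = 1$, so $A = B = 1$ and $c = 1$. Both argmins have closed forms, and the optimum is $x = z = 0.5$.

```python
# Toy ADMM instance: minimize 0.5*x**2 + 0.5*z**2 subject to x + z = 1.
# Augmented Lagrangian: 0.5*x^2 + 0.5*z^2 + y*(x+z-1) + (rho/2)*(x+z-1)^2.

rho = 1.0                  # penalty parameter (also the dual step size)
x, z, y = 0.0, 0.0, 0.0    # initial primal and dual iterates

for t in range(100):
    # x-update: set d/dx [0.5*x^2 + y*(x+z-1) + (rho/2)*(x+z-1)^2] = 0
    x = (rho * (1 - z) - y) / (1 + rho)
    # z-update: same form, using the fresh x_{t+1}
    z = (rho * (1 - x) - y) / (1 + rho)
    # dual ascent step with step size rho
    y = y + rho * (x + z - 1)

print(x, z, y)   # -> approximately 0.5, 0.5, -0.5
```

The iterates converge geometrically here; note that $y$ tends to $-0.5$, matching the stationarity condition $x + y = 0$ at the optimum.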
The question now becomes: how can we apply this seemingly restricted method to make a distributed algorithm? Suppose we want to minimize our usual separable function
$$ f(x) = \sum_{i=1}^{N} f_i(x) $$
We can reformulate this problem by giving each machine $i$ its own local copy $x_i$ of the parameters and constraining every copy to agree with a global consensus variable $z$:
$$ \begin{align} \text{minimize} \quad & \sum_{i=1}^{N} f_i(x_i) \\ \text{subject to} \quad & x_i = z, \quad i = 1, \ldots, N \end{align} $$
This means that we can optimize each $x_i$ independently on its own machine, coordinating only through the shared variable $z$.
Input: initial primal and dual iterates $x_{0,i}$, $z_0$, $y_{0,i}$.

For each machine $i$ in parallel, optimize the local variable $x_i$,
$$ x_{t+1, i} = \text{argmin}_x f_i(x) + y_{t,i}^T (x - z_t) + \frac{\rho}{2} (x - z_t)^T (x - z_t) $$ 
Aggregate the resulting $x_{t+1, i}$ and optimize the global variable $z$,
$$ \begin{align} z_{t+1} &= \text{argmin}_z \sum_{i=1}^{N} \left( y_{t,i}^T (x_{t+1, i} - z) + \frac{\rho}{2} (x_{t+1, i} - z)^T (x_{t+1, i} - z) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} \left( x_{t+1, i} + \frac{1}{\rho} y_{t, i} \right) \end{align} $$ 
Update the dual variables $y_{t,i}$,
$$ y_{t+1, i} = y_{t, i} + \rho ( x_{t+1,i} - z_{t+1} ) $$
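The map/reduce loop above can be sketched in Python. As an illustrative assumption (not from the post), each "machine" holds a local objective $f_i(x) = \tfrac{1}{2}(x - a_i)^2$, so every update has a closed form and the consensus optimum is the mean of the $a_i$.

```python
# Consensus ADMM sketch: machine i holds f_i(x) = 0.5*(x - a_i)**2,
# so the minimizer of sum_i f_i is the mean of the a_i.

a = [1.0, 2.0, 6.0, 7.0]   # one local objective per simulated machine
N = len(a)
rho = 1.0
xs = [0.0] * N             # local primal variables x_i
ys = [0.0] * N             # dual variables y_i
z = 0.0                    # global consensus variable

for t in range(200):
    # "map" step: each machine minimizes its augmented local objective
    xs = [(a[i] + rho * z - ys[i]) / (1 + rho) for i in range(N)]
    # "reduce" step: the z-update is just an average
    z = sum(xs[i] + ys[i] / rho for i in range(N)) / N
    # each machine updates its dual variable locally
    ys = [ys[i] + rho * (xs[i] - z) for i in range(N)]

print(z)   # -> approximately 4.0, the mean of a
```

In a real MapReduce job, the $x_i$ and $y_i$ updates run on the workers and only the averaging step requires communication.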
This is already pretty cool, but there's even more. It turns out that ADMM works splendidly even when we add a regularization penalty to the primal problem, such as the $\ell_1$ penalty used in the lasso: the regularizer simply gets folded into the global $z$-update.
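As a sketch of how that works in the consensus setting (under the same hypothetical quadratic $f_i$ as above): adding a penalty $\lambda |z|$ leaves the $x_i$ and $y_i$ updates untouched, and the averaging $z$-update is merely wrapped in the soft-thresholding (proximal) operator.

```python
def soft_threshold(v, k):
    """Proximal operator of k*|.|: shrink v toward zero by k, clipping at 0."""
    if v > k:
        return v - k
    if v < -k:
        return v + k
    return 0.0

# With penalty lambda*|z|, the update z = mean(x_i + y_i / rho) becomes
#   z = soft_threshold(mean(x_i + y_i / rho), lam / (N * rho))
print(soft_threshold(3.0, 1.0))    # -> 2.0
print(soft_threshold(-0.5, 1.0))   # -> 0.0
```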
On a final note, the convergence theory for ADMM is currently not as complete as that for other methods like Stochastic Gradient Descent. While it is known that the dual variable converges and the constraint residual vanishes under mild convexity assumptions, convergence rates are available only in more restricted settings, such as the polyhedral-constrained case treated in the references below.
References
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
- A Proof of Convergence for the Alternating Direction Method of Multipliers Applied to Polyhedral-Constrained Functions
- HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- Alternating Direction Method of Multipliers