In this post, we'll take a look at the Frank-Wolfe Algorithm,
also known as the Conditional Gradient Method, an algorithm particularly suited
for solving problems with compact domains. Like the Proximal
Gradient and Accelerated Proximal
Gradient algorithms, Frank-Wolfe requires that we
exploit problem structure to quickly solve a mini-optimization problem. Our
reward for doing so is a convergence rate of $O(1/t)$ and the potential for
*extremely sparse solutions*.

Returning to my valley-finding metaphor, Frank-Wolfe is a bit like this,

- Look around you and see which way points the most downwards
- Walk as far as possible in that direction until you hit a wall
- Go back in the direction you started, stop part way along the path, then repeat.

# How does it work?

Frank-Wolfe is designed to solve problems of the form,

$$ \min_{x \in D} f(x) $$

where $D$ is compact and $f$ is convex, differentiable, and finite for all $x \in D$.

**Input**: Initial iterate $x^{(0)}$, step sizes $\alpha^{(t)}$

- For $t = 0, 1, 2, \ldots$
    - Let $s^{(t+1)} = \arg\min_{s \in D} \langle \nabla f(x^{(t)}), s \rangle$
    - If $g(x^{(t)}) = \langle \nabla f(x^{(t)}), x^{(t)} - s^{(t+1)} \rangle \le \epsilon$, break
    - Let $x^{(t+1)} = (1 - \alpha^{(t)}) x^{(t)} + \alpha^{(t)} s^{(t+1)}$

The proof relies on a linear approximation to $f$ at $x^{(t)}$: at each
iteration, Frank-Wolfe minimizes this linear approximation over $D$,

$$ s^{(t+1)} = \arg\min_{s \in D} \langle \nabla f(x^{(t)}), s \rangle $$

then takes a step in that direction. We can immediately see that if $D$ were
not compact, this linear mini-problem would be unbounded, which is why
compactness of the domain is essential.

**Upper Bound** One nice property of Frank-Wolfe is that it comes with its
own upper bound on $f(x^{(t)}) - f(x^{*})$, computed for free from quantities
we already calculate: the duality gap $g(x^{(t)})$.

Since $f$ is convex,

$$ f(x^{*})
   \ge f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{*} - x^{(t)} \rangle
   \ge f(x^{(t)}) + \min_{s \in D} \langle \nabla f(x^{(t)}), s - x^{(t)} \rangle
   = f(x^{(t)}) - g(x^{(t)}) $$

we obtain $f(x^{(t)}) - f(x^{*}) \le g(x^{(t)})$, which is exactly the quantity
used in the stopping criterion.
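As a quick sanity check, here's a sketch verifying numerically that the duality gap $g(x) = \langle \nabla f(x), x - s \rangle$ upper bounds the suboptimality; the function and interval below are hypothetical, chosen just for illustration.

```python
# Check f(x) - f(x*) <= g(x) on a made-up 1-D problem:
# f(x) = (x - 3)**2 over D = [-1, 2]; the constrained minimizer is x* = 2.
f = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)

x_star = 2.0
for x in [-1.0, 0.0, 1.0, 1.5]:
    s = -1.0 if grad(x) > 0 else 2.0   # linear mini-problem over [-1, 2]
    gap = grad(x) * (x - s)            # duality gap g(x)
    assert f(x) - f(x_star) <= gap
```

Note that the gap is available at every iteration without knowing $x^{*}$, so it doubles as a practical stopping criterion.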

# A Small Example

For this example, we'll minimize a simple univariate quadratic function constrained to lie in an interval,

$$ \min_{x \in [-1, 2]} \, (x - 0.5)^2 + 2x $$

Its derivative is given by $f'(x) = 2(x - 0.5) + 2$, so the linear mini-problem
is solved by $s = -1$ when $f'(x) > 0$ and $s = 2$ otherwise. The constrained
minimizer is $x^{*} = -0.5$, where $f(x^{*}) = 0$.
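A minimal sketch of Frank-Wolfe on this problem, using the same gradient and endpoint mini-solver as the reference implementation at the end of the post:

```python
# Frank-Wolfe on f(x) = (x - 0.5)**2 + 2x over D = [-1, 2].
# Over an interval, the linear mini-problem is solved by an endpoint:
# whichever one the negative gradient points toward.
gradient = lambda x: 2 * (x - 0.5) + 2
minisolver = lambda g: -1.0 if g > 0 else 2.0

x = 1.0
for t in range(100):
    s = minisolver(gradient(x))
    a = 2.0 / (t + 2.0)        # step size from the convergence proof
    x = (1 - a) * x + a * s

print(round(x, 2))   # close to the constrained minimizer x* = -0.5
```

The iterates zig-zag between steps toward $s = -1$ and $s = 2$, settling down around $x^{*} = -0.5$ at the characteristic $O(1/t)$ pace.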

# Why does it work?

We begin by making the two assumptions given earlier:

- $f$ is convex, differentiable, and finite for all $x \in D$
- $D$ is compact

**Assumptions** First, notice that we never needed to assume that a solution
$x^{*}$ exists: because $D$ is compact and $f$ is finite on $D$, a minimizer is
guaranteed to exist.

Secondly, we never made a Lipschitz assumption on $f$ or its gradient.
Instead, the proof depends on the *curvature constant* $C_f$,

$$ C_f = \sup_{\substack{x, s \in D \\ \gamma \in [0,1] \\ y = x + \gamma (s - x)}}
   \frac{2}{\gamma^2} \left( f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right) $$

This immediately implies the following upper bound on $f(y)$ for any
$y = x + \gamma (s - x)$ with $x, s \in D$ and $\gamma \in [0, 1]$,

$$ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\gamma^2}{2} C_f $$

**Proof Outline** The proof for Frank-Wolfe is surprisingly simple. The idea
is to first upper bound the error at iteration $t+1$ in terms of the error at
iteration $t$, the step size $\alpha^{(t)}$, and the curvature constant $C_f$,
then use induction on $t$ to obtain an $O(1/t)$ rate.

**Step 1** Upper bound $f(x^{(t+1)})$. Applying the curvature bound with
$y = x^{(t+1)} = x^{(t)} + \alpha^{(t)} (s^{(t+1)} - x^{(t)})$ and noting that
$\langle \nabla f(x^{(t)}), s^{(t+1)} - x^{(t)} \rangle = -g(x^{(t)})$,

$$ f(x^{(t+1)}) \le f(x^{(t)}) - \alpha^{(t)} g(x^{(t)}) + \frac{ (\alpha^{(t)})^2 }{2} C_f $$

Subtracting $f(x^{*})$ from both sides, writing $h(x^{(t)}) = f(x^{(t)}) - f(x^{*})$,
and using the duality-gap bound $g(x^{(t)}) \ge h(x^{(t)})$ gives

$$ h(x^{(t+1)}) \le (1 - \alpha^{(t)}) h(x^{(t)}) + \frac{ (\alpha^{(t)})^2 }{2} C_f $$

**Step 2** Use induction on $t$ to show $h(x^{(t)}) \le \frac{2 C_f}{t+2}$.

We'll assume that the step size is $\alpha^{(t)} = \frac{2}{t+2}$. For the base
case, $\alpha^{(0)} = 1$, so

$$ h(x^{(1)}) \le (1 - \alpha^{(0)}) h(x^{(0)}) + \frac{ (\alpha^{(0)})^2 }{2} C_f
   = \frac{C_f}{2} \le \frac{2 C_f}{3} $$

Next, for the recursive case, we use the inductive assumption on $h(x^{(t)})$,

$$ h(x^{(t+1)})
   \le \left( 1 - \frac{2}{t+2} \right) \frac{2 C_f}{t+2}
     + \frac{1}{2} \left( \frac{2}{t+2} \right)^2 C_f
   = \frac{2 C_f (t+1)}{(t+2)^2}
   \le \frac{2 C_f}{t+3} $$

Thus, if we want an error tolerance of $\epsilon$, it suffices to run
$O(1/\epsilon)$ iterations -- that is, Frank-Wolfe converges at a rate of $O(1/t)$.
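We can check the bound numerically on the interval-constrained quadratic from the example above. For $f(x) = (x - 0.5)^2 + 2x$ on $D = [-1, 2]$, the Taylor remainder is exactly $(y - x)^2$, so $C_f = 2 \max_{x, s \in D} (s - x)^2 = 18$.

```python
# Check h(x_t) = f(x_t) - f(x*) <= 2*C_f/(t+2) along the Frank-Wolfe trajectory
# for f(x) = (x - 0.5)**2 + 2x over D = [-1, 2], where f(x*) = 0 at x* = -0.5.
# For this f, f(y) - f(x) - f'(x)(y - x) = (y - x)**2, so C_f = 2 * 3**2 = 18.
f = lambda x: (x - 0.5) ** 2 + 2 * x
gradient = lambda x: 2 * (x - 0.5) + 2
minisolver = lambda g: -1.0 if g > 0 else 2.0

C_f = 18.0
x = 1.0
for t in range(200):
    assert f(x) - 0.0 <= 2 * C_f / (t + 2)   # the bound at iteration t
    s = minisolver(gradient(x))
    a = 2.0 / (t + 2.0)
    x = (1 - a) * x + a * s
```

In practice the bound is quite loose on this problem; the actual error decays much faster than $36 / (t+2)$.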

# When should I use it?

Like Proximal Gradient, efficient use of Frank-Wolfe requires solving a
mini-optimization problem at each iteration. Unlike Proximal Gradient, however,
this mini-problem will lead to unbounded iterates if the input space is not
compact -- in other words, Frank-Wolfe cannot directly be applied when your
domain is all of $\mathbb{R}^{n}$.

**Sparsity** The primary reason machine learning researchers have recently
taken an interest in Frank-Wolfe is because in certain problems the iterates
can be *extremely sparse*. Since $x^{(t)}$ is a convex combination of the
initial iterate and the vertices $s^{(1)}, \ldots, s^{(t)}$, after $t$
iterations it is supported on at most $t + 1$ atoms -- even when intermediate
*gradients* are completely dense.

**Atomic Norms** One particular case where Frank-Wolfe shines is when
minimizing $f(x)$ subject to an atomic-norm constraint $||x||_{\mathcal{A}} \le c$,
where $||x||_{\mathcal{A}}$ is the smallest scaling of the convex hull of a set
of atoms $\mathcal{A}$ that contains $x$. In this case, the linear mini-problem
is solved by the single (scaled) atom most negatively correlated with
$\nabla f(x^{(t)})$.

For example, the $\ell_1$ ball is the convex hull of the atoms $\{ \pm e_i \}$,
so the mini-problem reduces to picking the coordinate of $\nabla f(x^{(t)})$
with the largest magnitude -- and each iteration adds at most one nonzero
coordinate to the iterate.
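A sketch of the $\ell_1$-ball mini-solver (the function name and example gradient are made up for illustration):

```python
import numpy as np

# The Frank-Wolfe mini-problem over the l1 ball {x : ||x||_1 <= c} is solved
# by a single scaled, signed coordinate vector -- hence sparse iterates.
def l1_minisolver(grad, c=1.0):
    i = np.argmax(np.abs(grad))       # atom +/- c * e_i most aligned with -grad
    s = np.zeros_like(grad)
    s[i] = -c * np.sign(grad[i])
    return s

g = np.array([0.5, -2.0, 1.0])
print(l1_minisolver(g))   # a single nonzero entry
```

The solver touches one coordinate per call, so after $t$ iterations the iterate has at most $t + 1$ nonzero entries regardless of the ambient dimension.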

# Extensions

**Step Size** The proof above relied on a step size of
$\alpha^{(t)} = \frac{2}{t+2}$, but the same $O(1/t)$ guarantee holds if
$\alpha^{(t)}$ is instead chosen by line search,
$\alpha^{(t)} = \arg\min_{\alpha \in [0,1]} f\left( (1 - \alpha) x^{(t)} + \alpha s^{(t+1)} \right)$,
since line search can only make more progress at each iteration.
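For a quadratic objective the line search has a closed form. A sketch on the interval-constrained quadratic from the example above:

```python
# Frank-Wolfe with exact line search on f(x) = (x - 0.5)**2 + 2x over [-1, 2].
# Minimizing f(x + a*d) over a in [0, 1] with d = s - x gives
# a* = clip(-f'(x) * d / (2 * d**2), 0, 1), since f''(x) = 2 everywhere.
gradient = lambda x: 2 * (x - 0.5) + 2
minisolver = lambda g: -1.0 if g > 0 else 2.0

x = 1.0
for t in range(20):
    s = minisolver(gradient(x))
    d = s - x
    a = min(max(-gradient(x) * d / (2 * d * d), 0.0), 1.0)
    x = (1 - a) * x + a * s

print(x)   # the constrained minimizer x* = -0.5
```

On this toy problem line search lands on the minimizer in a single step, whereas the fixed $2/(t+2)$ schedule zig-zags toward it.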

**Approximate Linear Solutions** Though not stated in the proof above,
another cool point about Frank-Wolfe is that you don't actually need to solve
the linear mini-problem exactly to converge to the optimal solution (albeit at
a slightly slower rate). In particular, assume that at iteration $t$ each
mini-problem is solved with additive error $\frac{\delta \alpha^{(t)} C_f}{2}$
for some $\delta \ge 0$,

$$ \langle \nabla f(x^{(t)}), s^{(t+1)} \rangle
   \le \min_{s \in D} \langle \nabla f(x^{(t)}), s \rangle
     + \frac{\delta \alpha^{(t)} C_f}{2} $$

then Frank-Wolfe's rate of convergence is

$$ f(x^{(t)}) - f(x^{*}) \le \frac{2 C_f}{t+2} (1 + \delta) $$

The proof for this can be found in the supplement to Jaggi's excellent survey on Frank-Wolfe for machine learning.

# Linear Invariance

Another cool fact about Frank-Wolfe is that it's *linearly invariant* -- that
is, if you rotate and scale the space, nothing changes about the convergence
rate. This is in direct contrast to many other methods whose rates depend on the
condition number of a function (for twice-differentiable functions, this is the
ratio between the largest and smallest eigenvalues of the Hessian,
$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$).
Suppose we transform our input space with a surjective (that is, onto) linear
transformation $M : \hat{D} \to D$, and define $\hat{f}(\hat{x}) = f(M \hat{x})$,
so that $\nabla \hat{f}(\hat{x}) = M^{T} \nabla f(M \hat{x})$.

Let's look at the solution to the per-iteration mini-problem we need to solve for Frank-Wolfe,

$$ \hat{s}^{(t+1)}
   = \arg\min_{\hat{s} \in \hat{D}} \langle \nabla \hat{f}(\hat{x}^{(t)}), \hat{s} \rangle
   = \arg\min_{\hat{s} \in \hat{D}} \langle M^{T} \nabla f(x^{(t)}), \hat{s} \rangle
   = \arg\min_{\hat{s} \in \hat{D}} \langle \nabla f(x^{(t)}), M \hat{s} \rangle $$

In other words, we will find the same direction $s^{(t+1)} = M \hat{s}^{(t+1)}$
whether we run the algorithm in the original space or the transformed one, so
the iterates satisfy $x^{(t)} = M \hat{x}^{(t)}$ for all $t$ and the convergence
rate is unaffected.
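This is easy to check numerically. The sketch below uses a hypothetical setup: $f(x) = \|x - b\|^2$ over the convex hull of a few vertices, pulled back through an invertible $M$; the iterates of the two runs should coincide up to $M$.

```python
import numpy as np

# Check linear invariance: run Frank-Wolfe on f over D = conv(V), and on the
# pulled-back problem fhat(xh) = f(M @ xh) over conv(Vh), where V = M(Vh).
# The iterates should satisfy x_t = M @ xh_t throughout.
M = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([1.0, 0.5])
Vh = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # vertices of D-hat
V = Vh @ M.T                                         # vertices of D = M(D-hat)

def fw(vertices, grad, x0, iters=20):
    x = x0
    for t in range(iters):
        s = vertices[np.argmin(vertices @ grad(x))]  # best vertex of the hull
        a = 2.0 / (t + 2.0)
        x = (1 - a) * x + a * s
    return x

grad  = lambda x: 2 * (x - b)
gradh = lambda xh: M.T @ grad(M @ xh)    # chain rule for fhat

xh0 = np.array([1.0, 0.0])
x  = fw(V,  grad,  M @ xh0)
xh = fw(Vh, gradh, xh0)
print(np.allclose(x, M @ xh))   # True
```

Both runs select corresponding vertices at every step because $\langle M^{T} \nabla f, \hat{s} \rangle = \langle \nabla f, M \hat{s} \rangle$, so the score assigned to each vertex is identical in the two spaces.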

# References

**Proof of Convergence, Linear Invariance** Pretty much everything in this
article comes from Jaggi's fantastic article on Frank-Wolfe for
machine learning.

# Reference Implementation

```
def frank_wolfe(minisolver, gradient, alpha, x0, epsilon=1e-2):
  """Frank-Wolfe Algorithm

  Parameters
  ----------
  minisolver : function
      minisolver(x) = argmin_{s \in D} <x, s>
  gradient : function
      gradient(x) = gradient[f](x)
  alpha : function
      learning rate
  x0 : array
      initial value for x
  epsilon : float
      desired accuracy
  """
  xs = [x0]
  iteration = 0
  while True:
    x = xs[-1]
    g = gradient(x)
    s_next = minisolver(g)
    if g * (x - s_next) <= epsilon:   # duality gap stopping criterion
      break
    a = alpha(iteration=iteration, x=x, direction=s_next)
    x_next = (1 - a) * x + a * s_next
    xs.append(x_next)
    iteration += 1
  return xs


def default_learning_rate(iteration, **kwargs):
  return 2.0 / (iteration + 2.0)


if __name__ == '__main__':
  import os

  import numpy as np
  import pylab as pl
  import yannopt.plotting as plotting

  ### FRANK WOLFE ALGORITHM ###

  # problem definition
  function = lambda x: (x - 0.5) ** 2 + 2 * x
  gradient = lambda x: 2 * (x - 0.5) + 2
  minisolver = lambda y: -1 if y > 0 else 2  # D = [-1, 2]
  x0 = 1.0

  # run Frank-Wolfe
  iterates = frank_wolfe(minisolver, gradient, default_learning_rate, x0)

  ### PLOTTING ###

  plotting.plot_iterates_vs_function(iterates, function,
                                     path='figures/iterates.png', y_star=0.0)
  plotting.plot_iteration_vs_function(iterates, function,
                                      path='figures/convergence.png', y_star=0.0)

  # make animation
  iterates = np.asarray(iterates)
  try:
    os.makedirs('figures/animation')
  except OSError:
    pass

  for t in range(len(iterates) - 1):
    x = iterates[t]
    x_plus = iterates[t + 1]
    s_plus = minisolver(gradient(x))

    f = function
    g = gradient
    f_hat = lambda y: f(x) + g(x) * (y - x)   # linear approximation at x

    xmin, xmax = plotting.limits([-1, 2])
    ymin, ymax = -4, 8

    pl.plot(np.linspace(xmin, xmax), function(np.linspace(xmin, xmax)), alpha=0.2)
    pl.xlim([xmin, xmax])
    pl.ylim([ymin, ymax])
    pl.xlabel('x')
    pl.ylabel('f(x)')

    pl.plot([xmin, xmax], [f_hat(xmin), f_hat(xmax)], '--', alpha=0.2)
    pl.vlines([-1, 2], ymin, ymax, color=np.ones((2, 3)) * 0.2, linestyle='solid')
    pl.scatter(x, f(x), c=[0.8, 0.0, 0.0], marker='o', alpha=0.8)
    pl.scatter(x_plus, f(x_plus), c=[0.0, 0.8, 0.0], marker='D', alpha=0.8)
    pl.vlines(x_plus, f_hat(x_plus), f(x_plus), color=[0.0, 0.8, 0.0], linestyle='dotted')
    pl.scatter(s_plus, f_hat(s_plus), c=[0.35, 0.35, 0.35], marker='x', alpha=0.8)

    pl.savefig('figures/animation/%02d.png' % t)
    pl.close()
```
