1. # Accelerated Proximal Gradient Descent

Thu 25 April 2013

$$\def\prox{\text{prox}}$$ In a previous post, I presented Proximal Gradient, a method for bypassing the $$O(1 / \epsilon^2)$$ convergence rate of Subgradient Descent. This method relied on assuming that the objective function could be expressed as the sum of 2 functions, $$g(x)$$ and $$h(x)$$, with …

2. # Coordinate Ascent for Convex Clustering

Tue 23 April 2013

Convex clustering is the reformulation of k-means clustering as a convex problem. While the two problems are not equivalent, the former can be seen as a relaxation of the latter that allows us to easily find globally optimal solutions (as opposed to only locally optimal ones).

Suppose we have a …

3. # Why does L1 produce sparse solutions?

Mon 22 April 2013

Supervised machine learning problems are typically of the form "minimize your error while regularizing your parameters." The idea is that while many choices of parameters may make your training error low, the goal isn't low training error -- it's low test-time error. Thus, parameters should be minimize training error while remaining …

$$\def\prox{\text{prox}}  In a [previous post][subgradient_descent_usage], I mentioned that one cannot hope to asymptotically outperform the O(\frac{1}{\epsilon^2})$$ convergence rate of Subgradient Descent when dealing with a non-differentiable objective function. This is in fact only half-true; Subgradient Descent cannot be beat using only first-order …
In the mid-1980s, Yurii Nesterov hit the equivalent of an academic home run. At the same time, he established the Accelerated Gradient Method, proved that its convergence rate superior to Gradient Descent ($$O(1/\sqrt{\epsilon})$$ iterations instead of $$O(1/\epsilon)$$), and then proved that no other first-order (that …