What is the difference between Ridge Regression, the LASSO, and ElasticNet?

tldr: “Ridge” is a fancy name for L2-regularization, “LASSO” means L1-regularization, and “ElasticNet” is a weighted mix of L1 and L2 regularization. If you’re still confused, keep reading…

Logistic Regression

This article is about different ways of regularizing regressions. In the context of classification we might use logistic regression, but these ideas apply just as well to any kind of regression or GLM. Logistic regression uses a model of the form

h(x|theta) = sigmoid(x dot theta + b)

and seeks a theta which minimizes some objective function, usually

loss(theta) = −∑ [ y*log(h(x|theta)) + (1−y)*log(1−h(x|theta)) ]

which is obfuscated by a couple of clever tricks. It can be read as a smooth stand-in for the more intuitive objective function:

loss(theta) = ∑ |y − round(h(x|theta))|

i.e. the number of misclassified x, which makes sense to try to minimize.
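To make the model and the two objectives concrete, here is a minimal sketch in plain Python. The function names and the toy data are made up for illustration, and the log loss is written with a leading minus sign so that smaller values are better.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(x, theta, b):
    """Model output: predicted probability that x belongs to class 1."""
    return sigmoid(sum(xi * ti for xi, ti in zip(x, theta)) + b)

def log_loss(X, y, theta, b):
    """Negative log-likelihood, summed over all examples."""
    total = 0.0
    for x, label in zip(X, y):
        p = h(x, theta, b)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total

def misclassified(X, y, theta, b):
    """Count examples whose thresholded prediction disagrees with the label."""
    return sum(
        1 for x, label in zip(X, y)
        if (h(x, theta, b) >= 0.5) != (label == 1)
    )

# Toy data (hypothetical): one feature, labels split around zero.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]

good_loss = log_loss(X, y, [2.0], 0.0)   # theta points the right way
bad_loss = log_loss(X, y, [-2.0], 0.0)   # theta points the wrong way
```

On this toy data the theta that misclassifies nothing also has the lower log loss, which is the sense in which the smooth objective tracks the intuitive one.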


In many cases, you wish to regularize your parameter vector theta. This means you want to minimize the number of misclassified examples while also minimizing the magnitude of the parameter vector. These objectives are in opposition, so the data scientist needs to decide on the appropriate balance between them, using their intuition or many empirical tests (e.g. cross validation). Let’s call our previous loss function

loss(theta) = −∑ [ y*log(h(x|theta)) + (1−y)*log(1−h(x|theta)) ]

the basic_loss(theta). Our new, regularized loss function will look like:

loss(theta) = basic_loss(theta) + k * magnitude(theta)

Recall we’re trying to minimize loss(theta) which means we’re applying downwards pressure on both the number of mistakes we make as well as the magnitude of theta. In the above loss function, k is a hyperparameter which modulates the tradeoff of how much downwards pressure we apply to the error of the classifier defined by theta versus the magnitude of theta. Therefore, k encodes our prior beliefs, our intuitions, as to how the process we’re modeling is most likely to behave.
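The tradeoff controlled by k can be sketched with a deliberately tiny example. Everything here is a hypothetical stand-in: `toy_basic_loss` is a simple quadratic rather than the logistic loss, `toy_magnitude` is just an absolute value, and the "optimizer" is a brute-force search over a grid of candidate values.

```python
def regularized_loss(basic_loss, magnitude, theta, k):
    """basic_loss(theta) + k * magnitude(theta), as in the formula above."""
    return basic_loss(theta) + k * magnitude(theta)

def toy_basic_loss(theta):
    # Stand-in for the data-fit term: smallest at theta = [3.0].
    return (theta[0] - 3.0) ** 2

def toy_magnitude(theta):
    # Stand-in for the magnitude of a one-element parameter vector.
    return abs(theta[0])

# Candidate parameter vectors [0.0], [0.1], ..., [4.0].
candidates = [[i / 10.0] for i in range(0, 41)]

def best_theta(k):
    """Pick the candidate with the lowest regularized loss for a given k."""
    return min(
        candidates,
        key=lambda t: regularized_loss(toy_basic_loss, toy_magnitude, t, k),
    )

no_penalty = best_theta(0.0)      # only the data-fit term matters
some_penalty = best_theta(2.0)    # pulled part of the way toward zero
heavy_penalty = best_theta(10.0)  # the penalty dominates
```

As k grows, the chosen theta moves away from the value that best fits the data and toward zero, which is the downwards pressure on magnitude described above.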


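Now on to the interesting part. It turns out there is not one, but many ways of defining the magnitude (also called the norm) of a vector. The most commonly used norms are the p-norms, which have the following character:

Lp(theta) = (∑ |theta_i|^p)^(1/p)

For p = 1 this is the L1 norm (the sum of the absolute values), and for p = 2 it is the L2 norm (the familiar Euclidean length). If we choose the L1 norm,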

loss(theta) = basic_loss(theta) + k * L1(theta)

is called “the LASSO”. If we choose the L2 norm,

loss(theta) = basic_loss(theta) + k * L2(theta)

is called Ridge Regression (which also turns out to have other names, such as Tikhonov regularization). If we decide we’d like a little of both,

loss(theta) = basic_loss(theta) + k * (j * L1(theta) + (1 - j) * L2(theta))

is called “Elastic Net”. Notice the addition of a second hyperparameter here. Notice also that ElasticNet encompasses both the LASSO and Ridge, by setting hyperparameter j to 1 or 0.
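The three penalties can be written out in a few lines. This is a sketch assuming the standard p-norm definition (sum of |theta_i|^p, raised to the power 1/p); the function names are illustrative, and k and j are the hyperparameters from the formulas above.

```python
def p_norm(theta, p):
    """Standard p-norm: (sum of |theta_i|**p) ** (1/p)."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

def lasso_penalty(theta, k):
    """k * L1(theta)."""
    return k * p_norm(theta, 1)

def ridge_penalty(theta, k):
    """k * L2(theta)."""
    return k * p_norm(theta, 2)

def elastic_net_penalty(theta, k, j):
    """k * (j * L1(theta) + (1 - j) * L2(theta)), with j between 0 and 1."""
    return k * (j * p_norm(theta, 1) + (1 - j) * p_norm(theta, 2))

theta = [3.0, -4.0]  # L1 norm is 7, L2 norm is 5
k = 0.1

as_lasso = elastic_net_penalty(theta, k, 1.0)   # same as lasso_penalty
as_ridge = elastic_net_penalty(theta, k, 0.0)   # same as ridge_penalty
half_half = elastic_net_penalty(theta, k, 0.5)  # halfway between the two
```

Setting j to 1 or 0 recovers the LASSO and Ridge penalties exactly, which is the sense in which ElasticNet encompasses both.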

On the Naming of Algorithms

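Academia has a complicated incentive structure. One aspect of that incentive structure is that it is desirable to have a unique name for your algorithmic invention, even when that invention is a minor derivative of another idea, or even the same idea applied in a different context. Take, for example, Principal Component Analysis, which, depending on the field, also goes by names such as the Karhunen-Loève transform, the Hotelling transform, and proper orthogonal decomposition.

Contrast this with a line from the Zen of Python: “There should be one - and preferably only one - obvious way to do it.”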

Comparing regularization techniques — Intuition

Now that we have disambiguated what these regularization techniques are, let’s finally address the question: What is the difference between Ridge Regression, the LASSO, and ElasticNet?

Comparing regularization techniques — In Practice

There are a number of reasons to regularize regressions. Typically, the goal is to prevent overfitting, and in that case, L2 has some nice theoretical guarantees built into it. Another common purpose is interpretability, and in that case L1-regularization can be quite powerful, because it tends to drive many coefficients to exactly zero and so leaves a model with fewer active features.
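One way to see why L1 helps with interpretability is a one-coefficient toy problem, sketched here under simplifying assumptions. The basic loss is taken to be 0.5*(theta - z)^2 around an unregularized estimate z (a hypothetical stand-in, not the logistic loss), the L1 penalty is k*|theta|, and the L2 penalty is the squared form 0.5*k*theta^2 that is common in practice. Both problems then have closed-form minimizers.

```python
def l1_minimizer(z, k):
    """argmin over theta of 0.5*(theta - z)**2 + k*abs(theta).

    This is soft thresholding: move z toward zero by k, stopping at zero.
    """
    shrunk = max(abs(z) - k, 0.0)
    if shrunk == 0.0:
        return 0.0
    return shrunk if z > 0 else -shrunk

def l2_minimizer(z, k):
    """argmin over theta of 0.5*(theta - z)**2 + 0.5*k*theta**2.

    This rescales z by 1/(1 + k), so a nonzero z never becomes exactly zero.
    """
    return z / (1.0 + k)

# Hypothetical unregularized coefficient estimates: two large, two small.
estimates = [2.5, -0.3, 0.1, -1.2]
k = 0.5

l1_fit = [l1_minimizer(z, k) for z in estimates]
l2_fit = [l2_minimizer(z, k) for z in estimates]

l1_zeros = sum(1 for t in l1_fit if t == 0.0)
l2_zeros = sum(1 for t in l2_fit if t == 0.0)
```

With these numbers the L1 solution sets the two small coefficients to exactly zero and keeps the two large ones, while the L2 solution shrinks all four but leaves every one of them nonzero.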

[Figures: L2-regularized Logistic Regression; L1-regularized Logistic Regression]


Regularization can be very powerful, but it’s somewhat under-appreciated, partly, I think, because the intuitions aren’t always well explained. The ideas are mostly very simple, but often not well documented. I hope this article helps mend that deficit.
