Comparison of lasso and ridge regularization


These cases apply to any generalized linear model (such as linear regression or logistic regression).

Case 1: Two copies of the same feature. If only one copy had been included, it would have learned a weight w.

Result under lasso: the weight w gets split across the two copies, but in an indeterminate way: any split \alpha w, (1 - \alpha) w with 0 <= \alpha <= 1 is equally optimal, and the solver may return any of them.

Result under ridge: the weight w gets split evenly across the two copies: w/2 each.
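The article names no particular implementation; as a minimal sketch, case 1 can be reproduced numerically with scikit-learn's Ridge and Lasso (the data, the true weight w = 3, and the penalty strengths are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])           # two identical copies of the same feature
y = 3.0 * x[:, 0]               # with one copy, the learned weight would be w = 3

# Ridge splits w evenly across the copies: roughly w/2 = 1.5 each.
ridge = Ridge(alpha=1e-3, fit_intercept=False).fit(X, y)

# Lasso's split is indeterminate: any alpha*w, (1 - alpha)*w fits equally
# well; the solver simply returns one such split, with total still near w.
lasso = Lasso(alpha=1e-3, fit_intercept=False).fit(X, y)
```

With a tiny penalty, the two fits are nearly exact, so only the split of the total weight differs between the two regularizers.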
Case 2: One primary feature, which for simplicity we take to be binary and nonzero on some fraction of the examples, plus two "backup" features that are each nonzero on disjoint halves of the examples where the primary feature is nonzero. Once the primary feature is known, the backup features carry no additional signal. If the backup features were excluded, the primary feature would learn a weight w.

Result under lasso: all of the weight w goes on the primary feature, and the backup features get weights of zero. Among the equivalent fits (a, w - a, w - a), this is the one minimizing the sum of absolute weights, |a| + 2|w - a|.

Result under ridge: the primary feature gets weight (2/3)w and each backup feature gets weight (1/3)w. Among the same equivalent fits, this minimizes the sum of squared weights, a^2 + 2(w - a)^2, which is smallest at a = (2/3)w.
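Case 2 can be checked the same way. The sketch below (again using scikit-learn as an assumed library, with w = 3 and illustrative penalty strengths) builds a binary primary feature and two backup features that partition its support:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Primary feature nonzero on half the examples; two "backup" features
# nonzero on disjoint halves of the primary feature's support.
p  = np.repeat([1.0, 1.0, 0.0, 0.0], 50)
b1 = np.repeat([1.0, 0.0, 0.0, 0.0], 50)
b2 = np.repeat([0.0, 1.0, 0.0, 0.0], 50)
X = np.column_stack([p, b1, b2])
y = 3.0 * p                      # without backups, the primary would learn w = 3

# Lasso puts essentially all the weight on the primary: about [w, 0, 0].
lasso = Lasso(alpha=1e-3, fit_intercept=False).fit(X, y)

# Ridge spreads it: about [(2/3)w, (1/3)w, (1/3)w] = [2, 1, 1].
ridge = Ridge(alpha=1e-3, fit_intercept=False).fit(X, y)
```

Because p = b1 + b2 exactly, the design is rank-deficient: many weight vectors fit perfectly, and each penalty picks out a different one of them.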