Algorithm Principle

Ridge regression is simply linear regression with an L2 regularization term added, while lasso regression adds an L1 regularization term instead. For the linear regression algorithm itself, see my earlier post: https://blog.csdn.net/just_sort/article/details/101216607 . Building on linear regression, the loss function of ridge regression can be written as:
$$J_w = \min_{w}\left\{\left\| Xw - y \right\|^2 + \alpha \left\| w \right\|^2\right\}$$
where $\alpha \ge 0$. The larger $\alpha$ is, the stronger the effect of the regularization (penalty) term; the smaller $\alpha$ is, the weaker the penalty, and in the special case $\alpha = 0$ ridge regression degenerates into ordinary linear regression. Implementing ridge regression is therefore straightforward: we only need to add an extra term when computing the gradient in the linear regression implementation from before. Below, $w$ is derived by finding the stationary point of the corresponding Lagrangian formulation.
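To make the objective concrete, here is a minimal numpy sketch (my addition, not from the original post) that evaluates $J_w$ for a given weight vector; the names ridge_loss, X, y, w and alpha are purely illustrative:

import numpy as np

def ridge_loss(X, y, w, alpha):
    # ||Xw - y||^2 + alpha * ||w||^2
    residual = X.dot(w) - y
    return residual.dot(residual) + alpha * w.dot(w)

# Toy usage with random data
X = np.random.rand(5, 3)
y = np.random.rand(5)
print(ridge_loss(X, y, w=np.zeros(3), alpha=0.1))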

  • First, for linear regression, write the problem in its constrained (Lagrangian) form:
    $$J_w = \min_{w}\left\{\left\| Xw - y \right\|^2\right\} \quad \text{s.t.} \quad \left\| w \right\|^2 \leq t$$
  • In the Linear Regression post, the loss function of linear regression was written as:
    $$\left\| Xw - y \right\|^{2} = (Xw - y)^T (Xw - y)$$
  • Taking the derivative with respect to the parameter $w$ and setting it to zero gives:
    $$2X^T(Xw - y) = 0$$
  • Solving gives:
    $$w = (X^TX)^{-1}X^Ty$$
    Similarly, for ridge regression the loss function is:
    $$\left\| Xw - y \right\|^{2} + \alpha\left\| w \right\|^2 = (Xw - y)^T(Xw - y) + \alpha w^Tw$$
    Taking the derivative with respect to $w$ and setting it to zero gives:
    $$2X^T(Xw - y) + 2\alpha w = 0$$
  • Solving gives (a small numpy sketch of this closed form follows the list):
    $$w = (X^TX + \alpha I)^{-1}X^Ty$$
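The closed-form solution can be checked numerically. The sketch below is my addition (numpy assumed); it solves the linear system $(X^TX + \alpha I)w = X^Ty$ rather than forming the inverse, and setting alpha = 0 recovers the ordinary least-squares solution:

import numpy as np

def ridge_closed_form(X, y, alpha):
    # Solve (X^T X + alpha * I) w = X^T y instead of inverting explicitly
    n_features = X.shape[1]
    A = X.T.dot(X) + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

X = np.random.rand(100, 3)
y = X.dot(np.array([1.0, -2.0, 0.5])) + 0.01 * np.random.randn(100)
print(ridge_closed_form(X, y, alpha=0.0))  # ordinary least squares
print(ridge_closed_form(X, y, alpha=1.0))  # weights shrunk toward zero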

OK, that completes the derivation. However, in order to reuse the linear regression code from last time, we will still use SGD to optimize the ridge regression objective. The implementation is as follows:

#coding=utf-8
from linear_regression import LinearRegression
from linear_regression import run_time
from linear_regression import load_data
from linear_regression import min_max_scale
from linear_regression import train_test_split
from linear_regression import get_r2

from numpy import ndarray

class Ridge(LinearRegression):
    """脊回归类
        损失函数:
        L = (y - y_hat) ^ 2 + L2
        L = (y - W * X - b) ^ 2 + α * (W, b) ^ 2
        Get partial derivative of W:
        dL/dW = -2 * (y - W * X - b) * X + 2 * α * W
        dL/dW = -2 * (y - y_hat) * X + 2 * α * W
        Get partial derivative of b:
        dL/db = -2 * (y - W * X - b) + 2 * α * b
        dL/db = -2 * (y - y_hat) + 2 * α * b
        ----------------------------------------------------------------
        超参数:
            bias: b
            weights: W
            alpha: α
    """
    def __init__(self):
        super(Ridge, self).__init__()
        self.alpha = None

    def get_gradient_delta(self, Xi, yi):
        # Negative gradient contribution of a single sample (Xi, yi),
        # including the extra L2 penalty terms alpha * w and alpha * b.
        y_hat = self._predict(Xi)
        bias_grad_delta = yi - y_hat - self.alpha * self.bias
        weights_grad_delta = [(yi - y_hat) * Xij - self.alpha * wj
                              for Xij, wj in zip(Xi, self.weights)]
        return bias_grad_delta, weights_grad_delta

    def fit(self, X, y, lr, epochs, alpha, method="batch", sample_rate=1.0):
        self.alpha = alpha
        assert method in ("batch", "stochastic")
        if method == "batch":
            self.batch_gradient_descent(X, y, lr, epochs)
        if method == "stochastic":
            self.stochastic_gradient_descent(X, y, lr, epochs, sample_rate)

@run_time
def main():
    print("Tesing the performance of Ridge Regressor(stochastic)...")
    # Load data
    data, label = load_data()
    data = min_max_scale(data)
    # Split data randomly, train set rate 70%
    data_train, data_test, label_train, label_test = train_test_split(data, label, random_state=10)
    # Train model
    reg = Ridge()
    reg.fit(X=data_train, y=label_train, lr=0.001, epochs=1000, method="stochastic", sample_rate=0.5, alpha=1e-4)
    # Model evaluation
    get_r2(reg, data_test, label_test)


if __name__ == "__main__":
    main()
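As an optional sanity check (my addition, not part of the original post), the SGD-trained model can be compared against the closed-form solution derived earlier. This assumes the snippet runs in the same file as the Ridge class above, that load_data / min_max_scale behave as in the code above, and that reg.bias / reg.weights hold the learned parameters:

import numpy as np
from linear_regression import load_data
from linear_regression import min_max_scale

data, label = load_data()
data = min_max_scale(data)
X, y = np.array(data), np.array(label)
# Fold the bias into the weight vector by prepending a column of ones
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
alpha = 1e-4
w = np.linalg.solve(Xb.T.dot(Xb) + alpha * np.eye(Xb.shape[1]), Xb.T.dot(y))

reg = Ridge()
reg.fit(X=data, y=label, lr=0.001, epochs=1000, method="stochastic",
        sample_rate=0.5, alpha=alpha)
print("closed-form bias:", w[0], " SGD bias:", reg.bias)
print("closed-form weights:", w[1:], " SGD weights:", reg.weights)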

Prediction results on the Boston housing dataset