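5.1 Learning Objectives

  • Understand the commonly used machine learning models, and master the workflow for building and tuning them
  • Complete the corresponding check-in learning tasks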

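5.2 Contents

  1. Linear regression model:
    • What linear regression requires of the features;
    • Handling long-tailed distributions;
    • Understanding the linear regression model;
  2. Validating model performance:
    • Evaluation functions and objective functions;
    • Cross-validation;
    • Leave-one-out validation;
    • Validation for time-series problems;
    • Plotting learning curves;
    • Plotting validation curves;
  3. Embedded feature selection:
    • Lasso regression;
    • Ridge regression;
    • Decision trees;
  4. Model comparison:
    • Common linear models;
    • Common non-linear models;
  5. Model tuning:
    • Greedy tuning;
    • Grid search tuning;
    • Bayesian tuning;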

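5.3 Background Theory and Recommended Reading

Because the theory behind these algorithms takes considerable space to explain, this section instead recommends some blog posts and textbooks for beginners.

5.3.1 Linear regression model

https://zhuanlan.zhihu.com/p/49480391

5.3.2 Decision tree model

https://zhuanlan.zhihu.com/p/65304798

5.3.3 GBDT model

https://zhuanlan.zhihu.com/p/45145899

5.3.4 XGBoost model

https://zhuanlan.zhihu.com/p/86816771

5.3.5 LightGBM model

https://zhuanlan.zhihu.com/p/89360721

5.3.6 Recommended textbooks: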

5.4 Model Tuning

Use xgb with five-fold cross-validation to check how well a given set of model parameters performs

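## xgb-Model
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# X_data (features) and Y_data (target) are assumed to be pandas objects prepared earlier
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'

scores_train = []  # training MAE of each fold
scores = []        # validation MAE of each fold

## 5-fold cross-validation
## KFold is used instead of StratifiedKFold, which stratifies on class labels and does not suit a continuous regression target
sk=KFold(n_splits=5,shuffle=True,random_state=0)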
for train_ind,val_ind in sk.split(X_data,Y_data):
    
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

print('Train mae:',np.mean(scores_train))
print('Val mae:',np.mean(scores))
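One way to actually "check the effect of a parameter" with cross-validation is to repeat the same five-fold evaluation for a few candidate values and compare the mean validation MAE. The sketch below is illustrative only: it uses synthetic data from `make_regression` and scikit-learn's `GradientBoostingRegressor` as a stand-in for `xgb.XGBRegressor`, and the candidate values in `depth_candidates` are hypothetical, not taken from this tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_data / Y_data (assumption: not the tutorial's dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
depth_candidates = [2, 3, 5]  # hypothetical candidate values for max_depth
mae_by_depth = {}

for depth in depth_candidates:
    model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                      max_depth=depth, random_state=0)
    # scikit-learn maximizes scores, so MAE is reported as a negative number
    neg_mae = cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring='neg_mean_absolute_error')
    mae_by_depth[depth] = -neg_mae.mean()
    print('max_depth =', depth, 'Val mae:', mae_by_depth[depth])

# Keep the value with the lowest mean validation MAE
best_depth = min(mae_by_depth, key=mae_by_depth.get)
print('Best max_depth:', best_depth)
```

Tuning one parameter at a time like this, while holding the others fixed, is roughly the idea behind the greedy tuning listed in the contents; it is cheap but can miss interactions between parameters.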

Define the functions that build the xgb and lgb models

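import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

def build_model_xgb(x_train,y_train):
    # Fit an XGBoost regressor with fixed hyperparameters
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    # Grid-search the learning rate of a LightGBM regressor
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    # GridSearchCV refits the best setting on all of x_train, so the returned object supports predict()
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm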
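Since `build_model_lgb` returns the fitted `GridSearchCV` object rather than a bare model, it can be useful to look at which learning rate was selected. The following is a minimal sketch under stated assumptions: synthetic data, scikit-learn's `GradientBoostingRegressor` swapped in for `lgb.LGBMRegressor`, and an explicit MAE scorer (the function above relies on the estimator's default score instead).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the tutorial's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

estimator = GradientBoostingRegressor(n_estimators=30, random_state=0)
param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}

# Score each candidate by (negated) MAE with 5-fold cross-validation
search = GridSearchCV(estimator, param_grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Best CV MAE:', -search.best_score_)

# predict() delegates to the best estimator, refit on all the data passed to fit()
preds = search.predict(X_demo[:5])
```

The per-candidate results are also available in `search.cv_results_`, which helps when the best value sits at the edge of the grid and the grid should be widened.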

Split the dataset (Train, Val) for model training, evaluation, and prediction

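## Split the data into training and validation sets
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)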

print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)

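print('Predict lgb...')
# Retrain on all labelled data before predicting; X_test and Sta_inf are not defined in this section
model_lgb_pre = build_model_lgb(X_data,Y_data)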
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)

print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)

print('Predict xgb...')
# Retrain on all labelled data before predicting on X_test
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
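The code above follows a two-step pattern: first estimate the error on a held-out validation split, then retrain on all labelled data and predict the unlabelled test set. Because `X_data`, `Y_data`, `X_test` and `Sta_inf` are not defined in this section, the sketch below reproduces the pattern on synthetic data so that it can be run on its own. `describe_predictions` is a hypothetical stand-in for `Sta_inf`, and `GradientBoostingRegressor` is swapped in for the xgb/lgb models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def describe_predictions(values):
    """Hypothetical stand-in for Sta_inf: basic summary statistics."""
    values = np.asarray(values)
    return {'min': values.min(), 'max': values.max(),
            'mean': values.mean(), 'std': values.std()}

def build_model(x, y):
    # Stand-in for build_model_xgb / build_model_lgb
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(x, y)
    return model

# Synthetic data: the first 300 rows play the labelled set, the last 100 the test set
X_all, y_all = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)
X_lab, y_lab, X_unl = X_all[:300], y_all[:300], X_all[300:]

# Step 1: hold out 30% of the labelled data to estimate the error
x_tr, x_va, y_tr, y_va = train_test_split(X_lab, y_lab, test_size=0.3, random_state=0)
val_mae = mean_absolute_error(y_va, build_model(x_tr, y_tr).predict(x_va))
print('MAE of val:', val_mae)

# Step 2: retrain on all labelled data, then predict the unlabelled rows
final_model = build_model(X_lab, y_lab)
test_pred = final_model.predict(X_unl)
stats = describe_predictions(test_pred)
print('Sta of Predict:', stats)
```

Fixing `random_state` in `train_test_split` makes the validation MAE reproducible between runs, which matters when comparing two models on the same split.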