Reference: www.cnblogs.com/massquantity/p/8640991.html

Preface:

So far I have done some data cleaning and feature extraction, i.e. feature-engineering work, and in the house-price prediction project I have used Random Forest, SVR, linear models, and XGBoost for regression, but the result only reached the top 65%. So I started studying Ensemble Learning, that is, model ensembling, to become more competitive in the contest.

Introduction to Ensemble Generation

Ensemble Learning combines multiple base models into one ensemble model. It can reduce both the bias and the variance of the final model, which raises the score and lowers the risk of overfitting. The common ensemble methods are:

  • Bagging: train each base model on a different random subset of the training data, then combine the base models by a (weighted) vote. This is the principle behind Random Forest.
  • Blending: train different base models on disjoint subsets of the data and average their outputs. It is simple to implement, but makes less use of the training data.
  • Stacking: feed the base models' out-of-fold predictions to a second-level meta-model (implemented below).
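To make the averaging idea concrete, here is a minimal, self-contained sketch of blending on toy data. The dataset and the two Ridge base models are illustrative assumptions, not the competition setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Toy data; the models and the disjoint half/half split are assumptions
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
half = len(X) // 2

m1 = Ridge(alpha=1.0).fit(X[:half], y[:half])    # base model on the first half
m2 = Ridge(alpha=10.0).fit(X[half:], y[half:])   # base model on the second half

# Blending: average the base models' outputs, one prediction per sample
blended = (m1.predict(X) + m2.predict(X)) / 2
print(blended.shape)
```

The weighted-average ensemble used later in the code generalizes this by giving each base model its own weight.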

Background notes

  • sklearn.base: base classes and utility functions
  • base.BaseEstimator: base class for all estimators in scikit-learn
  • base.ClassifierMixin: mixin class for all classifiers in scikit-learn
  • base.ClusterMixin: mixin class for all clustering estimators in scikit-learn
  • base.RegressorMixin: mixin class for all regression estimators in scikit-learn
  • base.TransformerMixin: mixin class for all transformers in scikit-learn
  • base.clone(estimator[, safe]): construct a new unfitted estimator with the same parameters
  • pd.set_option(): set pandas options, e.g. to keep a DataFrame from wrapping when printed
  • qcut(data, n): bin data into n groups by quantile; with 2 groups the split is at the median, with 4 groups at the quartiles.
  • groupby:
    grouping data can be summarized as three steps, split-apply-combine:
    1. Split the data into groups by key (the grouping variable).
    2. Apply a function to each group. This step is very flexible: the function can be a Python built-in or one we write ourselves.
    3. Combine the per-group results.

    - agg: pandas provides the agg function for column-based aggregation, while groupby can be viewed as aggregation over rows, i.e. over the index
  • get_dummies: one-hot encoding, whose advantages include:
    1. It can handle non-continuous (categorical) features.
    2. To some extent it also expands the feature set; for example, gender is a single feature, but after one-hot encoding it becomes two features, male and female.
    3. One-hot encoding maps each value of a discrete feature to a point in Euclidean space. In regression, classification, clustering, and other machine-learning algorithms, computing distances or similarities between features is essential, and the distances and similarities we commonly use (cosine similarity, for instance) are defined in Euclidean space.
    4. One-hot encoding discrete features makes distance computations between them more reasonable, roughly equivalent to a normalization step.
  • select_dtypes: select the columns whose dtypes you need
  • apply(func, *args, **kwargs): when a function's arguments are stored in a tuple or a dictionary, apply calls the function indirectly, passing the tuple's or dictionary's items to the parameters in order
  • skew: skewness, a measure of the direction and degree of asymmetry of a data distribution. Also called the skewness coefficient, it characterizes how asymmetric a probability density curve is relative to its mean; intuitively, it reflects the relative length of the density curve's tail. By definition it is the third standardized moment of the sample: skew(X) = E[((X − μ) / σ)³]
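A small self-contained demo of these utilities on toy data (the DataFrame below is an assumption for illustration, not the housing data):

```python
import pandas as pd
from scipy.stats import skew

# Toy frame: a symmetric numeric column and a categorical column
df = pd.DataFrame({"area": [50, 60, 70, 80, 90, 100, 110, 120],
                   "sex":  ["M", "F", "M", "F", "M", "F", "M", "F"]})

# qcut: two quantile bins, split at the median
df["area_bin"] = pd.qcut(df["area"], 2)
print(df["area_bin"].nunique())        # 2

# groupby + agg: split-apply-combine, column-wise aggregation per group
stats = df.groupby("sex")["area"].agg(["mean", "count"])
print(stats)

# get_dummies: one-hot encode the categorical column into 'F' and 'M'
dummies = pd.get_dummies(df["sex"])
print(list(dummies.columns))           # ['F', 'M']

# skew: third standardized moment; ~0 for this symmetric column
print(abs(skew(df["area"])) < 1e-9)    # True
```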

Code

#coding=utf-8
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge
from sklearn.kernel_ridge import KernelRidge
from scipy.stats import skew
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold

from xgboost import XGBRegressor

train = pd.read_csv("/home/zxy/PycharmProjects/Kagglemarsggbo/data/train.csv")
test = pd.read_csv("/home/zxy/PycharmProjects/Kagglemarsggbo/data/test.csv")

#Exploratory Visualization
#Drop outliers; inplace=True modifies the DataFrame in place instead of creating a new one
train.drop(train[(train["GrLivArea"]>4000)&(train["SalePrice"]<300000)].index, inplace=True)
full = pd.concat([train, test], ignore_index=True)
full.drop(['Id'], axis=1, inplace=True)

#Data Cleaning
full["LotAreaCut"] = pd.qcut(full.LotArea, 10)

full['LotFrontage']=full.groupby(['LotAreaCut'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))

cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)

cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish",
         "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)

# fill in with mode
cols2 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical", "KitchenQual", "SaleType","Exterior1st", "Exterior2nd"]
for col in cols2:
    full[col].fillna(full[col].mode()[0], inplace=True)

full['LotFrontage'] = full.groupby(['LotAreaCut', 'Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))

#Feature Engineering
full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean', 'median', 'count'])
#Convert some numerical features into categorical features. It's better to use LabelEncoder and get_dummies for these features.
NumStr = ["MSSubClass","BsmtFullBath","BsmtHalfBath","HalfBath","BedroomAbvGr","KitchenAbvGr","MoSold","YrSold","YearBuilt","YearRemodAdd","LowQualFinSF","GarageYrBlt"]
for col in NumStr:
    full[col]=full[col].astype(str)

def map_values():
    full["oMSSubClass"] = full.MSSubClass.map({'180': 1,
                                               '30': 2, '45': 2,
                                               '190': 3, '50': 3, '90': 3,
                                               '85': 4, '40': 4, '160': 4,
                                               '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
                                               '120': 6, '60': 6})

    full["oMSZoning"] = full.MSZoning.map({'C (all)': 1, 'RH': 2, 'RM': 2, 'RL': 3, 'FV': 4})

    full["oNeighborhood"] = full.Neighborhood.map({'MeadowV': 1,
                                                   'IDOTRR': 2, 'BrDale': 2,
                                                   'OldTown': 3, 'Edwards': 3, 'BrkSide': 3,
                                                   'Sawyer': 4, 'Blueste': 4, 'SWISU': 4, 'NAmes': 4,
                                                   'NPkVill': 5, 'Mitchel': 5,
                                                   'SawyerW': 6, 'Gilbert': 6, 'NWAmes': 6,
                                                   'Blmngtn': 7, 'CollgCr': 7, 'ClearCr': 7, 'Crawfor': 7,
                                                   'Veenker': 8, 'Somerst': 8, 'Timber': 8,
                                                   'StoneBr': 9,
                                                   'NoRidge': 10, 'NridgHt': 10})

    full["oCondition1"] = full.Condition1.map({'Artery': 1,
                                               'Feedr': 2, 'RRAe': 2,
                                               'Norm': 3, 'RRAn': 3,
                                               'PosN': 4, 'RRNe': 4,
                                               'PosA': 5, 'RRNn': 5})

    full["oBldgType"] = full.BldgType.map({'2fmCon': 1, 'Duplex': 1, 'Twnhs': 1, '1Fam': 2, 'TwnhsE': 2})

    full["oHouseStyle"] = full.HouseStyle.map({'1.5Unf': 1,
                                               '1.5Fin': 2, '2.5Unf': 2, 'SFoyer': 2,
                                               '1Story': 3, 'SLvl': 3,
                                               '2Story': 4, '2.5Fin': 4})

    full["oExterior1st"] = full.Exterior1st.map({'BrkComm': 1,
                                                 'AsphShn': 2, 'CBlock': 2, 'AsbShng': 2,
                                                 'WdShing': 3, 'Wd Sdng': 3, 'MetalSd': 3, 'Stucco': 3, 'HdBoard': 3,
                                                 'BrkFace': 4, 'Plywood': 4,
                                                 'VinylSd': 5,
                                                 'CemntBd': 6,
                                                 'Stone': 7, 'ImStucc': 7})

    full["oMasVnrType"] = full.MasVnrType.map({'BrkCmn': 1, 'None': 1, 'BrkFace': 2, 'Stone': 3})

    full["oExterQual"] = full.ExterQual.map({'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

    full["oFoundation"] = full.Foundation.map({'Slab': 1,
                                               'BrkTil': 2, 'CBlock': 2, 'Stone': 2,
                                               'Wood': 3, 'PConc': 4})

    full["oBsmtQual"] = full.BsmtQual.map({'Fa': 2, 'None': 1, 'TA': 3, 'Gd': 4, 'Ex': 5})

    full["oBsmtExposure"] = full.BsmtExposure.map({'None': 1, 'No': 2, 'Av': 3, 'Mn': 3, 'Gd': 4})

    full["oHeating"] = full.Heating.map({'Floor': 1, 'Grav': 1, 'Wall': 2, 'OthW': 3, 'GasW': 4, 'GasA': 5})

    full["oHeatingQC"] = full.HeatingQC.map({'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})

    full["oKitchenQual"] = full.KitchenQual.map({'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

    full["oFunctional"] = full.Functional.map(
        {'Maj2': 1, 'Maj1': 2, 'Min1': 2, 'Min2': 2, 'Mod': 2, 'Sev': 2, 'Typ': 3})

    full["oFireplaceQu"] = full.FireplaceQu.map({'None': 1, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})

    full["oGarageType"] = full.GarageType.map({'CarPort': 1, 'None': 1,
                                               'Detchd': 2,
                                               '2Types': 3, 'Basment': 3,
                                               'Attchd': 4, 'BuiltIn': 5})

    full["oGarageFinish"] = full.GarageFinish.map({'None': 1, 'Unf': 2, 'RFn': 3, 'Fin': 4})

    full["oPavedDrive"] = full.PavedDrive.map({'N': 1, 'P': 2, 'Y': 3})

    full["oSaleType"] = full.SaleType.map({'COD': 1, 'ConLD': 1, 'ConLI': 1, 'ConLw': 1, 'Oth': 1, 'WD': 1,
                                           'CWD': 2, 'Con': 3, 'New': 3})

    full["oSaleCondition"] = full.SaleCondition.map(
        {'AdjLand': 1, 'Abnorml': 2, 'Alloca': 2, 'Family': 2, 'Normal': 3, 'Partial': 4})

    return "Done!"

map_values()
full.drop(['LotAreaCut'],axis=1,inplace=True)
full.drop(['SalePrice'],axis=1,inplace=True)

#Pipeline
#Label Encoding three 'Year' features
class labelenc(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        lab = LabelEncoder()
        X["YearBuilt"] = lab.fit_transform(X["YearBuilt"])
        X["YearRemodAdd"] = lab.fit_transform(X["YearRemodAdd"])
        X["GarageYrBlt"] = lab.fit_transform(X["GarageYrBlt"])
        return X

#Apply log1p to the skewed features, then get_dummies
class skew_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, skew):
        self.skew = skew
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_numeric = X.select_dtypes(exclude=["object"])
        skewness = X_numeric.apply(lambda x : skew(x))
        skewness_features = skewness[abs(skewness) >= self.skew].index
        X[skewness_features] = np.log1p(X[skewness_features])
        x = pd.get_dummies(X)
        return x
#build pipeline
pipe = Pipeline([
    ('labenc', labelenc()),
    ('skew_dummies', skew_dummies(skew=1))
])

full2 = full.copy()
data_pipe = pipe.fit_transform(full2)
print(data_pipe.shape)
print(data_pipe.head())

#use robustscaler since maybe there are other outliers
scaler = RobustScaler()
n_train = train.shape[0]

X = data_pipe[:n_train]
test_X = data_pipe[n_train:]
y = train.SalePrice

X_scaled = scaler.fit(X).transform(X)
y_log = np.log(train.SalePrice)
test_X_scaled = scaler.transform(test_X)

#Feature Selection
class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
            return X

        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]

            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]

            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]

            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"] + X[
                "EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            return X

pipe = Pipeline([
    ('labenc', labelenc()),
    ('add_feature', add_feature(additional=2)),
    ('skew_dummies', skew_dummies(skew=1))
])

# PCA
full_pipe = pipe.fit_transform(full)
print(full_pipe.shape)

n_train = train.shape[0]
X = full_pipe[:n_train]
test_X = full_pipe[n_train:]
y = train.SalePrice

X_scaled = scaler.fit(X).transform(X)
y_log = np.log(train.SalePrice)
test_X_scaled = scaler.transform(test_X)

pca = PCA(n_components=410)
X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)
print(X_scaled.shape)
print(test_X_scaled.shape)

# Model building and evaluation
# k-fold cross-validation
def rmse_cv(model, X, y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

#13models
# models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000), RandomForestRegressor(), GradientBoostingRegressor(),
#           SVR(), LinearSVR(), ElasticNet(alpha=0.001, max_iter=10000), SGDRegressor(max_iter=1000,tol=1e-3), BayesianRidge(),
#           KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5), ExtraTreesRegressor(), XGBRegressor()]
#
# names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela", "SGD", "Bay", "Ker", "Extra", "Xgb"]
# for name, model in zip(names, models):
#     score = rmse_cv(model, X_scaled, y_log)
#     print("{}: {:.6f}, {:.4f}".format(name, score.mean(), score.std()))
#
# #hyperparameters tuning
# class grid():
#     def __init__(self, model):
#         self.model = model
#     def grid_get(self, X, y, param_grid):
#         grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
#         grid_search.fit(X, y)
#         print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
#         grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
#         print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
# #Lasso tuning
# grid(Lasso()).grid_get(X_scaled, y_log, {'alpha':[0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009], 'max_iter':[10000]})
# #Ridge tuning
# grid(Ridge()).grid_get(X_scaled, y_log, {'alpha':[35,40,45,50,55,60,65,70,80,90]})
# # SVR tuning
# grid(SVR()).grid_get(X_scaled, y_log, {'C':[11,12,13,14,15], 'kernel':["rbf"], "gamma":[0.0003,0.0004], "epsilon":[0.008,0.009]})
# # Kernel Ridge tuning
# grid(KernelRidge()).grid_get(X_scaled, y_log, {'alpha':[0.2,0.3,0.4,0.5], 'kernel':["polynomial"], 'degree':[3], 'coef0':[0.8,1,1.2]})
# # ElasticNet tuning
# grid(ElasticNet()).grid_get(X_scaled, y_log, {'alpha':[0.0005,0.0008,0.004,0.005], 'l1_ratio':[0.08,0.1,0.3,0.5,0.7], 'max_iter':[10000]})

# Ensemble methods: weighted average of base models
class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self,mod,weight):
        self.mod = mod
        self.weight = weight
    def fit(self,X,y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X,y)
        return self
    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        for data in range(pred.shape[1]):
            single = [pred[model,data]*weight for model,weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w

lasso = Lasso(alpha=0.0005, max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma=0.0004, kernel='rbf', C=13, epsilon=0.009)
ker = KernelRidge(alpha=0.2, kernel='polynomial', degree=3, coef0=0.8)
ela = ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000)
bay = BayesianRidge()
w1 = 0.02
w2 = 0.2
w3 = 0.25
w4 = 0.3
w5 = 0.03
w6 = 0.2
weight_avg = AverageWeight(mod = [lasso, ridge, svr, ker, ela, bay], weight = [w1, w2, w3, w4, w5, w6])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score, score.mean())

#Stacking

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))

        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index, i] = renew_model.predict(X[val_index])

        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # average each base model's per-fold predictions, then stack the columns for the meta-model
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

a = Imputer().fit_transform(X_scaled)
b = Imputer().fit_transform(y_log.values.reshape(-1,1)).ravel()

stack_model = stacking(mod=[lasso,ridge,svr,ker,ela,bay],meta_model=ker)
stack_model.fit(a, b)

pred = np.exp(stack_model.predict(test_X_scaled))

result = pd.DataFrame({'Id':test.Id, 'SalePrice':pred})
result.to_csv("Submission.csv", index=False)