I'm a complete beginner; this is purely a record-and-share write-up, so please go easy on me.
To be honest, I picked this title because I was afraid you wouldn't read the content carefully otherwise; it's pure clickbait.
Final result
0.7805 on the B leaderboard, around 50th place, out of just under 1,000 participants (900+).
1 Competition Overview
The CMB (China Merchants Bank) competition.
Competition rules
1. Competition window: April 29, 11:00 to May 12, 17:00;
2. The contest runs as a data competition. From April 29, 11:00 to May 9, 24:00 the track serves the A-board data, with at most 5 result submissions per day; from May 10, 00:00 to May 12, 17:00 it serves the B-board data, with at most 3 submissions per day. After submitting, be sure to click the "Run" button to see your current ranking; the final ranking is determined by the B-board score;
3. Be sure to submit the code behind your final result before the deadline.
Important notice: do not share the competition problems externally. The platform monitors the entire competition and reviews submitted results; any cheating leads to disqualification and an entry in CMB's campus-recruitment integrity record.
Problem statement
[Important notice for the CMB 2020 FinTech Elite Training Camp data track]
Dear participants:
As previously announced, the platform has updated the dataset and adjusted some fields. Please review the data carefully before use and re-submit your results.
Questions can be sent to kaoshi@nowcoder.com or to QQ group 732244713 (when joining, provide your name + registration phone number). Good luck!
CMB FinTech Elite Training Camp Project Team
April 30, 2020
Credit Risk Scoring
I. Background
In today's big-data era, credit scores are no longer used only in financial scenarios such as credit card and loan applications; similar scoring products have reached every corner of daily life, for example deposit-free power-bank rentals and pay-after-ride taxi services, and they even have a place in recruiting and matchmaking.
As a FinTech pioneer, CMB's app has over a hundred million monthly active users, and its services cover not only financial scenarios such as payments, wealth management, and credit, but also non-financial ones such as meal vouchers, movie tickets, travel, and news. This makes it possible to build a credit score for each user and use it to provide better, more convenient services.
II. Task requirements
The competition provides two datasets (a training dataset and a scoring dataset) containing user tag data, transaction behavior over the past 60 days, and APP behavior over the past 30 days. Participants are expected to engineer effective features from the training dataset, build a credit-default prediction model, apply it to the scoring dataset, and output a default probability for every user in the scoring dataset.
III. Evaluation metric
Submissions are scored by AUC:
AUC = (1 / (|D^+| |D^-|)) * Σ_{x^+ ∈ D^+} Σ_{x^- ∈ D^-} I( f(x^+) > f(x^-) )
where D^+ and D^- are the sets of users in the scoring dataset who did and did not default, |D^+| and |D^-| are the sizes of those sets, f(x) is the participant's estimated default probability for user x, and I is the indicator function.
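Since this metric depends only on how the predictions rank the users, it can be checked locally with scikit-learn's roc_auc_score; a minimal sketch (the toy labels and scores are made up):

from sklearn.metrics import roc_auc_score

# y_true: 0/1 default labels; y_score: predicted default probabilities.
# roc_auc_score computes exactly the pairwise ranking quantity above.
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]
print(roc_auc_score(y_true, y_score))  # 0.75: 3 of the 4 (+,-) pairs are ordered correctly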
IV. Data description
1. 训练数据集_tag.csv and 评分数据集_tag.csv contain the user tag data for the training and scoring datasets;
2. 训练数据集_trd.csv and 评分数据集_trd.csv contain each user's transaction behavior over 60 days;
3. 训练数据集_beh.csv and 评分数据集_beh.csv contain each user's APP behavior over 30 days;
4. 数据说明.xlsx contains the field descriptions and sample data;
5. Submission format:
5.1 Submit a single txt file encoded as UTF-8 without BOM.
5.2 Output the predicted default probability for every user in the scoring dataset. Each line holds one user's result: user id and predicted default probability separated by \t; make sure no users are missing and none are extra.
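A minimal sketch of writing a compliant submission with pandas (the ids and probabilities here are made up):

import pandas as pd

# One row per user: user id \t predicted default probability, no header.
sub = pd.DataFrame({'id': [101, 102, 103], 'prob': [0.13, 0.87, 0.42]})
sub.to_csv('submission.txt', sep='\t', index=False, header=False, encoding='utf-8')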
Source: https://www.nowcoder.com/activity/2020cmb/index?type=2
2 Approach Overview
Three tables: tag, beh, and trd.
Training set: 39,923 rows; A-board test set: 6,000 rows; B-board test set: 4,000 rows.
tag and trd cover every user, but beh covers only about a third of them (roughly 11,000 rows in the training set, about 2,000 in the A-board test set, and about 1,200 in the B-board test set).
For me the score dropped every time I added beh features, and I never found a good way to use that table, so the features are built mainly from tag and trd: tag alone reaches about 0.69 and trd alone about 0.72. The trd table carries timestamps, so it supports far more features; in FinTech data, time-series information tends to be the most valuable. A sketch of the beh attempt follows, and Section 3 walks through the full code.
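For reference, a minimal sketch of the kind of simple beh aggregate I tried before dropping the table (the column names page_no and page_tm are assumptions about the beh schema, and train_beh.csv mirrors the file naming used in Section 3):

# per-user counts from the APP behavior table
Train_beh = pd.read_csv('../data/train_data/train_beh.csv')
beh_fea = Train_beh.groupby('id', as_index=False).agg(
    beh_count=('page_tm', 'count'),           # total page visits in 30 days
    beh_page_nunique=('page_no', 'nunique'))  # number of distinct pages visited
# Left join: the ~2/3 of users with no beh rows end up NaN here,
# which may be part of why these features hurt the score.
Train_data = pd.merge(Train_data, beh_fea, on='id', how='left')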
3 Code
# -*- coding: utf-8 -*-
'''
Features from the tag and trd tables (beh ended up unused; see Section 2);
prediction with xgb and lgb, rank-fused at the end.
'''
### Basic utility imports
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
from dateutil.parser import parse
import datetime

### Model imports
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

## Dimensionality reduction
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, SparsePCA

import lightgbm as lgb
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot

## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

### Step 1: read the data with pandas
data_path = '../data/'
Train_data = pd.read_csv(data_path + 'train_data/train_tag.csv')
TestA_data = pd.read_csv(data_path + 'test_data/test_tag_b.csv')
Train_data = Train_data.replace('\\N', -2)
TestA_data = TestA_data.replace('\\N', -2)
num_ori = len(TestA_data.columns)

### Step 2: build features from the trd table
Train_trd = pd.read_csv(data_path + 'train_data/train_trd.csv').sort_values(['id', 'trx_tm'])
TestA_trd = pd.read_csv(data_path + 'test_data/test_trd_b.csv').sort_values(['id', 'trx_tm'])

def fea_cishu(df, name):  # per-user transaction count
    tmp = df.groupby(['id'], as_index=False).count()
    tmp = tmp[['id', 'trx_tm']]
    tmp = tmp.rename(columns={'trx_tm': (name + 'count')})
    return tmp

def amt_sum(df, name):  # per-user amount sum and mean
    tmp_sum = df.groupby(['id'], as_index=False).sum()
    tmp_sum = tmp_sum[['id', 'cny_trx_amt']]
    tmp_sum = tmp_sum.rename(columns={'cny_trx_amt': (name + 'sum')})
    tmp_mean = df.groupby(['id'], as_index=False).mean()
    tmp_mean = tmp_mean[['id', 'cny_trx_amt']]
    tmp_mean = tmp_mean.rename(columns={'cny_trx_amt': (name + 'mean')})
    tmp = pd.merge(tmp_sum, tmp_mean, on='id', how='left')
    return tmp

def get_fea_cishu(TestA_trd_fea, df, name):
    TestA_trd_fea_zhichu = fea_cishu(df, name + '_trd_')
    TestA_trd_fea = pd.merge(TestA_trd_fea, TestA_trd_fea_zhichu, on='id', how='left')
    return TestA_trd_fea

def get_amt_sum(TestA_trd_fea, df, name):
    TestA_trd_fea_zhichu = amt_sum(df, name + '_trd_amt_')
    TestA_trd_fea = pd.merge(TestA_trd_fea, TestA_trd_fea_zhichu, on='id', how='left')
    return TestA_trd_fea

def get_long_trd_fea(TestA_trd, pre_name):
    # full set of count/amount features, used on slices with enough data
    TestA_trd_fea = fea_cishu(TestA_trd, pre_name + 'trd_count')
    # per-user amount totals
    tmp = amt_sum(TestA_trd, pre_name + 'trd_amt_sum')
    TestA_trd_fea = pd.merge(TestA_trd_fea, tmp, on='id', how='left')
    ## split expenditure (B) from income (C)
    TestA_trd_zhichu = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'B']
    TestA_trd_shouru = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'C']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    ## split by payment method
    TestA_trd_A = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'A']
    TestA_trd_B = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'B']
    TestA_trd_C = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'C']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_A, pre_name + 'A')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_C, pre_name + 'C')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_A, pre_name + 'A')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_C, pre_name + 'C')
    ## split by level-1 transaction category
    TestA_trd_cod1_1 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 1]
    TestA_trd_cod1_2 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 2]
    TestA_trd_cod1_3 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 3]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_2, pre_name + 'cod1_2')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_2, pre_name + 'cod1_2')  # bug fix: the original passed TestA_trd_cod1_1 here
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    ## split by level-2 transaction category (a hand-picked set of codes)
    TestA_trd_cod2_136 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 136]
    TestA_trd_cod2_132 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 132]
    TestA_trd_cod2_309 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 309]
    TestA_trd_cod2_308 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 308]
    TestA_trd_cod2_213 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 213]
    TestA_trd_cod2_111 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 111]
    TestA_trd_cod2_103 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 103]
    TestA_trd_cod2_117 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 117]
    TestA_trd_cod2_208 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 208]
    TestA_trd_cod2_102 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 102]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_136, pre_name + 'cod2_136')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_132, pre_name + 'cod2_132')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_308, pre_name + 'cod2_308')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_213, pre_name + 'cod2_213')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_111, pre_name + 'cod2_111')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_103, pre_name + 'cod2_103')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_117, pre_name + 'cod2_117')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_208, pre_name + 'cod2_208')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_102, pre_name + 'cod2_102')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_136, pre_name + 'cod2_136')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_132, pre_name + 'cod2_132')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_308, pre_name + 'cod2_308')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_213, pre_name + 'cod2_213')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_111, pre_name + 'cod2_111')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_103, pre_name + 'cod2_103')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_117, pre_name + 'cod2_117')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_208, pre_name + 'cod2_208')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_102, pre_name + 'cod2_102')
    return TestA_trd_fea

def get_short_trd_fea(TestA_trd, pre_name):
    # leaner variant for sparse slices (short windows, intraday periods, weekday/weekend)
    TestA_trd_fea = fea_cishu(TestA_trd, pre_name + 'trd_count')
    # per-user amount totals
    tmp = amt_sum(TestA_trd, pre_name + 'trd_amt_sum')
    TestA_trd_fea = pd.merge(TestA_trd_fea, tmp, on='id', how='left')
    ## split expenditure from income
    TestA_trd_zhichu = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'B']
    TestA_trd_shouru = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'C']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    ## payment method B only
    TestA_trd_B = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'B']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    ## level-1 categories 1 and 3 only
    TestA_trd_cod1_1 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 1]
    TestA_trd_cod1_3 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 3]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    ## level-2 category 309 only
    TestA_trd_cod2_309 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 309]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    return TestA_trd_fea

# features over the most recent 15 days / 1 month / 45 days
TestA_trd['trx_tm'] = TestA_trd['trx_tm'].apply(lambda x: parse(x))
I15day_TestA_trd = TestA_trd[TestA_trd['trx_tm'] > datetime.datetime(2019, 6, 15, 23, 59, 59)]
I1month_TestA_trd = TestA_trd[TestA_trd['trx_tm'] > datetime.datetime(2019, 5, 30, 23, 59, 59)]
I45day_TestA_trd = TestA_trd[TestA_trd['trx_tm'] > datetime.datetime(2019, 5, 15, 23, 59, 59)]
Train_trd['trx_tm'] = Train_trd['trx_tm'].apply(lambda x: parse(x))
I15day_Train_trd = Train_trd[Train_trd['trx_tm'] > datetime.datetime(2019, 6, 15, 23, 59, 59)]
I1month_Train_trd = Train_trd[Train_trd['trx_tm'] > datetime.datetime(2019, 5, 30, 23, 59, 59)]
I45day_Train_trd = Train_trd[Train_trd['trx_tm'] > datetime.datetime(2019, 5, 15, 23, 59, 59)]
I15day_Train_trd_fea = get_short_trd_fea(I15day_Train_trd, 'I15d_')
I1month_Train_trd_fea = get_long_trd_fea(I1month_Train_trd, 'I1m_')
I45day_Train_trd_fea = get_long_trd_fea(I45day_Train_trd, 'I45d_')
I15day_TestA_trd_fea = get_short_trd_fea(I15day_TestA_trd, 'I15d_')
I1month_TestA_trd_fea = get_long_trd_fea(I1month_TestA_trd, 'I1m_')
I45day_TestA_trd_fea = get_long_trd_fea(I45day_TestA_trd, 'I45d_')

# features for four intraday periods
# (as written, transactions at exactly 6:00, 12:00 or 18:00 fall into no bucket)
Train_trd['hour'] = Train_trd['trx_tm'].apply(lambda x: x.hour)
TestA_trd['hour'] = TestA_trd['trx_tm'].apply(lambda x: x.hour)
Per_1_Train_trd = Train_trd[Train_trd['hour'] < 6]
Per_2_Train_trd = Train_trd[(Train_trd['hour'] > 6) & (Train_trd['hour'] < 12)]
Per_3_Train_trd = Train_trd[(Train_trd['hour'] > 12) & (Train_trd['hour'] < 18)]
Per_4_Train_trd = Train_trd[(Train_trd['hour'] > 18) & (Train_trd['hour'] < 24)]
Per_1_TestA_trd = TestA_trd[TestA_trd['hour'] < 6]
Per_2_TestA_trd = TestA_trd[(TestA_trd['hour'] > 6) & (TestA_trd['hour'] < 12)]
Per_3_TestA_trd = TestA_trd[(TestA_trd['hour'] > 12) & (TestA_trd['hour'] < 18)]
Per_4_TestA_trd = TestA_trd[(TestA_trd['hour'] > 18) & (TestA_trd['hour'] < 24)]
Per_1_Train_trd_fea = get_short_trd_fea(Per_1_Train_trd, 'Per_1_')
Per_2_Train_trd_fea = get_short_trd_fea(Per_2_Train_trd, 'Per_2_')
Per_3_Train_trd_fea = get_short_trd_fea(Per_3_Train_trd, 'Per_3_')
Per_4_Train_trd_fea = get_short_trd_fea(Per_4_Train_trd, 'Per_4_')
Per_1_TestA_trd_fea = get_short_trd_fea(Per_1_TestA_trd, 'Per_1_')
Per_2_TestA_trd_fea = get_short_trd_fea(Per_2_TestA_trd, 'Per_2_')
Per_3_TestA_trd_fea = get_short_trd_fea(Per_3_TestA_trd, 'Per_3_')
Per_4_TestA_trd_fea = get_short_trd_fea(Per_4_TestA_trd, 'Per_4_')

# weekday vs weekend features
TestA_trd['weekend_bool'] = TestA_trd['trx_tm'].apply(lambda x: x.dayofweek in [5, 6])
Train_trd['weekend_bool'] = Train_trd['trx_tm'].apply(lambda x: x.dayofweek in [5, 6])
TestA_trd_weekday = TestA_trd[TestA_trd['weekend_bool'] == False]
TestA_trd_weekend = TestA_trd[TestA_trd['weekend_bool'] == True]
Train_trd_weekday = Train_trd[Train_trd['weekend_bool'] == False]
Train_trd_weekend = Train_trd[Train_trd['weekend_bool'] == True]
weekday_TestA_trd_fea = get_short_trd_fea(TestA_trd_weekday, 'weekday_')
weekend_TestA_trd_fea = get_short_trd_fea(TestA_trd_weekend, 'weekend_')
weekday_Train_trd_fea = get_short_trd_fea(Train_trd_weekday, 'weekday_')
weekend_Train_trd_fea = get_short_trd_fea(Train_trd_weekend, 'weekend_')

## features over the full 60-day range
Train_trd_fea = get_long_trd_fea(Train_trd, '')
TestA_trd_fea = get_long_trd_fea(TestA_trd, '')

# merge every feature block onto the tag tables
Train_data = pd.merge(Train_data, Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, I1month_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, I15day_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, I45day_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, Per_1_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, Per_2_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, Per_3_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, Per_4_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, weekday_Train_trd_fea, on='id', how='left')
Train_data = pd.merge(Train_data, weekend_Train_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, I1month_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, I15day_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, I45day_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, Per_1_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, Per_2_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, Per_3_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, Per_4_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, weekday_TestA_trd_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, weekend_TestA_trd_fea, on='id', how='left')

'''
interval features
'''
### Step 2 (cont.): features from the time gaps between consecutive transactions
def get_jiange(TestA_trd):
    # shift trx_tm down by one row to get the previous transaction time
    list_tmp = list(TestA_trd['trx_tm'])
    list_tmp.insert(0, list_tmp[0])
    list_tmp = list_tmp[:-1]
    TestA_trd['trx_tm_2'] = list_tmp
    TestA_trd['jiange_trx_tm'] = TestA_trd['trx_tm'] - TestA_trd['trx_tm_2']
    # drop zero gaps; note the first row of each user still gets a gap
    # measured against the previous user's last transaction
    TestA_trd = TestA_trd[TestA_trd['jiange_trx_tm'] > datetime.timedelta(seconds=0.01)]
    TestA_trd['jiange_trx_tm'] = TestA_trd['jiange_trx_tm'].apply(lambda x: x.days + round(x.seconds / 86400, 2))
    del TestA_trd['trx_tm_2']
    return TestA_trd

TestA_trd_jiange = get_jiange(TestA_trd)
Train_trd_jiange = get_jiange(Train_trd)

def jiange_sta(df, name):  # per-user mean/max/min of the gaps
    tmp_mean = df.groupby(['id'], as_index=False).mean()
    tmp_mean = tmp_mean[['id', 'jiange_trx_tm']]
    tmp_mean = tmp_mean.rename(columns={'jiange_trx_tm': (name + 'jiange_mean')})
    tmp_max = df.groupby(['id'], as_index=False).max()
    tmp_max = tmp_max[['id', 'jiange_trx_tm']]
    tmp_max = tmp_max.rename(columns={'jiange_trx_tm': (name + 'jiange_max')})
    tmp_min = df.groupby(['id'], as_index=False).min()
    tmp_min = tmp_min[['id', 'jiange_trx_tm']]
    tmp_min = tmp_min.rename(columns={'jiange_trx_tm': (name + 'jiange_min')})
    tmp = pd.merge(tmp_mean, tmp_max, on='id', how='left')
    tmp = pd.merge(tmp, tmp_min, on='id', how='left')
    return tmp

def get_jiange_sta(TestA_trd_fea, df, name):
    TestA_trd_fea_zhichu = jiange_sta(df, name + '_trd_')
    TestA_trd_fea = pd.merge(TestA_trd_fea, TestA_trd_fea_zhichu, on='id', how='left')
    return TestA_trd_fea

def get_jiange_fea(TestA_trd_jiange, pre_name):
    # overall gap statistics first
    TestA_trd_jiange_fea = jiange_sta(TestA_trd_jiange, pre_name + 'trd_count')
    ## expenditure vs income
    TestA_trd_jiange_zhichu = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg1_Cd'] == 'B']
    TestA_trd_jiange_shouru = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg1_Cd'] == 'C']
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_zhichu, pre_name + 'zhichu')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_shouru, pre_name + 'shouru')
    ## by payment method
    TestA_trd_jiange_A = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg3_Cd'] == 'A']
    TestA_trd_jiange_B = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg3_Cd'] == 'B']
    TestA_trd_jiange_C = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg3_Cd'] == 'C']
    # gap statistics per method
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_A, pre_name + 'A')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_B, pre_name + 'B')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_C, pre_name + 'C')
    ## by level-1 category
    TestA_trd_jiange_cod1_1 = TestA_trd_jiange[TestA_trd_jiange['Trx_Cod1_Cd'] == 1]
    TestA_trd_jiange_cod1_2 = TestA_trd_jiange[TestA_trd_jiange['Trx_Cod1_Cd'] == 2]
    TestA_trd_jiange_cod1_3 = TestA_trd_jiange[TestA_trd_jiange['Trx_Cod1_Cd'] == 3]
    # gap statistics per category
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_cod1_1, pre_name + 'cod1_1')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_cod1_2, pre_name + 'cod1_2')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_cod1_3, pre_name + 'cod1_3')
    return TestA_trd_jiange_fea

Train_trd_jiange_fea = get_jiange_fea(Train_trd_jiange, '')
TestA_trd_jiange_fea = get_jiange_fea(TestA_trd_jiange, '')

# keep only a subset of the gap features
jiange_fea_cols = ['id', 'trd_countjiange_mean', 'trd_countjiange_max', 'trd_countjiange_min',
                   'zhichu_trd_jiange_mean', 'zhichu_trd_jiange_max', 'zhichu_trd_jiange_min',
                   'shouru_trd_jiange_mean', 'shouru_trd_jiange_max', 'shouru_trd_jiange_min',
                   'B_trd_jiange_mean', 'B_trd_jiange_max', 'B_trd_jiange_min',
                   'cod1_1_trd_jiange_mean', 'cod1_1_trd_jiange_max', 'cod1_1_trd_jiange_min',
                   'cod1_3_trd_jiange_mean', 'cod1_3_trd_jiange_max', 'cod1_3_trd_jiange_min']
Train_trd_jiange_fea = Train_trd_jiange_fea[jiange_fea_cols]
TestA_trd_jiange_fea = TestA_trd_jiange_fea[jiange_fea_cols]
Train_data = pd.merge(Train_data, Train_trd_jiange_fea, on='id', how='left')
TestA_data = pd.merge(TestA_data, TestA_trd_jiange_fea, on='id', how='left')
num_trd = len(TestA_data.columns)

### Step 3: build features from the tag table
def jin_fea(quan_data, col_a, col_b):
    # +, -, *, / combinations of the four amount-tier fields:
    # 'oneyear' = 1-year credit-card consumption tier, 'yongjiu' = permanent credit limit tier,
    # 'aum' = 6-month AUM tier, 'qianli' = potential asset tier
    dict_jin = {'oneyear': 'l1y_crd_card_csm_amt_dlm_cd', 'yongjiu': 'perm_crd_lmt_cd',
                'aum': 'l6mon_daim_aum_cd', 'qianli': 'pot_ast_lvl_cd'}
    quan_data['jin_' + col_a + '_jia_' + col_b] = quan_data[dict_jin[col_a]] + quan_data[dict_jin[col_b]]
    quan_data['jin_' + col_a + '_jian_' + col_b] = quan_data[dict_jin[col_a]] - quan_data[dict_jin[col_b]]
    quan_data['jin_' + col_a + '_cheng_' + col_b] = quan_data[dict_jin[col_a]] * quan_data[dict_jin[col_b]]
    quan_data['jin_' + col_a + '_chu_' + col_b] = quan_data[dict_jin[col_a]] / (quan_data[dict_jin[col_b]] + 3)
    return quan_data

def tag_fea(quan_data):
    print('tag rows:', len(quan_data))
    # convert whatever can be converted to numeric
    for col in quan_data.columns:
        try:
            quan_data[col] = quan_data[col].apply(float)
        except:
            print('column cannot be converted to float: {}'.format(col))
    ### within-group combinations
    quan_data['crecnt_chu_lvl'] = quan_data['cur_credit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 5)  # credit card count / grade
    # quan_data['debcnt_chu_lvl'] = quan_data['cur_credit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 5)  # debit variant, unused
    # card counts
    quan_data['cre_chu_deb_cnt'] = (quan_data['cur_credit_cnt'] + 1) / (quan_data['cur_debit_cnt'] + 1)  # credit cards / debit cards
    quan_data['zhang_deb_jia_cre'] = quan_data['cur_debit_cnt'] + quan_data['cur_credit_cnt']
    quan_data['zhang_deb_jian_cre'] = quan_data['cur_debit_cnt'] - quan_data['cur_credit_cnt']
    # account-opening days
    quan_data['tian_deb_jia_cre'] = quan_data['cur_debit_min_opn_dt_cnt'] + quan_data['cur_credit_min_opn_dt_cnt']
    quan_data['tian_deb_jian_cre'] = quan_data['cur_debit_min_opn_dt_cnt'] - quan_data['cur_credit_min_opn_dt_cnt']
    quan_data['tian_cre_chu_cre'] = quan_data['cur_credit_min_opn_dt_cnt'] / (quan_data['cur_debit_min_opn_dt_cnt'] + 3)
    # card grades (a duplicated assignment in the original is dropped)
    quan_data['deng_deb_jia_cre'] = quan_data['cur_debit_crd_lvl'] + quan_data['hld_crd_card_grd_cd']
    quan_data['deng_deb_jian_cre'] = quan_data['cur_debit_crd_lvl'] - quan_data['hld_crd_card_grd_cd']
    quan_data['deng_cre_chu_cre'] = quan_data['cur_debit_crd_lvl'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    # pairwise combinations of the amount-tier fields
    quan_data = jin_fea(quan_data, 'oneyear', 'yongjiu')
    quan_data = jin_fea(quan_data, 'oneyear', 'aum')
    quan_data = jin_fea(quan_data, 'oneyear', 'qianli')
    quan_data = jin_fea(quan_data, 'yongjiu', 'aum')
    quan_data = jin_fea(quan_data, 'yongjiu', 'qianli')
    quan_data = jin_fea(quan_data, 'aum', 'qianli')
    ### cross-group combinations
    # grade */ days, grade */ card count
    quan_data['deng_cre_tian_chu_deng'] = quan_data['cur_credit_min_opn_dt_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    quan_data['deng_cre_tian_cheng_deng'] = quan_data['cur_credit_min_opn_dt_cnt'] * quan_data['hld_crd_card_grd_cd']
    quan_data['deng_deb_tian_chu_deng'] = quan_data['cur_debit_min_opn_dt_cnt'] / (quan_data['cur_debit_crd_lvl'] + 3)
    quan_data['deng_deb_tian_cheng_deng'] = quan_data['cur_debit_min_opn_dt_cnt'] * quan_data['cur_debit_crd_lvl']
    quan_data['deng_cre_zhang_cheng_deng'] = quan_data['cur_credit_cnt'] * quan_data['hld_crd_card_grd_cd']
    quan_data['deng_cre_zhang_chu_deng'] = quan_data['cur_credit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    quan_data['deng_deb_zhang_cheng_deng'] = quan_data['cur_debit_cnt'] * quan_data['hld_crd_card_grd_cd']
    quan_data['deng_deb_zhang_chu_deng'] = quan_data['cur_debit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    # 1-year consumption tier: grade +-*/ amount, card count +-*/ amount
    quan_data['jin_oy_cre_deng_cheng_jin'] = quan_data['hld_crd_card_grd_cd'] * quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_jin_chu_deng'] = quan_data['l1y_crd_card_csm_amt_dlm_cd'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    quan_data['jin_oy_cre_deng_jia_jin'] = quan_data['hld_crd_card_grd_cd'] + quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_deng_jian_jin'] = quan_data['hld_crd_card_grd_cd'] - quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_zhang_cheng_jin'] = quan_data['cur_credit_cnt'] * quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_jin_chu_zhang'] = quan_data['l1y_crd_card_csm_amt_dlm_cd'] / (quan_data['cur_credit_cnt'] + 3)
    quan_data['jin_oy_cre_zhang_jia_jin'] = quan_data['cur_credit_cnt'] + quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_zhang_jian_jin'] = quan_data['cur_credit_cnt'] - quan_data['l1y_crd_card_csm_amt_dlm_cd']
    # permanent credit limit tier: limit +-* grade, limit +-* count, days */ limit
    quan_data['jin_yj_cre_deng_cheng_jin'] = quan_data['hld_crd_card_grd_cd'] * quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_deng_jia_jin'] = quan_data['hld_crd_card_grd_cd'] + quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_deng_jian_jin'] = quan_data['hld_crd_card_grd_cd'] - quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_zhang_cheng_jin'] = quan_data['cur_credit_cnt'] * quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_zhang_jia_jin'] = quan_data['cur_credit_cnt'] + quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_zhang_jian_jin'] = quan_data['cur_credit_cnt'] - quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_tian_chu_jin'] = quan_data['cur_debit_min_opn_dt_cnt'] / (quan_data['perm_crd_lmt_cd'] + 3)  # name masked in the original post; reconstructed from the naming pattern
    quan_data['jin_yj_cre_tian_cheng_jin'] = quan_data['cur_debit_min_opn_dt_cnt'] * quan_data['perm_crd_lmt_cd']
    # job_year: account age relative to years on the job
    quan_data['job_cre_tian_chu_job'] = quan_data['cur_credit_min_opn_dt_cnt'] / ((quan_data['job_year'] + 0.5) * 360)
    quan_data['job_deb_tian_chu_job'] = quan_data['cur_debit_min_opn_dt_cnt'] / ((quan_data['job_year'] + 0.5) * 360)
    # dnl_mbl_bnk_ind: app download, binding and card-activity indicators
    quan_data['xia_jia_huo'] = quan_data['dnl_mbl_bnk_ind'] + quan_data['crd_card_act_ind']
    quan_data['xia_jia_bang'] = quan_data['dnl_mbl_bnk_ind'] + quan_data['dnl_bind_cmb_lif_ind']
    quan_data['huo_jia_bang'] = quan_data['crd_card_act_ind'] + quan_data['dnl_bind_cmb_lif_ind']
    quan_data['xia_jia_huo_jia_bang'] = quan_data['dnl_mbl_bnk_ind'] + quan_data['crd_card_act_ind'] + quan_data['dnl_bind_cmb_lif_ind']
    # car and house ownership
    quan_data['car_jia_house'] = quan_data['hav_car_grp_ind'] + quan_data['hav_hou_grp_ind']
    print('tag rows:', len(quan_data))
    return quan_data

Train_data = tag_fea(Train_data)
TestA_data = tag_fea(TestA_data)
num_tag = len(TestA_data.columns)

for col in Train_data.columns:
    try:
        Train_data[col] = Train_data[col].apply(float)
        TestA_data[col] = TestA_data[col].apply(float)
    except:
        print('column cannot be converted to float: {}'.format(col))

#### Step 3 (cont.): feature and label construction
### 3.1 collect numeric and categorical column names
numerical_cols = Train_data.select_dtypes(exclude='object').columns
print(numerical_cols)
categorical_cols = Train_data.select_dtypes(include='object').columns
print(categorical_cols)
# columns excluded from the feature set (id, the label, and some hand-picked fields)
shaoqu_0503 = ['id', 'flag', 'ic_ind', 'fr_or_sh_ind', 'hav_hou_grp_ind',
               'cust_inv_rsk_endu_lvl_cd', 'l12mon_buy_fin_mng_whl_tms',
               'l12_mon_fnd_buy_whl_tms', 'l12_mon_insu_buy_whl_tms',
               'l12_mon_gld_buy_whl_tms', 'loan_act_ind', 'pl_crd_lmt_cd',
               'ovd_30d_loan_tot_cnt', 'his_lng_ovd_day']
jihao = ['cur_credit_min_opn_dt_cnt', 'l1y_crd_card_csm_amt_dlm_cd', 'gdr_cd', 'edu_deg_cd']  # defined but not used below

### 3.2 build the training and test samples
## select feature columns
numerical_cols = [col for col in numerical_cols if col not in shaoqu_0503]
print(numerical_cols)
categorical_cols = [col for col in categorical_cols if col not in shaoqu_0503]
print(categorical_cols)
feature_cols = numerical_cols + categorical_cols
feature_cols = [col for col in feature_cols if 'Type' not in col]

## build the training and test samples from the feature columns and the label
X_data = Train_data[feature_cols].replace('\\N', -2)
Y_data = Train_data['flag']
X_test = TestA_data[feature_cols].replace('\\N', -2)

## SMOTE oversampling was tried but left disabled
# smo = SMOTE(random_state=42)  # random_state acts as the random seed
# X_smo, y_smo = smo.fit_sample(X_data, Y_data)

# encode the categorical columns as integers so xgb can consume them
def label_encoder_tag(tag_data):
    gdr_dict = {'F': 1, 'M': 0}
    tag_data['gdr_cd'] = tag_data['gdr_cd'].map(gdr_dict)
    mrg_situ_dict = {'A': 1, 'B': 2, 'O': 3, 'Z': 4}
    tag_data['mrg_situ_cd'] = tag_data['mrg_situ_cd'].map(mrg_situ_dict)
    edu_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'F': 5, 'G': 6, 'J': 7, 'K': 8, 'L': 9, 'M': 10, 'Z': 11}
    tag_data['edu_deg_cd'] = tag_data['edu_deg_cd'].map(edu_dict)
    acdm_deg_dict = {'C': 1, 'D': 2, 'F': 3, 'G': 4, 'Z': 5, 30: 6, 31: 7}
    tag_data['acdm_deg_cd'] = tag_data['acdm_deg_cd'].map(acdm_deg_dict)
    deg_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'Z': 5}
    tag_data['deg_cd'] = tag_data['deg_cd'].map(deg_dict)
    return tag_data

try:
    X_data = label_encoder_tag(X_data)
    X_test = label_encoder_tag(X_test)
except:
    print('categorical encoding failed')
# This helper follows code from a book; Python-style, there should be a
# built-in equivalent somewhere. To look up later.
num_oht = len(X_test.columns)

## a small helper for printing summary statistics
def Sta_inf(data):
    print('_min', np.min(data))
    print('_max:', np.max(data))
    print('_mean', np.mean(data))
    print('_ptp', np.ptp(data))  # range (max - min)
    print('_std', np.std(data))
    print('_var', np.var(data))

print('Sta of label:')
Sta_inf(Y_data)

## plot the label distribution
plt.hist(Y_data)
plt.show()
plt.close()

### Step 3.3: feature screening
# drop near-constant columns (a single value above 85%) and columns missing in more than 80% of rows
col_shan = []
def shan_que_chang(col_shan, X_test):
    yuzhi = 0.85         # near-constant threshold
    queshi_yuzhi = 0.8   # missing-rate threshold
    test_hang = 4000     # number of rows in the B-board test set
    num = X_test.isna().sum()
    df_queshi = pd.DataFrame()
    df_queshi['col_name'] = num.index
    df_queshi['rate'] = list(num / test_hang)
    df_shan = df_queshi[df_queshi['rate'] > queshi_yuzhi]
    col_shan = col_shan + list(df_shan['col_name'])
    for col in X_test.columns:
        vc_test = pd.DataFrame(X_test[col].value_counts() / test_hang)
        if vc_test.iloc[0, 0] > yuzhi:
            col_shan.append(col)
    return col_shan

col_shan = shan_que_chang(col_shan, X_test)
X_data = X_data.drop(columns=col_shan)
X_test = X_test.drop(columns=col_shan)

## remove collinear features
# threshold for removing correlated variables
threshold = 0.97
# absolute-value correlation matrix
corr_matrix = X_data.corr().abs()
# upper triangle of correlations (bool instead of the deprecated np.bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# select columns with correlations above the threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
X_data = X_data.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)

## feature selection by importance
# initialize an empty array to hold feature importances
feature_importances = np.zeros(X_data.shape[1])
# create the model with several hyperparameters
model_fi = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,
                            colsample_bytree=0.9, max_depth=7)
# fit the model twice to avoid overfitting to a single split
for i in range(2):
    # split into training and validation sets
    train_features, valid_features, train_y, valid_y = train_test_split(X_data, Y_data, test_size=0.25, random_state=i)
    # train using early stopping
    model_fi.fit(train_features, train_y, early_stopping_rounds=100,
                 eval_set=[(valid_features, valid_y)], eval_metric='auc', verbose=200)
    # record the feature importances
    feature_importances += model_fi.feature_importances_

# make sure to average the feature importances!
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X_data.columns),
                                    'importance': feature_importances}).sort_values('importance', ascending=False)

# find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
X_data = X_data.drop(columns=zero_features)
X_test = X_test.drop(columns=zero_features)
print(zero_features)

#### Step 4: model training and prediction
### 4.1 five-fold cross-validation to check the parameters (an lgb regressor is used here)
lgr = lgb.LGBMRegressor(objective='regression', num_leaves=130, learning_rate=0.05, n_estimators=150)
scores_train = []
scores = []
## 5-fold CV
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_ind, val_ind in sk.split(X_data, Y_data):
    train_x = X_data.iloc[train_ind].values
    train_y = Y_data.iloc[train_ind]
    val_x = X_data.iloc[val_ind].values
    val_y = Y_data.iloc[val_ind]
    lgr.fit(train_x, train_y)
    pred_train_xgb = lgr.predict(train_x)
    pred_xgb = lgr.predict(val_x)
    score_train = mean_absolute_error(train_y, pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y, pred_xgb)
    scores.append(score)
print('Train mae:', np.mean(scores_train))  # bug fix: the original averaged only the last fold
print('Val mae', np.mean(scores))

## define the xgb and lgb model builders
def build_model_xgb(x_train, y_train, r_seed):
    model = xgb.XGBRegressor(n_estimators=150, seed=r_seed, learning_rate=0.1, gamma=0, subsample=0.8,
                             colsample_bytree=0.9, max_depth=7)  # objective='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train, y_train, r_seed):
    estimator = lgb.LGBMRegressor(num_leaves=127, seed=r_seed, n_estimators=150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm

## split off a validation set, then train, evaluate and predict
x_train, x_val, y_train, y_val = train_test_split(X_data, Y_data, test_size=0.3)

print('Train xgb...')
model_xgb = build_model_xgb(x_train, y_train, 123)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val, val_xgb)
print('MAE of val with xgb:', MAE_xgb)

print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data, Y_data, 123)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)

print('Predict xgb...')
model_xgb_pre_1 = build_model_xgb(X_data, Y_data, 1)  # second xgb with a different seed
subA_xgb_1 = model_xgb_pre_1.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb_1)

print('Train lgb...')
model_lgb = build_model_lgb(x_train, y_train, 123)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val, val_lgb)
print('MAE of val with lgb:', MAE_lgb)

print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data, Y_data, 123)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)

print('Predict lgb...')
model_lgb_pre_1 = build_model_lgb(X_data, Y_data, 1)  # second lgb with a different seed
subA_lgb_1 = model_lgb_pre_1.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb_1)

# plot feature importance
# plot_importance(model_xgb_pre)
# pyplot.show()

# count how many features have non-zero importance
feature_score_dict = {}
for fn, s in zip(X_data.columns, model_xgb_pre.feature_importances_):
    feature_score_dict[fn] = s
m = 0
for k in feature_score_dict:
    if feature_score_dict[k] == 0.0:
        m += 1
print('number of not-zero features:' + str(len(feature_score_dict) - m))

# print the feature importances
feature_score_dict_sorted = sorted(feature_score_dict.items(), key=lambda d: d[1], reverse=True)
print('xgb_feature_importance:')
for ii in range(len(feature_score_dict_sorted)):
    print(feature_score_dict_sorted[ii][0], feature_score_dict_sorted[ii][1])
print('\n')

f = open('../eda/sub_0511_4_xgb_feature_importance.txt', 'w')
f.write('Rank\tFeature Name\tFeature Importance\n')
for i in range(len(feature_score_dict_sorted)):
    f.write(str(i) + '\t' + str(feature_score_dict_sorted[i][0]) + '\t' + str(feature_score_dict_sorted[i][1]) + '\n')
f.close()

## simple MAE-weighted blending of the two models
val_Weighted = (1 - MAE_lgb / (MAE_xgb + MAE_lgb)) * val_lgb + (1 - MAE_xgb / (MAE_xgb + MAE_lgb)) * val_xgb
val_Weighted[val_Weighted < 0] = 10  # post-correct negative predictions; this line (its original comment talked about 'price') appears to be carried over from the tutorial template this script started from
print('MAE of val with Weighted ensemble:', mean_absolute_error(y_val, val_Weighted))
sub_Weighted = (1 - MAE_lgb / (MAE_xgb + MAE_lgb)) * subA_lgb + (1 - MAE_xgb / (MAE_xgb + MAE_lgb)) * subA_xgb

## rank-averaging fusion over the five prediction vectors
tmps = pd.DataFrame()
tmps['id'] = TestA_data.id
tmps['tag_weight'] = sub_Weighted
tmps['tag_xgb'] = subA_xgb
tmps['tag_xgb_1'] = subA_xgb_1
tmps['tag_lgb'] = subA_lgb
tmps['tag_lgb_1'] = subA_lgb_1
for met in ['weight', 'xgb', 'lgb', 'xgb_1', 'lgb_1']:
    tmps[met + '_rank'] = tmps['tag_' + met].rank(method='min')
tmps.head()
tmps['score_rank'] = tmps[[m + '_rank' for m in ['weight', 'xgb', 'lgb', 'xgb_1', 'lgb_1']]].sum(1)
# min-max scale the summed ranks into [0, 1]; AUC only cares about the ordering
max_min_scaler = lambda x: (x - np.min(x)) / (np.max(x) - np.min(x))
tmps['score'] = tmps[['score_rank']].apply(max_min_scaler)

## inspect the distribution of the predictions
plt.hist(sub_Weighted)
plt.show()
plt.close()

sub = pd.DataFrame()
sub['id'] = TestA_data.id
sub['tag'] = tmps['score']
sub['tag'] = sub['tag'].apply(lambda r: max(r, 0))
sub['tag'] = sub['tag'].apply(lambda r: min(r, 1))
sub = sub[['id', 'tag']]
# sub.to_csv('../result/sub_0511_ronghe_4_b.txt', sep='\t', index=False, header=None, encoding='utf-8')