1.19 log

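Running locally in Spyder: MemoryError (not enough memory).
Running on the server: conda packages cannot be downloaded online (a mirror did not help either), so they have to be installed offline.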

Offline installation:
1. Download the .whl package from https://pypi.org/project/gensim/#files
2. Copy the package to anaconda/lib/python3.6/site-packages/ on the server
3. Install it by running pip install on the .whl file
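4. If the offline install fails with a network error, some prerequisite packages are probably missing: pip then tries to download them, and they have to be installed offline first.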

Start debugging the program

Step 1 of the algorithm: read in train and test
Process the timestamp 'ts'

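import time
import pandas as pd

train_df = pd.read_csv('train.csv')
# 'ts' is in milliseconds: convert to seconds, format as a local-time string, then parse it
train_df['date'] = pd.to_datetime(
    train_df['ts'].apply(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x / 1000)))
)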
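1. time.localtime([secs]): converts a timestamp in seconds into a struct_time in the local time zone. If secs is omitted, the current time is used.
2. time.strftime(format[, t]): formats a time tuple or struct_time (such as the one returned by time.localtime() or time.gmtime()) as a string. If t is omitted, the result of time.localtime() is used.
  Example: time.strftime("%Y-%m-%d %X", time.localtime()) # prints something like '2017-10-01 12:14:23'
3. The strings produced by apply have dtype "object"; pd.to_datetime converts them to a datetime type, and the result is stored in train_df['date'].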
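
As a side note, the same conversion can be sketched without the per-row apply, which may be faster on a large file. This is only a sketch: the sample timestamp is made up, and the UTC+8 offset is an assumption about the server's local time zone (pd.to_datetime with unit='ms' returns UTC, whereas time.localtime uses the machine's zone).

```python
import pandas as pd

# Made-up millisecond timestamp standing in for the 'ts' column
ts = pd.Series([1573143000000])

# Vectorized parse of epoch milliseconds; the result is in UTC
date_utc = pd.to_datetime(ts, unit='ms')

# Assumption: local time is UTC+8, so shift by eight hours
date_local = date_utc + pd.Timedelta(hours=8)

print(date_utc.iloc[0], date_local.iloc[0], date_local.dt.day.iloc[0])
```

The sample shows why the time zone matters here: the same exposure falls on the 7th in UTC but on the 8th in UTC+8, which changes the 'day' feature.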

Reducing memory: the reduce_mem_usage function

https://www.jianshu.com/p/d54fc84f3b42
https://zhuanlan.zhihu.com/p/68092069

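Besides using gc to free variables that are no longer needed, the following function also helps:

Loop over every column
Check whether the column is numeric
Check whether the column is an integer type
Find its minimum and maximum
Pick the most memory-efficient dtype that still fits the column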

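import gc
import numpy as np

# Downcast each numeric column to the smallest dtype whose range covers its min and max
def reduce_mem(df):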
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('{:.2f} Mb, {:.2f} Mb ({:.2f} %)'.format(start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    gc.collect()
    return df
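
The downcasting idea can be tried on a toy frame. The sketch below uses pd.to_numeric with its downcast argument rather than the reduce_mem function above, so it is a related technique and not the author's code. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: all columns start as 64-bit types
demo = pd.DataFrame({
    'small_int': np.arange(100, dtype=np.int64),         # fits in int8
    'big_int': np.arange(100, dtype=np.int64) * 1000,    # needs int32
    'half_steps': np.arange(100, dtype=np.float64) * 0.5 # exactly representable in float32
})
before = demo.memory_usage().sum()

# Let pandas pick the smallest dtype that can hold each column
demo['small_int'] = pd.to_numeric(demo['small_int'], downcast='integer')
demo['big_int'] = pd.to_numeric(demo['big_int'], downcast='integer')
demo['half_steps'] = pd.to_numeric(demo['half_steps'], downcast='float')
after = demo.memory_usage().sum()

print(demo.dtypes)
print('{} bytes -> {} bytes'.format(before, after))
```

Unlike reduce_mem, downcast='float' never goes below float32, which avoids the precision loss that float16 can introduce.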

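1.20 log

A ValueError came up when running on the server; it turned out the train file had lost data when it was copied over.
Ran the whole pipeline once on a train file of about 50M; the submission scored an accuracy of 50-something.
Started a run on the full data a little after 5 pm. No result by 11 pm, and the next morning the computer had shut down, so I don't know whether it finished. No result.
The network dropped just as it was about to succeed. Infuriating.
In the evening, running one job in the background and one in the foreground caused a memory error.
Use nohup to run in the background; kill -9 <PID> force-kills a process.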

Related commands:

https://www.cnblogs.com/dion-90/articles/9048627.html ps -aux
https://www.jianshu.com/p/ca5f9a9af70d nohup
https://www.cnblogs.com/yunwangjun-python-520/p/10713564.html

1.21 log

After running for 10+ hours, got a result: score 0.8106

1.28 log

Read in the training set

t = time.time()
train_df = pd.read_csv('train.csv')
# Convert the millisecond timestamp to a datetime
train_df['date'] = pd.to_datetime(
    train_df['ts'].apply(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x / 1000)))
)
train_df['day'] = train_df['date'].dt.day
train_df.loc[train_df['day'] == 7, 'day'] = 8  # fold the data from the 7th into the 8th
train_df['hour'] = train_df['date'].dt.hour
train_df['minute'] = train_df['date'].dt.minute
train_num = train_df.shape[0]
labels = train_df['target'].values
print('runtime:', time.time() - t)

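The training set covers four days, November 7, 8, 9 and 10, 2019, but the 7th has very little data, so it is merged into the 8th.
[Figure: data volume per day after merging]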

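Analysis of the click data
Statistics on the interval from exposure to click (the reaction time) turned out to be nearly useless, so they were dropped. The interval contains large negative values, for reasons unknown.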

click_df = train_df[train_df['target'] == 1].sort_values('timestamp').reset_index(drop=True)  # reset_index renumbers the rows
click_df['exposure_click_gap'] = click_df['timestamp'] - click_df['ts']  # time from exposure to click (reaction time)
click_df = click_df[click_df['exposure_click_gap'] >= 0].reset_index(drop=True)
click_df['date'] = pd.to_datetime(
  click_df['timestamp'].apply(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x / 1000)))
)
click_df['day'] = click_df['date'].dt.day
click_df.loc[click_df['day'] == 7, 'day'] = 8
del train_df['target'], train_df['timestamp']
for f in ['date', 'exposure_click_gap', 'timestamp', 'ts', 'target', 'hour', 'minute']:
  del click_df[f]
print('runtime:', time.time() - t)

Process the test set

Encode the categorical columns

print('=============================================== cate enc ===============================================')
df['lng_lat'] = df['lng'].astype('str') + '_' + df['lat'].astype('str')  # location: longitude and latitude
del df['guid']
click_df['lng_lat'] = click_df['lng'].astype('str') + '_' + click_df['lat'].astype('str')
sort_df = df.sort_values('ts').reset_index(drop=True)
cate_cols = [
 'deviceid', 'newsid', 'pos', 'app_version', 'device_vendor',
 'netmodel', 'osversion', 'device_version', 'lng', 'lat', 'lng_lat'
]
for f in cate_cols:
 print(f)
 map_dict = dict(zip(df[f].unique(), range(df[f].nunique())))  # unique() lists the distinct values, nunique() counts them
 df[f] = df[f].map(map_dict).fillna(-1).astype('int32')  # turn string categories into integer codes
 click_df[f] = click_df[f].map(map_dict).fillna(-1).astype('int32')
 sort_df[f] = sort_df[f].map(map_dict).fillna(-1).astype('int32')
 df[f + '_count'] = df[f].map(df[f].value_counts())
df = reduce_mem(df)
click_df = reduce_mem(click_df)
sort_df = reduce_mem(sort_df)
print('runtime:', time.time() - t)

Building a dictionary with zip

keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))
print(dictionary)  # {'a': 1, 'b': 2, 'c': 3}

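map_dict: taking app_version as an example, df['app_version'].unique() returns the distinct values of the column, which become the dictionary keys, and df['app_version'].nunique() returns how many distinct values there are. With 27 distinct app versions, the codes 0 to 26 are the corresponding values. Note that nunique() skips NaN by default while unique() includes it, so on a column with missing values zip stops one entry short and the last unique value gets no code (it later becomes -1 through fillna).
Series.map: maps each key in the column to its value. For example, if the dictionary contains {'2.1.1': 3}, every row whose app_version is 2.1.1 is updated to 3.
The steps above amount to encoding the text data.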
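
The encoding steps can be checked on a toy column. This is a minimal sketch with made-up version strings: train_col and other_col are hypothetical stand-ins for df[f] and click_df[f].

```python
import pandas as pd

# Hypothetical stand-ins for df['app_version'] and click_df['app_version']
train_col = pd.Series(['2.1.1', '2.1.0', '2.1.1', '3.0.0'])
other_col = pd.Series(['3.0.0', '9.9.9'])  # '9.9.9' never appears in train_col

# Keys are the distinct values in order of first appearance, values are 0..n-1
map_dict = dict(zip(train_col.unique(), range(train_col.nunique())))

# Values missing from the dictionary map to NaN, which fillna turns into -1
train_enc = train_col.map(map_dict).fillna(-1).astype('int32')
other_enc = other_col.map(map_dict).fillna(-1).astype('int32')

print(map_dict)
print(train_enc.tolist(), other_enc.tolist())
```

Reusing the same map_dict for every frame keeps the codes consistent, and -1 marks categories that the fitted column never saw.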

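Feature encoding

Feature 1: previous-day statistics (historical information: the previous day's clicks, exposures and click-through rate)

f = deviceid:
1). Number of clicks per device per day: _prev_day_click_count
2). Number of rows per device per day in the original table, i.e. how many videos the device was shown: _prev_day_count
3). Click-through rate: _prev_day_ctr

f = [pos, deviceid]:
1). Number of clicks at video position k by device i on day j: _prev_day_click_count
2). Number of videos shown at video position k to device i on day j: _prev_day_count
3). Click-through rate: _prev_day_ctr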
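
One way the previous-day statistics could be computed is sketched below. The helper name add_prev_day_feats and the toy data are hypothetical, and the original implementation may differ. The idea is to aggregate per key and day, then shift the day by one so that each row only sees the day before it.

```python
import pandas as pd

def add_prev_day_feats(df, click_df, keys):
    """Hypothetical helper: attach previous-day exposure count, click count and CTR."""
    prefix = '_'.join(keys)
    cnt_col = prefix + '_prev_day_count'
    clk_col = prefix + '_prev_day_click_count'
    ctr_col = prefix + '_prev_day_ctr'

    # Exposures and clicks per key per day
    cnt = df.groupby(keys + ['day']).size().reset_index(name=cnt_col)
    clk = click_df.groupby(keys + ['day']).size().reset_index(name=clk_col)

    hist = cnt.merge(clk, on=keys + ['day'], how='left')
    hist[clk_col] = hist[clk_col].fillna(0)
    hist[ctr_col] = hist[clk_col] / hist[cnt_col]

    # Shift by one day: statistics of day j are joined onto rows of day j + 1
    hist['day'] = hist['day'] + 1
    return df.merge(hist, on=keys + ['day'], how='left')

# Toy data: device 1 is shown two videos on the 8th and clicks one of them
df = pd.DataFrame({'deviceid': [1, 1, 1, 1, 2], 'day': [8, 8, 9, 9, 9]})
click_df = pd.DataFrame({'deviceid': [1], 'day': [8]})
out = add_prev_day_feats(df, click_df, ['deviceid'])
print(out)
```

Rows with no history (the first day, or a device unseen the day before) come out as NaN, which LightGBM handles natively. Passing keys=['pos', 'deviceid'] would give the second group of features in the same way.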

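Feature 2: the time gaps between the current exposure and the previous x / next x exposures. The gap to the next x exposures is a leakage feature (it uses information from the future), and it is the strongest feature.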
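
A minimal sketch of such gap features, assuming they are computed per deviceid on the 'ts' column. The grouping key, the column names and the toy timestamps are assumptions, not taken from the original code.

```python
import pandas as pd

# Toy exposures: timestamps in milliseconds for two devices
df = pd.DataFrame({
    'deviceid': [1, 1, 1, 2, 2],
    'ts': [1000, 4000, 9000, 2000, 2500],
})
df = df.sort_values(['deviceid', 'ts']).reset_index(drop=True)

for x in [1, 2]:
    # Gap back to the x-th previous exposure of the same device
    df['ts_prev_{}_gap'.format(x)] = df['ts'] - df.groupby('deviceid')['ts'].shift(x)
    # Gap forward to the x-th next exposure: this uses future information
    df['ts_next_{}_gap'.format(x)] = df.groupby('deviceid')['ts'].shift(-x) - df['ts']

print(df)
```

The forward gaps are only available because the whole test set is known at prediction time. In an online setting they could not be computed, which is why they count as leakage.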

Feature 3: second-order cross features.

from scipy.stats import entropy  # assumption: entropy here is scipy.stats.entropy

print('*************************** cross feat (second order) ***************************')
# Second-order cross features; higher-order crosses can be built the same way.
# Note: renaming through a dict in SeriesGroupBy.agg only works on older pandas (removed in pandas 1.0).
cross_cols = ['deviceid', 'newsid', 'pos', 'netmodel', 'lng_lat']
for f in cross_cols:
  for col in cross_cols:
      if col == f:
          continue
      print('------------------ {} {} ------------------'.format(f, col))
      df = df.merge(df[[f, col]].groupby(f, as_index=False)[col].agg({
          'cross_{}_{}_nunique'.format(f, col): 'nunique',
          'cross_{}_{}_ent'.format(f, col): lambda x: entropy(x.value_counts() / x.shape[0]) # entropy
      }), on=f, how='left')
      if 'cross_{}_{}_count'.format(f, col) not in df.columns.values and 'cross_{}_{}_count'.format(col, f) not in df.columns.values:
          df = df.merge(df[[f, col, 'id']].groupby([f, col], as_index=False)['id'].agg({
              'cross_{}_{}_count'.format(f, col): 'count' # co-occurrence count
          }), on=[f, col], how='left')
      if 'cross_{}_{}_count_ratio'.format(col, f) not in df.columns.values:
          df['cross_{}_{}_count_ratio'.format(col, f)] = df['cross_{}_{}_count'.format(f, col)] / df[f + '_count'] # share of f's rows that go to this col value
      if 'cross_{}_{}_count_ratio'.format(f, col) not in df.columns.values:
          df['cross_{}_{}_count_ratio'.format(f, col)] = df['cross_{}_{}_count'.format(f, col)] / df[col + '_count'] # share of col's rows that come from this f value
      df['cross_{}_{}_nunique_ratio_{}_count'.format(f, col, f)] = df['cross_{}_{}_nunique'.format(f, col)] / df[f + '_count']
      print('runtime:', time.time() - t)
  df = reduce_mem(df)
del df['id']
gc.collect()
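
For a single pair of columns, the same three statistics can be sketched on toy data as follows. This version uses named aggregation, which works on pandas 0.25 and later, and computes the entropy with NumPy (natural logarithm, matching the default of scipy.stats.entropy). The helper shannon_entropy and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

def shannon_entropy(x):
    """Entropy (natural log) of the value distribution within one group."""
    p = x.value_counts() / x.shape[0]
    return float(-(p * np.log(p)).sum())

# Toy data: device 1 sees two different news items, device 2 only one
df = pd.DataFrame({'deviceid': [1, 1, 1, 2, 2], 'newsid': [10, 10, 11, 12, 12]})

# Per-device number of distinct news items and entropy of their distribution
stats = df.groupby('deviceid')['newsid'].agg(
    cross_deviceid_newsid_nunique='nunique',
    cross_deviceid_newsid_ent=shannon_entropy,
).reset_index()
df = df.merge(stats, on='deviceid', how='left')

# Co-occurrence count of each (deviceid, newsid) pair
pair_cnt = df.groupby(['deviceid', 'newsid']).size().reset_index(name='cross_deviceid_newsid_count')
df = df.merge(pair_cnt, on=['deviceid', 'newsid'], how='left')
print(df)
```

A device that always sees the same item has entropy 0, while a device whose exposures are spread evenly over many items has high entropy.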

Feature 4: embeddings with Word2vec

Building a recommender system with word2vec: https://blog.csdn.net/fendouaini/article/details/98915122?utm_source=distribute.pc_relevant.none-task

from gensim.models import Word2Vec

def emb(df, f1, f2):
    emb_size = 8
    print('====================================== {} {} ======================================'.format(f1, f2))
    # The f2 values seen by each f1 value form one "sentence"
    tmp = df.groupby(f1, as_index=False)[f2].agg({'{}_{}_list'.format(f1, f2): list})
    sentences = tmp['{}_{}_list'.format(f1, f2)].values.tolist()
    del tmp['{}_{}_list'.format(f1, f2)]
    for i in range(len(sentences)):
        sentences[i] = [str(x) for x in sentences[i]]
    # `size` is the gensim 3.x argument name; gensim 4.x renamed it to `vector_size`
    model = Word2Vec(sentences, size=emb_size, window=5, min_count=5, sg=0, hs=1, seed=2019)
    emb_matrix = []
    for seq in sentences:
        vec = []
        for w in seq:
            if w in model.wv:  # words rarer than min_count have no vector
                vec.append(model.wv[w])
        if len(vec) > 0:
            emb_matrix.append(np.mean(vec, axis=0))  # average the word vectors of the sentence
        else:
            emb_matrix.append([0] * emb_size)
    emb_matrix = np.array(emb_matrix)
    for i in range(emb_size):
        tmp['{}_{}_emb_{}'.format(f1, f2, i)] = emb_matrix[:, i]
    del model, emb_matrix, sentences
    tmp = reduce_mem(tmp)
    print('runtime:', time.time() - t)
    return tmp
emb_cols = [
    ['deviceid', 'newsid'],
    ['deviceid', 'lng_lat'],
    ['newsid', 'lng_lat'],
    # ...
]
for f1, f2 in emb_cols:
    df = df.merge(emb(sort_df, f1, f2), on=f1, how='left')
    df = df.merge(emb(sort_df, f2, f1), on=f2, how='left')
del sort_df
gc.collect()
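
The pooling step inside emb can be isolated in a small sketch. Here word_vecs is a hypothetical hand-written dictionary standing in for the trained model's vectors, so the example runs without gensim; only the averaging logic is illustrated.

```python
import numpy as np

emb_size = 4

# Hypothetical stand-in for the trained word vectors (token -> vector)
word_vecs = {
    'a': np.array([1.0, 0.0, 0.0, 0.0]),
    'b': np.array([0.0, 1.0, 0.0, 0.0]),
}

# One "sentence" per group; 'zzz' plays the role of a token dropped by min_count
sentences = [['a', 'b'], ['a', 'zzz'], ['zzz']]

def mean_pool(seq):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    vecs = [word_vecs[w] for w in seq if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb_size)

emb_matrix = np.array([mean_pool(s) for s in sentences])
print(emb_matrix)
```

Each row of emb_matrix then becomes emb_size feature columns for the corresponding group, which is what the loop over range(emb_size) does in emb.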

Split off the training set

train_df = df[:train_num].reset_index(drop=True)
test_df = df[train_num:].reset_index(drop=True)
del df
gc.collect()

train_idx = train_df[train_df['day'] < 10].index.tolist()
val_idx = train_df[train_df['day'] == 10].index.tolist()

train_x = train_df.iloc[train_idx].reset_index(drop=True)
train_y = labels[train_idx]
val_x = train_df.iloc[val_idx].reset_index(drop=True)
val_y = labels[val_idx]

del train_x['day'], val_x['day'], train_df['day'], test_df['day']
gc.collect()
print('runtime:', time.time() - t)

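Train the model

Parameter tuning
For models based on decision trees, tuning follows much the same routine. The usual steps are:
1. Start with a fairly high learning rate, around 0.1, to speed up convergence. This keeps each tuning run short.
2. Tune the basic tree parameters.
3. Tune the regularization parameters.
4. Finally lower the learning rate to squeeze out the last bit of accuracy.

Take learning_rate = 0.1 here, then choose the booster type through boosting/boost/boosting_type; the default, gbdt, is the usual choice.
To settle the number of estimators, i.e. the number of boosting iterations or residual trees (parameter n_estimators), set it to a large value first and then read the best iteration from the training results.
'num_leaves': 255. LightGBM grows trees leaf-wise, and the official advice is to keep num_leaves below 2^max_depth.
'subsample': 0.8. Row (data) sampling.
'colsample_bytree': 0.8. Column (feature) sampling.
early_stopping_rounds: when the number of iterations is set high, training stops once the validation metric has not improved for this many rounds.
verbose: log output.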

from lightgbm import LGBMClassifier

fea_imp_list = []
clf = LGBMClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    num_leaves=255,
    subsample=0.9,
    colsample_bytree=0.8,
    random_state=2019,
    metric=None
)
print('************** training **************')
clf.fit(
    train_x, train_y,
    eval_set=[(val_x, val_y)],
    eval_metric='auc',
    categorical_feature=cate_cols,
    early_stopping_rounds=200,
    verbose=50
)
print('runtime:', time.time() - t)
print('************** validate predict **************')
best_rounds = clf.best_iteration_
best_auc = clf.best_score_['valid_0']['auc']
val_pred = clf.predict_proba(val_x)[:, 1]
fea_imp_list.append(clf.feature_importances_)

Pitfalls encountered in the recommender-system competition