推荐系统Lambda架构算法（九）：基于内容的电影推荐——用户画像

文章目录

- 基于内容的电影推荐：用户画像
- - - 用户画像建立

基于内容的电影推荐：用户画像

用户画像构建步骤：

根据用户的评分历史，结合物品画像，将有观影记录的电影的画像标签作为初始标签反打到用户身上
通过对用户观影标签的次数进行统计，计算用户的每个初始标签的权重值，排序后选取TOP-N作为用户最终的画像标签

用户画像建立

import pandas as pd
import numpy as np
from gensim.models import TfidfModel

from functools import reduce
import collections

from pprint import pprint

# ......

''' user profile画像建立： 1. 提取用户观看列表 2. 根据观看列表和物品画像为用户匹配关键词，并统计词频 3. 根据词频排序，最多保留TOP-k个词，这里K设为100，作为用户的标签 '''

def create_user_profile():
    watch_record = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(2), dtype={
   "userId":np.int32, "movieId": np.int32})

    watch_record = watch_record.groupby("userId").agg(list)
    # print(watch_record)

    movie_dataset = get_movie_dataset()
    movie_profile = create_movie_profile(movie_dataset)

    user_profile = {
   }
    for uid, mids in watch_record.itertuples():
        record_movie_prifole = movie_profile.loc[list(mids)]
        counter = collections.Counter(reduce(lambda x, y: list(x)+list(y), record_movie_prifole["profile"].values))
        # 取出出现次数最多的前50个词
        interest_words = counter.most_common(50)
        # 取出出现次数最多的词 出现的次数
        maxcount = interest_words[0][1]
        # 利用次数计算权重 出现次数最多的词权重为1
        interest_words = [(w,round(c/maxcount, 4)) for w,c in interest_words]
        user_profile[uid] = interest_words

    return user_profile

user_profile = create_user_profile()
pprint(user_profile)

reduce函数
- 描述
  
  reduce() 函数会对参数序列中元素进行累积。
  
  函数将一个数据集合（链表，元组等）中的所有数据进行下列操作：用传给 reduce 中的函数 function（有两个参数）先对集合中的第 1、2 个元素进行操作，得到的结果再与第三个数据用 function 函数运算，最后得到一个结果。
- 语法
  
  reduce() 函数语法：
```
reduce(function, iterable[, initializer])
```
- 参数
  - function – 函数，有两个参数
  - iterable – 可迭代对象
  - initializer – 可选，初始参数
- 返回值
  
  返回函数计算结果。
- 示例
```
>>>def add(x, y) :            # 两数相加
...     return x + y
... 
>>> reduce(add, [1,2,3,4,5])   # 计算列表和：1+2+3+4+5
15
>>> reduce(lambda x, y: x+y, [1,2,3,4,5])  # 使用 lambda 匿名函数
15
```

使用collections.Counter类统计列表元素出现次数

from collections import Counter
names = ["Stanley", "Lily", "Bob", "Well", "Peter", "Bob", "Well", "Peter", "Well", "Peter", "Bob","Stanley", "Lily", "Bob", "Well", "Peter", "Bob", "Bob", "Well", "Peter", "Bob", "Well"]
names_counts = Counter(names)