DRN: A deep reinforcement learning framework for news recommendation
Problems with existing methods: they fail to capture the dynamic nature of news recommendation.
First, they only try to model the current (immediate) reward, e.g., click-through rate.
Second, very few studies consider using user feedback other than click/no-click labels (e.g., how frequently a user returns) to help improve recommendations.
Third, these methods tend to keep recommending similar news to users, which may cause users to become bored.

Problem with ε-greedy: it may recommend items completely unrelated to the user's interests.

Problem with UCB-style exploration: an item must be tried many times before its value estimate becomes accurate.
Contributions:
1. A deep reinforcement learning (DQN-based) framework for news recommendation.
2. User activeness as an additional feedback signal, which works much better than click/no-click labels alone.
3. Dueling Bandit Gradient Descent (DBGD) for more effective exploration.
4. Experiments show that the approach indeed performs well.

method:
We use a continuous state feature representation of users and continuous action feature representation of items as inputs to DQN.
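As a concrete reading of that sentence, here is a minimal PyTorch sketch of such a Q-network with the paper's dueling-style split: a value part V(s) that sees only the user (state) features, and an advantage part A(s, a) that sees both the user and the news-item (action) features. The class name, layer sizes, and hidden dimension are my own assumptions, not the paper's architecture details.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a): value from user-side (state) features,
    advantage from user + news-item (action) features.
    Dimensions and layer sizes are illustrative assumptions."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # V(s): depends only on the continuous user (state) features.
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # A(s, a): depends on both user and news-item (action) features.
        self.advantage = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.value(state) + self.advantage(torch.cat([state, action], dim=-1))
```

To build a top-k list, one would broadcast the user's state vector over all candidate news vectors and rank the candidates by the resulting Q-values.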
model framework:

  1. Push: when a user sends a news request to the system, the recommendation agent G takes the feature representations of the current user and of the candidate news as input, and generates a top-k list L of news to recommend. L is generated by combining the exploitation of the current model and the exploration of novel items.
  2. Feedback: the user u who received the recommended list L gives feedback B through his clicks on this set of news.
  3. Minor update: after each timestamp, using the feature representations of the previous user u and news list L together with the feedback B, G compares the two DQNs, the exploitation Q-network and the exploration Q-network; if the latter gives the better recommendation results, the current model is updated a small step toward the exploration network (see the DBGD sketch after this list).
  4. Major update: after a certain period, G performs experience replay; the agent keeps a memory of recent historical clicks and user activeness records, and samples from it to retrain the Q-network.
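A minimal sketch of the minor-update step, under my reading of Dueling Bandit Gradient Descent in the paper: the exploration network is a copy of the current network with every weight perturbed by a random factor in [-α, α], and the current weights move a small step η toward that perturbation only when the exploration list wins the interleaved comparison. The function names and the values of `alpha` and `eta` are placeholders.

```python
import copy
import torch

def dbgd_minor_update(q_net, alpha: float = 0.1, eta: float = 0.05):
    """One exploit-vs-explore probe. Returns (explore_net, apply_fn)."""
    # Exploration network: perturb each weight W by
    # delta = alpha * rand(-1, 1) * W.
    explore_net = copy.deepcopy(q_net)
    deltas = []
    with torch.no_grad():
        for p, p_exp in zip(q_net.parameters(), explore_net.parameters()):
            delta = alpha * (2 * torch.rand_like(p) - 1) * p
            p_exp.add_(delta)
            deltas.append(delta)

    def apply_if_explore_wins(explore_won: bool):
        # Move the current network a small step toward the perturbation
        # only when the explore list collected the better feedback.
        if explore_won:
            with torch.no_grad():
                for p, delta in zip(q_net.parameters(), deltas):
                    p.add_(eta * delta)

    return explore_net, apply_if_explore_wins
```

One would rank the candidates with both `q_net` and `explore_net`, probabilistically interleave the two top-k lists into L, and after observing the clicks B call `apply_if_explore_wins(True)` if the exploration items collected the better feedback.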

User Activeness

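My notes stop here; as a reminder of the idea, the paper estimates user activeness with a survival-style model: activeness decays over time, jumps by a fixed amount each time the user returns, is truncated to stay in [0, 1], and this signal is folded into the reward alongside the click label. A minimal sketch (the constants `s0`, `s_a`, and `lam` are illustrative placeholders, not the paper's fitted values):

```python
import math

class UserActiveness:
    """Survival-style user activeness: decays between requests,
    jumps on each return, truncated to [0, 1]."""

    def __init__(self, s0: float = 0.5, s_a: float = 0.32, lam: float = 1e-3):
        self.score = s0      # initial activeness (placeholder value)
        self.s_a = s_a       # boost per user return (placeholder value)
        self.lam = lam       # decay rate per unit time (placeholder value)

    def on_return(self, elapsed: float) -> float:
        # Exponential decay since the last request, then a fixed boost,
        # truncated so activeness never exceeds 1.
        self.score = min(1.0, self.score * math.exp(-self.lam * elapsed) + self.s_a)
        return self.score

    def peek(self, elapsed: float) -> float:
        # Current activeness without a return event (decay only).
        return self.score * math.exp(-self.lam * elapsed)
```

The activeness reading at each interaction is then combined with the click label to form the total reward that the DQN is trained on.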