Table of Contents
DRL - 01 Introduction
ML 23-1 deep reinforcement learning
scenario of deep reinforcement learning
- learning to play Go
- Supervised vs Reinforcement
applications
Universe: https://openai.com/blog/universe/
difficulties of reinforcement learning
reward delay: some actions yield no immediate reward and look useless right now, but they influence the future and help the agent obtain rewards later (a discounted-return sketch follows this list).
agent's actions affect the subsequent data it receives, so the agent needs to explore, trying both good and bad behaviors.
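The reward-delay point can be made concrete with a discounted return: an action whose immediate reward is zero still receives credit for rewards that arrive later. A minimal sketch; the reward sequence and γ below are made-up values.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# The action at t=0 gives no immediate reward, yet its return is non-zero
# because it leads to the reward at t=2 (illustrative numbers only).
print(discounted_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]
```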
outline
Policy-based Approach - Learning an Actor
- machine learning: looking for a function
- the three steps of finding a function
DRL
neural network as actor
input: a vector or matrix, e.g. raw pixels
output: the probability of taking each action (the policy is stochastic); a minimal actor-network sketch follows
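A minimal sketch of such an actor network, assuming PyTorch (the notes do not fix a framework) and arbitrary layer sizes: the observation goes in, a probability per action comes out, and the action is sampled stochastically.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Neural network as actor: observation in, action probabilities out."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),   # probability of taking each action
        )

    def forward(self, obs):
        return self.net(obs)

actor = Actor(obs_dim=4, n_actions=2)            # sizes are illustrative
probs = actor(torch.randn(4))                    # one observation vector
action = torch.distributions.Categorical(probs).sample()   # stochastic choice
```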
goodness of function
supervised learning vs DRL
pick the best function
- gradient ascent
- add a baseline
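A minimal sketch of "pick the best function" by gradient ascent with a baseline, assuming a PyTorch actor like the one above and episodes collected as (observations, actions, total_return) tuples; that data layout is an assumption for illustration.

```python
import torch

def policy_gradient_step(actor, optimizer, episodes):
    """One gradient-ascent step on expected return, with a mean-return baseline.

    episodes: list of (obs, actions, total_return) where obs is a (T, obs_dim)
    tensor, actions a (T,) tensor, and total_return a float (assumed layout).
    """
    baseline = sum(ret for _, _, ret in episodes) / len(episodes)

    loss = 0.0
    for obs, actions, ret in episodes:
        probs = actor(obs)                                         # (T, n_actions)
        log_p = torch.distributions.Categorical(probs).log_prob(actions)
        # Maximize (R - b) * sum_t log p(a_t | s_t): episodes better than the
        # baseline have their actions made more likely, worse ones less likely.
        loss = loss - (ret - baseline) * log_p.sum()

    optimizer.zero_grad()
    (loss / len(episodes)).backward()
    optimizer.step()   # descending the negated objective == gradient ascent
```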
critics
a critic evaluates how good an observation (state) is, given an actor
Actor-Critic
ML 23-2 policy gradient (Supplementary Explanation)
ML 23-3 RL
interact with environments
the behavior the machine learns affects what it observes next, so the whole sequence of actions must be evaluated as a whole
components
the environment and the reward function cannot be controlled; we can only adjust the actor's behavior (a sketch of this interaction loop follows)
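A minimal sketch of that interaction loop; `env` is assumed to expose a simplified Gym-like reset()/step() interface, which is an assumption of this sketch.

```python
def run_episode(env, actor):
    """Roll out one episode. The env and its rewards are given and fixed;
    training can only change how the actor picks actions."""
    trajectory = []                      # the whole sequence of (obs, action, reward)
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = actor(obs)                          # the only part we control
        obs_next, reward, done = env.step(action)    # env + reward: not ours to change
        trajectory.append((obs, action, reward))
        total_reward += reward
        obs = obs_next
    # The episode is judged as a whole: the return sums the rewards of all actions.
    return trajectory, total_reward
```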
critic
estimating the critic (a sketch of the two targets follows this list):
Monte-Carlo (MC): regress V(s) toward the whole episode's return
Temporal Difference (TD): regress V(s_t) toward r_t + γ·V(s_{t+1})
Q function: the state-action value Q(s, a)
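A minimal tabular sketch of the two critic targets named above; V is a plain dict keyed by (hashable) observations, purely for illustration. Q(s, a) follows the same idea with a state-action key.

```python
def mc_update(V, episode, gamma=0.99, lr=0.1):
    """Monte-Carlo: wait for the episode to end, move V(s_t) toward the full return G_t."""
    g = 0.0
    for obs, reward in reversed(episode):            # episode = [(obs, reward), ...]
        g = reward + gamma * g
        V[obs] = V.get(obs, 0.0) + lr * (g - V.get(obs, 0.0))

def td_update(V, obs, reward, obs_next, gamma=0.99, lr=0.1):
    """Temporal difference: update after every step toward r_t + gamma * V(s_{t+1})."""
    target = reward + gamma * V.get(obs_next, 0.0)
    V[obs] = V.get(obs, 0.0) + lr * (target - V.get(obs, 0.0))
```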
if the actions cannot be enumerated (e.g. a continuous action space), taking the max over Q(s, a) blows up; use PDPG instead
pathwise derivative policy gradient (a DDPG-style sketch follows)
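A minimal DDPG-style sketch of the pathwise derivative policy gradient, assuming PyTorch and arbitrary network sizes: instead of enumerating actions to find max_a Q(s, a), the actor outputs a continuous action and is updated by pushing the critic's gradient through it.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2                  # illustrative sizes
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(1, obs_dim)            # a batch of one observation
action = actor(obs)                      # continuous action, no enumeration needed
q_value = critic(torch.cat([obs, action], dim=-1))

actor_opt.zero_grad()
(-q_value.mean()).backward()             # ascend Q w.r.t. the actor's parameters
actor_opt.step()                         # (the critic itself is trained separately)
```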
Asynchronous Advantage Actor-Critic (A3C)
imitation learning
analogous to a GAN: the actor plays the role of the generator and the learned reward function the role of the discriminator
DRL - 02 Proximal Policy Optimization (PPO)
policy gradient
on-policy and off-policy
add constraint
DRL - 03 Q-learning
introduction of Q-learning
Tips of Q-learning
Q-learning for Continuous Actions
DRL - 04 Actor-critic
AC, A2C, A3C
pathwise derivative policy gradient
DRL - 05 Sparse Reward
reward shaping
curriculum learning
hierarchical RL
DRL - 06 Imitation Learning
behavior cloning
inverse reinforcement learning