Table of Contents

DRL - 01 Introduction

ML 23-1 deep reinforcement learning

scenario of deep reinforcement learning

  • learning to play Go
  • Supervised vs Reinforcement
  • applications

    Gym: https://gym.openai.com/

    Universe: https://openai.com/blog/universe/

  • difficulties of reinforcement learning

    reward delay: actions that earn no reward right now can look useless, but they influence what happens later and help the agent obtain future rewards.

    the agent's actions affect the subsequent data it receives; the agent needs to explore, trying both good and bad behaviors.

  • outline

Policy-based Approach - Learning an Actor

  • machine learning ≈ looking for a function
  • the three steps for finding a function
  • DRL

    1. neural network as actor

      input: a vector or matrix, e.g. pixels

      output: the probability of taking each action (a stochastic policy); see the sketch after this list

    2. goodness of function

      supervised learning vs DRL

    3. pick the best

    • gradient ascent


    • add a baseline: subtract a baseline b from the reward so the weighting term can be negative even when every reward is positive (see the sketch below)
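
A minimal sketch of the three steps above, assuming PyTorch, the classic Gym API (pre-0.26), CartPole-v1, and a running-average baseline; the hyperparameters are illustrative, not the lecture's code:

```python
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
actor = nn.Sequential(                       # step 1: neural network as actor
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 2), nn.Softmax(dim=-1),    # outputs action probabilities
)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
baseline = 0.0                               # running-average baseline b (assumption)

for episode in range(500):
    obs, log_probs, rewards, done = env.reset(), [], [], False
    while not done:                          # sample a whole trajectory
        probs = actor(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()               # stochastic: sample, don't argmax
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    ret = sum(rewards)                       # step 2: goodness = total reward R(tau)
    # step 3: gradient ascent on (R(tau) - b) * sum_t log pi(a_t | s_t)
    loss = -(ret - baseline) * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    baseline = 0.9 * baseline + 0.1 * ret    # update the baseline afterwards
```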

critics

a critic evaluates an observation (state): how good it is for the actor to be there

Actor-Critic

ML 23-2 policy gradient (Supplementary Explanation)

ML 23-3 RL

interact with environments

the behavior the machine learns affects what it observes next, so the whole sequence of actions (the trajectory) must be evaluated as a whole
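
In symbols (the standard policy-gradient formulation, with $\tau$ a trajectory, $R(\tau)$ its total reward, and $p_\theta(\tau)$ the probability that actor $\pi_\theta$ produces it), the objective and its sampled gradient are:

$$\bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big], \qquad \nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta\big(a_t^n \mid s_t^n\big)$$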

components

the environment and the reward function cannot be controlled; only the actor's behavior can be adjusted


critic


how to estimate the critic:

Monte-Carlo: watch the actor play until the end of an episode and regress V(s) toward the observed total return that followed s.

Temporal difference: regress V(s_t) toward r_t + γ·V(s_{t+1}); no need to wait for the episode to end.
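
A minimal sketch of the two targets, assuming a discount factor γ and episodes stored as reward lists; the names are illustrative:

```python
GAMMA = 0.99

def mc_targets(rewards):
    """Monte-Carlo: regress V(s_t) toward the full observed return G_t."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        targets.append(g)
    return list(reversed(targets))

def td_target(reward, v_next):
    """Temporal difference: regress V(s_t) toward r_t + gamma * V(s_{t+1}),
    so learning proceeds step by step without waiting for episode end."""
    return reward + GAMMA * v_next
```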

Q function: Q^π(s, a), the expected return after taking action a in state s and then following actor π.

if the actions cannot be enumerated (a continuous action space), the max over Q blows up; use the pathwise derivative policy gradient (PDPG) instead
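
A tabular sketch of the Q-learning update showing where the enumeration happens: the max over actions in the Bellman target is exactly what has no closed form for continuous actions (the 4-action space and hyperparameters are assumptions):

```python
import collections

GAMMA, ALPHA, N_ACTIONS = 0.99, 0.1, 4
Q = collections.defaultdict(float)          # Q[(state, action)] -> value

def q_update(s, a, r, s_next, done):
    # the max below enumerates every action; impossible for continuous actions
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(N_ACTIONS))
    target = r + GAMMA * best_next          # Bellman target
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```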

pathwise derivative policy gradient

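A minimal sketch of the pathwise derivative idea (the scheme DDPG uses), assuming PyTorch and illustrative network dimensions: the actor emits a continuous action vector and is trained by ascending the critic's Q(s, actor(s)), so no enumeration over actions is needed:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                # illustrative assumptions
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_step(states):                     # states: (batch, STATE_DIM) tensor
    actions = actor(states)                 # no argmax: the action is a vector
    q = critic(torch.cat([states, actions], dim=-1))
    loss = -q.mean()                        # gradient ascent on Q(s, actor(s))
    actor_opt.zero_grad()
    loss.backward()                         # gradients flow through Q into the actor
    actor_opt.step()
```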

Asynchronous Advantage Actor-Critic (A3C)


imitation learning


similar to a GAN:

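The GAN analogy made concrete (this is the idea behind GAIL; the dimensions and names are illustrative assumptions): the actor plays the generator, a discriminator separates expert state-action pairs from the actor's, and the discriminator's output serves as the learned reward:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                # illustrative assumptions
disc = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                     nn.Linear(64, 1), nn.Sigmoid())
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCELoss()

def disc_step(expert_sa, actor_sa):         # (batch, STATE_DIM+ACTION_DIM) each
    pred_e, pred_a = disc(expert_sa), disc(actor_sa)
    # expert pairs labeled 1, the actor's pairs labeled 0
    loss = bce(pred_e, torch.ones_like(pred_e)) + \
           bce(pred_a, torch.zeros_like(pred_a))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def learned_reward(sa):
    return torch.log(disc(sa) + 1e-8)       # higher when the actor looks expert-like
```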

DRL - 02 Proximal Policy Optimization (PPO)

policy gradient

on-policy and off-policy

add constraint
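
A minimal sketch of the clipped constraint PPO adds, assuming log-probabilities and advantages are already computed: the importance ratio between the new policy and the old (data-collecting) policy is clipped so off-policy updates cannot push the policy too far:

```python
import torch

EPS = 0.2                                   # clip range, an illustrative assumption

def ppo_loss(new_log_probs, old_log_probs, advantages):
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - EPS, 1 + EPS) * advantages
    return -torch.min(unclipped, clipped).mean()       # maximize the surrogate
```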

DRL - 03 Q-learning

introduction of Q-learning

Tips of Q-learning

Q-learning for Continuous Actions

DRL - 04 Actor-critic

AC, A2C, A3C

pathwise derivative policy gradient

DRL - 05 Sparse Reward

reward shaping

curriculum learning

hierarchical RL

DRL - 06 Imitation Learning

behavior cloning

inverse reinforcement learning