Table of Contents
DRL - 01 Introduction
ML 23-1 deep reinforcement learning
scenario of deep reinforcement learning
- learning to play Go
- Supervised vs Reinforcement
applications
Universe: https://openai.com/blog/universe/
difficulties of reinforcement learning
reward delay: some actions yield no immediate reward and look useless right now, but they influence the future and help the agent obtain rewards later (a discounted-return sketch follows this list).
agent's actions affect the subsequent data it receives, so the agent needs to explore, trying both good and bad behaviors.
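The reward-delay point can be made concrete with a discounted return: an action whose immediate reward is zero still receives credit for rewards that arrive later. A minimal sketch; the reward sequence and γ below are made-up values.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# The action at t=0 gives no immediate reward, yet its return is non-zero
# because it leads to the reward at t=2 (illustrative numbers only).
print(discounted_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]
```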
outline
Policy-based Approach - Learning an Actor
- machine learning: looking for a function
- the three steps of finding a function
DRL
neural network as actor
input: a vector or matrix, e.g. raw pixels
output: the probability of taking each action (the policy is stochastic); a minimal actor-network sketch follows
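A minimal sketch of such an actor network, assuming PyTorch (the notes do not fix a framework) and arbitrary layer sizes: the observation goes in, a probability per action comes out, and the action is sampled stochastically.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Neural network as actor: observation in, action probabilities out."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),   # probability of taking each action
        )

    def forward(self, obs):
        return self.net(obs)

actor = Actor(obs_dim=4, n_actions=2)            # sizes are illustrative
probs = actor(torch.randn(4))                    # one observation vector
action = torch.distributions.Categorical(probs).sample()   # stochastic choice
```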
goodness of function
supervised learning vs DRL
pick the best function
- gradient ascent
- add a baseline
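A minimal sketch of "pick the best function" by gradient ascent with a baseline, assuming a PyTorch actor like the one above and episodes collected as (observations, actions, total_return) tuples; that data layout is an assumption for illustration.

```python
import torch

def policy_gradient_step(actor, optimizer, episodes):
    """One gradient-ascent step on expected return, with a mean-return baseline.

    episodes: list of (obs, actions, total_return) where obs is a (T, obs_dim)
    tensor, actions a (T,) tensor, and total_return a float (assumed layout).
    """
    baseline = sum(ret for _, _, ret in episodes) / len(episodes)

    loss = 0.0
    for obs, actions, ret in episodes:
        probs = actor(obs)                                         # (T, n_actions)
        log_p = torch.distributions.Categorical(probs).log_prob(actions)
        # Maximize (R - b) * sum_t log p(a_t | s_t): episodes better than the
        # baseline have their actions made more likely, worse ones less likely.
        loss = loss - (ret - baseline) * log_p.sum()

    optimizer.zero_grad()
    (loss / len(episodes)).backward()
    optimizer.step()   # descending the negated objective == gradient ascent
```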
critics
a critic evaluates how good an observation (state) is, given an actor
Actor-Critic
ML 23-2 policy gradient (Supplementary Explanation)
ML 23-3 RL
interact with environments
the behavior the machine learns affects what it observes next, so the whole sequence of actions must be evaluated as a whole
components
the environment and the reward function cannot be controlled; we can only adjust the actor's behavior (a sketch of this interaction loop follows)
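A minimal sketch of that interaction loop; `env` is assumed to expose a simplified Gym-like reset()/step() interface, which is an assumption of this sketch.

```python
def run_episode(env, actor):
    """Roll out one episode. The env and its rewards are given and fixed;
    training can only change how the actor picks actions."""
    trajectory = []                      # the whole sequence of (obs, action, reward)
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = actor(obs)                          # the only part we control
        obs_next, reward, done = env.step(action)    # env + reward: not ours to change
        trajectory.append((obs, action, reward))
        total_reward += reward
        obs = obs_next
    # The episode is judged as a whole: the return sums the rewards of all actions.
    return trajectory, total_reward
```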
critic
estimating the critic (a sketch of the two targets follows this list):
Monte-Carlo (MC): regress V(s) toward the whole episode's return
Temporal Difference (TD): regress V(s_t) toward r_t + γ·V(s_{t+1})
Q function: the state-action value Q(s, a)
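A minimal tabular sketch of the two critic targets named above; V is a plain dict keyed by (hashable) observations, purely for illustration. Q(s, a) follows the same idea with a state-action key.

```python
def mc_update(V, episode, gamma=0.99, lr=0.1):
    """Monte-Carlo: wait for the episode to end, move V(s_t) toward the full return G_t."""
    g = 0.0
    for obs, reward in reversed(episode):            # episode = [(obs, reward), ...]
        g = reward + gamma * g
        V[obs] = V.get(obs, 0.0) + lr * (g - V.get(obs, 0.0))

def td_update(V, obs, reward, obs_next, gamma=0.99, lr=0.1):
    """Temporal difference: update after every step toward r_t + gamma * V(s_{t+1})."""
    target = reward + gamma * V.get(obs_next, 0.0)
    V[obs] = V.get(obs, 0.0) + lr * (target - V.get(obs, 0.0))
```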
if the actions cannot be enumerated (e.g. a continuous action space), taking the max over Q(s, a) blows up; use PDPG instead
pathwise derivative policy gradient (a DDPG-style sketch follows)
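A minimal DDPG-style sketch of the pathwise derivative policy gradient, assuming PyTorch and arbitrary network sizes: instead of enumerating actions to find max_a Q(s, a), the actor outputs a continuous action and is updated by pushing the critic's gradient through it.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2                  # illustrative sizes
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(1, obs_dim)            # a batch of one observation
action = actor(obs)                      # continuous action, no enumeration needed
q_value = critic(torch.cat([obs, action], dim=-1))

actor_opt.zero_grad()
(-q_value.mean()).backward()             # ascend Q w.r.t. the actor's parameters
actor_opt.step()                         # (the critic itself is trained separately)
```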
Asynchronous Advantage Actor-Critic (A3C)
imitation learning
analogous to a GAN: the actor plays the role of the generator and the learned reward function the role of the discriminator
DRL - 02 Proximal Policy Optimization (PPO)
policy gradient
on-policy and off-policy
add constraint
DRL - 03 Q-learning
introduction of Q-learning
Tips of Q-learning
Q-learning for Continuous Actions
DRL - 04 Actor-critic
AC, A2C, A3C
pathwise derivative policy gradient
DRL - 05 Sparse Reward
reward shaping
curriculum learning
hierarchical RL
DRL - 06 Imitation Learning
behavior cloning
inverse reinforcement learning