Markov Decision Process (MDP)

An MDP is usually represented by the five-tuple (S, A, P, R, γ):
S - state set
A - action set
P - state transition probability $P(s'|s,a)$
R - reward function
γ - discount factor
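As an illustration only (the states, actions, probabilities, and rewards below are made up for this sketch, not taken from these notes), a small finite MDP can be written down directly as the five-tuple:

```python
# A toy two-state MDP written as plain Python data (S, A, P, R, gamma).
S = ["s0", "s1"]
A = ["stay", "move"]

# P[(s, a)] -> list of (next_state, probability)
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "move"): [("s1", 0.9), ("s0", 0.1)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 0.9), ("s1", 0.1)],
}

# R[(s, a)] -> immediate reward for taking action a in state s
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.5, ("s1", "move"): 0.0}

gamma = 0.9  # discount factor
```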

Cumulative reward
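Under discount factor $\gamma$, the cumulative (discounted) reward from time step $t$, written $G_t$, is

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}$$

and the state-value function used below is its expectation when following policy $\pi$: $V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right]$.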

How to estimate $V^\pi(s)$?
1. Monte-Carlo (MC)

$S_a \longrightarrow V^\pi \longrightarrow V^\pi(s_a) \longrightarrow G_a$

$S_b \longrightarrow V^\pi \longrightarrow V^\pi(s_b) \longrightarrow G_b$

The estimator $V^\pi$ is trained so that its output $V^\pi(s_a)$ matches the observed return $G_a$ (and likewise $V^\pi(s_b)$ matches $G_b$).

MC has larger variance: $G_a$ is the summation of rewards over many steps, and since $Var(kX) = k^2 Var(X)$, the variance of the accumulated return grows with the number of summed rewards. A sketch of this estimator is given below.
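A rough sketch of an every-visit Monte-Carlo estimate of $V^\pi$ for a tabular problem (the episode format, default $\gamma$, and the function name mc_value_estimation are assumptions for illustration, not from these notes):

```python
from collections import defaultdict

def mc_value_estimation(episodes, gamma=0.9):
    """Every-visit Monte-Carlo estimate of V^pi(s).

    `episodes` is a list of trajectories collected by following pi;
    each trajectory is a list of (state, reward) pairs.
    """
    returns_sum = defaultdict(float)  # sum of observed returns G per state
    returns_cnt = defaultdict(int)    # number of visits per state

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so that G_t = r_t + gamma * G_{t+1}
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1

    # V^pi(s) is the average of the returns observed from s
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```

Because each target $G_a$ sums rewards over a whole episode, its variance grows with the episode length, which is the point made above.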
2. Temporal-Difference (TD)

Given a transition $\dots, s_t, a_t, r_t, s_{t+1}, \dots$ collected while following $\pi$:

$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$

$s_t \longrightarrow V^\pi \longrightarrow V^\pi(s_t)$

$s_{t+1} \longrightarrow V^\pi \longrightarrow V^\pi(s_{t+1})$

$\Longrightarrow V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$

(With a discount factor $\gamma$, this becomes $V^\pi(s_t) = r_t + \gamma V^\pi(s_{t+1})$.)

TD has smaller variance, because only the single reward $r_t$ is random in each update. However, the bootstrapped estimate $V^\pi(s_{t+1})$ may itself be inaccurate, so the update target can be off. A sketch of a TD(0) update follows.
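A minimal sketch of the corresponding TD(0) update for a tabular value estimate (the dict representation, default step size $\alpha$, and the function name td0_update are assumptions):

```python
def td0_update(V, s_t, r_t, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) update of a tabular value estimate V (dict: state -> value).

    Moves V[s_t] toward the bootstrapped target r_t + gamma * V[s_next];
    with gamma = 1.0 this matches V^pi(s_t) = V^pi(s_{t+1}) + r_t above.
    """
    target = r_t + gamma * V.get(s_next, 0.0)
    td_error = target - V.get(s_t, 0.0)
    V[s_t] = V.get(s_t, 0.0) + alpha * td_error
    return V
```

Unlike the MC estimator, this can be applied to each transition $(s_t, a_t, r_t, s_{t+1})$ as it is observed, without waiting for the episode to finish.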

Core update rule of Q-Learning

$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $s'$ is the next state reached after taking action $a$ in state $s$.
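A minimal tabular sketch of this update (the dict-based Q-table layout and the helper name q_learning_update are assumptions for illustration, not from these notes):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update.

    Q is a dict mapping (state, action) -> value; `actions` lists the
    actions available in s_next. Implements
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions) if actions else 0.0
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q
```

In practice the action $a$ is typically chosen ε-greedily from $Q$, and the update is applied to every observed transition $(s, a, r, s')$.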