Reinforcement Learning Notes (4): Value Iteration and Policy Iteration
Value iteration & Policy iteration
Value iteration
Solution steps
Start from an arbitrary initial value $v_0$ and apply the iteration \(v_{k+1} = f(v_k) = \max_\pi (r_\pi + \gamma P_\pi v_k), \quad k = 0, 1, 2, \dots\)
- Step 1: policy update. Given the current $v_k$, solve \(\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_k)\)
Its elementwise form is \(\pi_{k+1}(s) = \arg \max_\pi \sum_a \pi(a|s) \underbrace{\left( \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s') \right)}_{q_k(s,a)}\)
This is solved greedily:
Substituting $v_k$ gives $q_k(s,a)$; denote the action with the largest $q_k(s,a)$ by $a_k^*(s)$. The resulting policy is \(\pi_{k+1}(a|s) = \begin{cases} 1 & a = a_k^*(s) \\ 0 & a \neq a_k^*(s) \end{cases}\)
- Step 2: value update
Note that $v_k$ here is not necessarily a state value.
The elementwise form is \(v_{k+1}(s) = \sum_a \pi_{k+1}(a|s) \underbrace{ \left( \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s')\right)}_{q_k(s,a)}\) Since $\pi_{k+1}$ is greedy, the resulting $v_{k+1}(s)$ is simply the largest of the $q_k$ values: \(v_{k+1}(s) = \max_a q_k(s,a)\)
Pseudocode
while $\|v_k - v_{k-1}\| >$ threshold {
    for everyState s in S {
        for everyAction a in A {
            q-value: $q_k(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s')$
        }
        maximumActionValue: $a_k^*(s) = \arg \max_a q_k(s,a)$
        policyUpdate: $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise
        valueUpdate: $v_{k+1}(s) = \max_a q_k(s,a)$
    }
}
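To make the loop above concrete, here is a minimal NumPy sketch of value iteration on a small hypothetical 2-state, 2-action MDP; the transition tensor `P`, expected-reward matrix `R`, and discount `gamma` are made-up illustration values, not taken from the notes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (made-up numbers for illustration).
# P[s, a, s'] = p(s'|s,a);  R[s, a] = sum_r p(r|s,a) * r  (expected immediate reward)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    n_states = P.shape[0]
    v = np.zeros(n_states)                   # arbitrary initial guess v_0
    while True:
        # q_k(s,a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_k(s')
        q = R + gamma * P @ v                # shape: (n_states, n_actions)
        pi = q.argmax(axis=1)                # policy update: greedy action a_k^*(s)
        v_new = q.max(axis=1)                # value update: v_{k+1}(s) = max_a q_k(s,a)
        if np.max(np.abs(v_new - v)) < tol:  # stop when ||v_{k+1} - v_k|| falls below tol
            return v_new, pi
        v = v_new

v_star, pi_star = value_iteration(P, R, gamma)
print("approximate v*:", v_star)
print("greedy policy :", pi_star)
```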
Policy iteration
Solution steps
Start from an arbitrary initial policy $\pi_0$.
- Step 1: policy evaluation (PE)
Solve the Bellman equation \(v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k}v_{\pi_k}\) for $v_{\pi_k}$. This step itself contains a nested iteration \(v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0,1,2,\dots\) which yields an approximation of $v_{\pi_k}$ (see the numerical sketch after these steps).
- Step 2: policy improvement (PI)
Substitute $v_{\pi_k}$ and solve for an improved policy $\pi_{k+1}$: \(\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})\)
It can be shown that when $\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$, we have $v_{\pi_{k+1}} \geq v_{\pi_k}$, and therefore \(v_{\pi_0} \leq v_{\pi_1} \leq \dots \leq v_{\pi_k} \leq v^*\)
When the sequence $\{ \pi_k \}_{k=0}^\infty$ converges, $v_{\pi_k}$ also converges to the optimal value $v^*$.
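To make Step 1 concrete, the sketch below runs the nested iteration $v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}$ on made-up values of $r_{\pi_k}$ and $P_{\pi_k}$ and compares the result with the closed-form solution $v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$ implied by the Bellman equation.

```python
import numpy as np

# Hypothetical policy-induced quantities (made-up numbers for illustration):
# r_pi[s] = expected one-step reward under pi;  P_pi[s, s'] = transition probability under pi.
r_pi = np.array([1.0, 0.5])
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
gamma = 0.9

# Nested iteration from Step 1: v^{(j+1)} = r_pi + gamma * P_pi @ v^{(j)}
v = np.zeros(2)                          # arbitrary initial guess v^{(0)}
for j in range(10_000):
    v_next = r_pi + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

# Closed form of the Bellman equation v = r_pi + gamma * P_pi v
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print("iterative solution:", v_next)
print("closed-form       :", v_exact)    # the two agree up to the tolerance
```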
Pseudocode
while the policy has not converged {
    arbitrary initial guess $v_{\pi_k}^{(0)}$
    while $v_{\pi_k}^{(j)}$ has not converged {
        for everyState s in S {
            $v_{\pi_k}^{(j+1)}(s) = \sum_a \pi_k(a|s)\left[ \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_{\pi_k}^{(j)}(s') \right]$
        }
    }
    for everyState s in S {
        for everyAction a in A {
            $q_{\pi_k}(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_{\pi_k}(s')$
        }
        $a_k^*(s) = \arg \max_a q_{\pi_k}(s,a)$
        $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$ else $\pi_{k+1}(a|s) = 0$
    }
}
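The pseudocode above maps directly onto a short NumPy sketch. Again the MDP (`P`, `R`, `gamma`) is a made-up 2-state, 2-action example used only for illustration.

```python
import numpy as np

# Hypothetical MDP (made-up numbers): P[s, a, s'] = p(s'|s,a); R[s, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n_states = P.shape[0]

def policy_evaluation(pi, tol=1e-10):
    """Step 1 (PE): iterate v^{(j+1)} = r_pi + gamma * P_pi v^{(j)} until convergence."""
    r_pi = R[np.arange(n_states), pi]        # r_pi(s)    = R(s, pi(s))
    P_pi = P[np.arange(n_states), pi]        # P_pi(s,s') = p(s'|s, pi(s))
    v = np.zeros(n_states)                   # arbitrary initial guess v^{(0)}
    while True:
        v_next = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

def policy_improvement(v):
    """Step 2 (PI): greedy policy w.r.t. q(s,a) = R(s,a) + gamma * sum_{s'} p(s'|s,a) v(s')."""
    q = R + gamma * P @ v
    return q.argmax(axis=1)

pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy pi_0
while True:
    v_pi = policy_evaluation(pi)             # Step 1: policy evaluation
    pi_new = policy_improvement(v_pi)        # Step 2: policy improvement
    if np.array_equal(pi_new, pi):           # stop once the policy no longer changes
        break
    pi = pi_new

print("optimal policy:", pi)
print("optimal values:", v_pi)
```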
Truncated policy iteration
A generalization of both value iteration and policy iteration.
The essential difference among the three lies in how many times the inner iteration for $v_{\pi_k}^{(j)}$ in policy evaluation is run:
- value iteration: only one iteration
- policy iteration: infinitely many iterations, so that $v_{\pi_k}$ is obtained exactly
- truncated policy iteration: a number of iterations between one and infinity
Solution steps
In the nested iteration inside policy iteration, change the while loop into a for loop with a specified number of iterations.
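A minimal sketch of this change, reusing the same made-up MDP as in the earlier sketches: the inner evaluation loop becomes a `for` loop over a fixed, hypothetical count `j_truncate`, and the value estimate is carried over between outer iterations as the initial guess. With `j_truncate = 1` this behaves like value iteration; letting it grow large approaches policy iteration.

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as before (made-up numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n_states = P.shape[0]

def truncated_policy_iteration(j_truncate, max_outer=10_000, tol=1e-8):
    pi = np.zeros(n_states, dtype=int)       # arbitrary initial policy pi_0
    v = np.zeros(n_states)                   # value estimate, carried across outer iterations
    for _ in range(max_outer):
        # Truncated policy evaluation: only j_truncate sweeps instead of 'until converged'.
        r_pi = R[np.arange(n_states), pi]
        P_pi = P[np.arange(n_states), pi]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to the (approximate) value.
        q = R + gamma * P @ v
        pi = q.argmax(axis=1)
        if np.max(np.abs(q.max(axis=1) - v)) < tol:  # small Bellman residual => near-optimal
            break
    return pi, v

print(truncated_policy_iteration(j_truncate=5))
```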