
Value iteration & Policy iteration

Value iteration

Solution steps

Start from an arbitrary initial guess $v_0$ and apply the fixed-point iteration \(v_{k+1} = f(v_k) = \max_\pi (r_\pi + \gamma P_\pi v_k), \quad k = 0, 1, 2, \dots\)
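To make the matrix-vector notation concrete, here is a minimal NumPy sketch, assuming the model is stored as arrays `P[s, a, s']` for $p(s'|s,a)$, `R[s, a]` for the expected immediate reward, and `pi[s, a]` for $\pi(a|s)$ (these array names are illustrative, not from the text):

```python
import numpy as np

# Assumed model layout (illustrative):
#   P[s, a, s2] = p(s2 | s, a)   -- transition tensor, shape (S, A, S)
#   R[s, a]     = E[r | s, a]    -- expected immediate reward, shape (S, A)
#   pi[s, a]    = pi(a | s)      -- policy matrix, rows sum to 1, shape (S, A)

def r_pi(R, pi):
    # r_pi(s) = sum_a pi(a|s) * E[r|s,a]
    return np.einsum("sa,sa->s", pi, R)

def P_pi(P, pi):
    # P_pi(s, s2) = sum_a pi(a|s) * p(s2|s,a)
    return np.einsum("sa,sat->st", pi, P)

def bellman_optimality_operator(R, P, v, gamma=0.9):
    # f(v)(s) = max_a [ E[r|s,a] + gamma * sum_s2 p(s2|s,a) v(s2) ]
    q = R + gamma * P @ v    # (P @ v)[s, a] = sum_s2 P[s, a, s2] * v[s2]
    return q.max(axis=1)
```

Maximizing over $\pi$ reduces to an elementwise maximum over actions, because the maximizing policy puts all of its weight on the action with the largest value in each state.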

  1. Step 1: policy update. Given $v_k$, solve
\[\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_k)\]

Elementwise, this reads \(\pi_{k+1}(s) = \arg \max_\pi \sum_a \pi(a|s) \underbrace{\left( \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s') \right)}_{q_k(s,a)}\)

The maximization is solved greedily.

Substituting $v_k$ gives $q_k(s,a)$; denote the action with the largest $q_k(s,a)$ by $a_k^*(s)$. The resulting policy is \(\pi_{k+1}(a|s) = \begin{cases} 1 & a = a_k^*(s) \\ 0 & a \neq a_k^*(s) \end{cases}\)

  2. Step 2: value update
\[v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k\]

Note that $v_k$ is not necessarily a state value; it is just an intermediate iterate.

Elementwise, \(v_{k+1}(s) = \sum_a \pi_{k+1}(a|s) \underbrace{ \left( \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s')\right)}_{q_k(s,a)}\). Since $\pi_{k+1}$ is greedy, $v_{k+1}(s)$ is simply the largest of the $q_k$ values: \(v_{k+1}(s) = \max_a q_k(s,a)\)

Pseudocode

while $\|v_k - v_{k-1}\|$ is greater than a predefined threshold {
    for everyState s in S {
        for everyAction a in A {
            q-value: $q_k(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s')$
        }
        maximumActionValue: $a_k^*(s) = \arg \max_a q_k(s,a)$
        policyUpdate: $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise
        valueUpdate: $v_{k+1}(s) = \max_a q_k(s,a)$
    }
}
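The pseudocode above translates into a short runnable sketch, under the same assumed array layout for `R` and `P`; `tol` stands in for the predefined threshold:

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-6):
    # R[s, a] = E[r|s,a], P[s, a, s2] = p(s2|s,a)  (assumed layout)
    n_states, n_actions = R.shape
    v = np.zeros(n_states)                    # arbitrary initial guess v_0
    while True:
        q = R + gamma * P @ v                 # q_k(s, a) for every (s, a)
        a_star = q.argmax(axis=1)             # a_k^*(s)
        v_new = q.max(axis=1)                 # value update: v_{k+1}(s) = max_a q_k(s, a)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    pi = np.eye(n_actions)[a_star]            # greedy policy: pi(a|s) = 1 iff a = a_k^*(s)
    return v, pi
```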

Policy iteration

Solution steps

Start from an arbitrary initial policy $\pi_0$.

  1. Step 1: policy evaluation (PE)

    Solve the Bellman equation \(v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k}v_{\pi_k}\) for $v_{\pi_k}$. This step contains its own nested iteration \(v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0,1,2,\dots\) which yields an approximation of $v_{\pi_k}$.

  2. Step 2: policy improvement (PI)

    Substitute $v_{\pi_k}$ to obtain an improved policy $\pi_{k+1}$: \(\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})\)

    It can be shown that when $\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$, we have $v_{\pi_{k+1}} \geq v_{\pi_k}$, and therefore \(v_{\pi_0} \leq v_{\pi_1} \leq \dots \leq v_{\pi_k} \leq v^*\)

    As $\{ \pi_k \}_{k=0}^\infty$ converges, $v_{\pi_k}$ also converges to the optimal value $v^*$.

Pseudocode

while the policy has not converged {
    arbitrary initial guess $v_{\pi_k}^{(0)}$
    while $v_{\pi_k}^{(j)}$ has not converged {
        for everyState s in S {
            $v_{\pi_k}^{(j+1)}(s) = \sum_a \pi_k(a|s)\left[ \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_{\pi_k}^{(j)}(s') \right]$
        }
    }
    for everyState s in S {
        for everyAction a in A {
            $q_{\pi_k}(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_{\pi_k}(s')$
        }
        $a_k^*(s) = \arg \max_a q_{\pi_k}(s,a)$
        $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$ else $\pi_{k+1}(a|s) = 0$
    }
}
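A runnable sketch of the same two steps, again under the assumed array layout for `R` and `P`; the exact policy evaluation is approximated by iterating until `eval_tol` is met:

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9, eval_tol=1e-8):
    # R[s, a] = E[r|s,a], P[s, a, s2] = p(s2|s,a)  (assumed layout)
    n_states, n_actions = R.shape
    pi = np.ones((n_states, n_actions)) / n_actions     # arbitrary initial policy pi_0
    while True:
        # Step 1: policy evaluation -- iterate v^{(j+1)} = r_pi + gamma * P_pi v^{(j)}
        r_pi = np.einsum("sa,sa->s", pi, R)
        P_pi = np.einsum("sa,sat->st", pi, P)
        v = np.zeros(n_states)                           # arbitrary initial guess v^{(0)}
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                v = v_new
                break
            v = v_new
        # Step 2: policy improvement -- greedy with respect to q_{pi_k}(s, a)
        q = R + gamma * P @ v
        pi_new = np.eye(n_actions)[q.argmax(axis=1)]
        if np.array_equal(pi_new, pi):                   # policy has converged
            return v, pi
        pi = pi_new
```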

Truncated policy iteration

Truncated policy iteration is a generalization of both value iteration and policy iteration.

The essential difference among the three lies in how many inner iterations are used to compute $v_{\pi_k}^{(j)}$ inside policy iteration:

  • value iteration: only one inner iteration is performed
  • policy iteration: infinitely many inner iterations, so the exact $v_{\pi_k}$ is obtained
  • truncated policy iteration: the number of inner iterations is between one and infinity

Solution steps

In the nested iteration of policy iteration, replace the while loop with a for loop that runs a prescribed number of iterations, as in the sketch below.
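A sketch of this modification under the same assumed array layout; `j_truncate` and `n_outer` are illustrative parameter names for the fixed number of evaluation sweeps and of outer steps. Setting `j_truncate = 1` recovers value iteration, while letting it grow large approaches policy iteration:

```python
import numpy as np

def truncated_policy_iteration(R, P, gamma=0.9, j_truncate=10, n_outer=1000):
    # R[s, a] = E[r|s,a], P[s, a, s2] = p(s2|s,a)  (assumed layout)
    n_states, n_actions = R.shape
    pi = np.ones((n_states, n_actions)) / n_actions    # arbitrary initial policy pi_0
    v = np.zeros(n_states)                             # value estimate carried across outer steps
    for _ in range(n_outer):
        # truncated policy evaluation: a fixed number of sweeps instead of "until converged"
        r_pi = np.einsum("sa,sa->s", pi, R)
        P_pi = np.einsum("sa,sat->st", pi, P)
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # policy improvement: greedy with respect to the (approximate) q-values
        q = R + gamma * P @ v
        pi = np.eye(n_actions)[q.argmax(axis=1)]
    return v, pi
```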