Reinforcement Learning Notes (7) - Temporal-Difference Methods
Temporal-Difference Learning
Solving \(w = \mathbb{E}[R + \gamma v(X)]\) with the RM algorithm: construct the functions \(\begin{aligned} g(w) &= w - \mathbb{E}[R+\gamma v(X)] \\ \tilde{g}(w,\eta) &= w - [r + \gamma v(x)] \\ &= (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R+\gamma v(X)] -[r+\gamma v(x)]) \\ &= g(w) + \eta \end{aligned}\) The problem then reduces to running the RM iteration \(w_{k+1} = w_k - \alpha_k\tilde{g}(w_k,\eta_k) = w_k - \alpha_k[w_k - (r_k + \gamma v(x_k))]\)
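A minimal Python sketch of this RM iteration, under assumptions not in the note: a made-up value table `v`, a hypothetical `sample()` that draws one pair $(r_k, x_k)$, and step sizes $\alpha_k = 1/k$ as one common choice satisfying the RM conditions.

```python
import random

gamma = 0.9
v = {0: 1.0, 1: 2.0, 2: 0.5}            # assumed fixed values v(x)

def sample():
    """Draw one sample (r_k, x_k): a noisy reward and a random state."""
    x = random.choice(list(v.keys()))
    r = 1.0 + random.gauss(0.0, 0.1)    # assumed reward distribution for R
    return r, x

w = 0.0                                  # initial estimate of E[R + gamma * v(X)]
for k in range(1, 10001):
    alpha_k = 1.0 / k                    # step size satisfying the RM conditions
    r_k, x_k = sample()
    # RM update: w_{k+1} = w_k - alpha_k * [w_k - (r_k + gamma * v(x_k))]
    w = w - alpha_k * (w - (r_k + gamma * v[x_k]))

print(w)    # approaches E[R] + gamma * E[v(X)] = 1.0 + 0.9 * 7/6 ≈ 2.05
```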
TD Learning
state values
\[\begin{cases} v_{t+1} (s_t) = v_t(s_t) - \alpha_t(s_t) \Big [v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] \Big ] \\ v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \end{cases}\]Learning is driven by data: the samples $\{(s_t, r_{t+1}, s_{t+1})\}$ are generated by following the policy $\pi$ being evaluated.
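A minimal TD(0) sketch of the update above; the toy deterministic chain, the fixed policy `policy(s)`, the hypothetical `env_step(s, a)` interface, and the constant step size are all assumptions added for illustration.

```python
# Assumed toy MDP: states 0..3 on a line, state 3 terminal with v(3) = 0.
states = [0, 1, 2, 3]
gamma = 0.9
alpha = 0.1                               # constant step size alpha_t(s_t)

def policy(s):
    """The fixed policy pi being evaluated (assumed: always move right)."""
    return "right"

def env_step(s, a):
    """Hypothetical environment: returns (reward, next state)."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    r = 1.0 if s_next == 3 else 0.0
    return r, s_next

v = {s: 0.0 for s in states}              # state-value estimates v_t(s)

for episode in range(500):
    s = 0
    while s != 3:
        r, s_next = env_step(s, policy(s))
        # TD(0) update for the visited state only; v(s) for s != s_t is unchanged.
        v[s] = v[s] - alpha * (v[s] - (r + gamma * v[s_next]))
        s = s_next

print(v)    # approximately {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0} for this toy chain
```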
TD target
\[\bar{v}_t = r_{t+1} + \gamma v_t(s_{t+1})\]Why it is called the target: \(\begin{aligned} v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar{v}_t] \\ v_{t+1}(s_t) - {\color{blue}\bar{v}_t} &= v_t(s_t) - {\color{blue}\bar{v}_t} - \alpha_t(s_t)[v_t(s_t) - \bar{v}_t] \\ v_{t+1}(s_t) - {\color{blue}\bar{v}_t} &= [1 - \alpha_t(s_t)][v_t(s_t) - {\color{blue}\bar{v}_t}] \\ \end{aligned}\)
Since $|1 - \alpha_t(s_t)| < 1$, it follows that \(|v_{t+1}(s_t) - {\color{blue}\bar{v}_t}| \leq |v_t(s_t) - {\color{blue}\bar{v}_t}|\), i.e., the update drives $v(s_t)$ toward $\bar{v}_t$, which is why $\bar{v}_t$ is called the TD target.
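A quick numerical check with made-up numbers: take $\alpha_t(s_t) = 0.1$, $v_t(s_t) = 5$ and $\bar{v}_t = 3$. Then

\[v_{t+1}(s_t) = 5 - 0.1 \times (5 - 3) = 4.8, \qquad |4.8 - 3| = 1.8 < 2 = |5 - 3|,\]

so each update moves the estimate strictly closer to the TD target.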
TD error
The TD error $\delta_t = v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] = v_t(s_t) - \bar{v}_t$ measures the correction applied to $v(s_t)$, driving it toward the true value $v_\pi(s_t)$.
Properties
- Given a policy, TD learning estimates its state values, i.e., it performs policy evaluation
- It cannot estimate action values
- It cannot find an optimal policy
- It solves the Bellman equation without a model
action values
Sarsa algorithm
\[\begin{cases} q_{t+1}(s_t,a_t) = q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[ q_t(s_t,a_t) - [r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})]\Big]\\ q_{t+1}(s,a) = q_t(s,a), \quad \forall(s,a) \neq (s_t, a_t) \end{cases}\]Requires the data $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}$.
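A minimal Sarsa sketch applying this update; the toy environment, the $\varepsilon$-greedy behavior policy, and the hypothetical `env_step(s, a)` interface are assumptions added for illustration, not part of the note.

```python
import random

# Assumed toy MDP: states 0..4 on a line, state 4 terminal; actions move left/right.
states = [0, 1, 2, 3, 4]
actions = ["left", "right"]
gamma, alpha, epsilon = 0.9, 0.1, 0.1

def env_step(s, a):
    """Hypothetical environment: returns (reward, next state)."""
    s_next = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    r = 1.0 if s_next == 4 else 0.0
    return r, s_next

q = {(s, a): 0.0 for s in states for a in actions}   # action-value estimates q_t(s, a)

def eps_greedy(s):
    """epsilon-greedy action selection with random tie-breaking."""
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(q[(s, a)] for a in actions)
    return random.choice([a for a in actions if q[(s, a)] == best])

for episode in range(1000):
    s = 0
    a = eps_greedy(s)
    while s != 4:
        r, s_next = env_step(s, a)
        a_next = eps_greedy(s_next)      # on-policy: a_{t+1} comes from the same policy
        # Sarsa update for the visited (s_t, a_t) pair only.
        q[(s, a)] = q[(s, a)] - alpha * (q[(s, a)] - (r + gamma * q[(s_next, a_next)]))
        s, a = s_next, a_next

# Greedy action in each non-terminal state; should be "right" for this toy MDP.
print({s: max(actions, key=lambda a: q[(s, a)]) for s in states if s != 4})
```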