
Temporal-Difference Learning

Use the RM (Robbins-Monro) algorithm to solve \(w = \mathbb{E}[R + \gamma v(X)]\). Construct the function \(\begin{aligned} g(w) &= w - \mathbb{E}[R+\gamma v(X)] \\ \tilde{g}(w,\eta) &= w - [r + \gamma v(x)] \\ &= (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R+\gamma v(X)] -[r+\gamma v(x)]) \\ &= g(w) + \eta \end{aligned}\) The problem then reduces to running the RM iteration \(w_{k+1} = w_k - \alpha_k\tilde{g}(w_k,\eta_k) = w_k - \alpha_k[w_k - (r_k + \gamma v(x_k))]\)
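A minimal sketch of this RM iteration in Python, assuming we can draw i.i.d. samples \((r_k, x_k)\) and that \(v(\cdot)\) is a known function; the distributions and the function `v` below are illustrative choices, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def v(x):
    # assumed known function v(.) of the sampled variable X (illustrative)
    return x ** 2

w = 0.0                                   # initial guess w_1
for k in range(1, 10_001):
    r_k = rng.normal(1.0, 0.5)            # i.i.d. sample of R
    x_k = rng.uniform(0.0, 1.0)           # i.i.d. sample of X
    alpha_k = 1.0 / k                     # step sizes satisfying the RM conditions
    # w_{k+1} = w_k - alpha_k * [w_k - (r_k + gamma * v(x_k))]
    w = w - alpha_k * (w - (r_k + gamma * v(x_k)))

print(w)  # approaches E[R + gamma * v(X)] = 1 + 0.9 * E[X^2] = 1.3
```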

TD Learning

state values

Reinforcement learning from data: the samples $\{(s_t, r_{t+1}, s_{t+1})\}$ are generated by following a policy $\pi$ whose state values we want to estimate.

\[\begin{cases} v_{t+1} (s_t) = v_t(s_t) - \alpha_t(s_t) \Big [v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] \Big ] \\ v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \end{cases}\]
  • TD target

    \[\bar{v}_t = r_{t+1} + \gamma v_t(s_{t+1})\]

    Why it is called the target: \(\begin{aligned} v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - \bar{v}_t] \\ v_{t+1}(s_t) - {\color{blue}\bar{v}_t} &= v_t(s_t) - {\color{blue}\bar{v}_t} - \alpha_t(s_t)[v_t(s_t) - \bar{v}_t] \\ v_{t+1}(s_t) - {\color{blue}\bar{v}_t} &= [1 - \alpha_t(s_t)][v_t(s_t) - {\color{blue}\bar{v}_t}] \\ \end{aligned}\)

    Since $|1 - \alpha_t(s_t)| < 1$, we have \(|v_{t+1}(s_t) - {\color{blue}\bar{v}_t}| \leq |v_t(s_t) - {\color{blue}\bar{v}_t}|\), so each update moves $v(s_t)$ closer to the TD target $\bar{v}_t$.

  • TD error

\[\delta_t = v(s_t) - [r_{t+1} + \gamma v(s_{t+1})]\]

​ It corrects $v(s_t)$ toward the true value $v_\pi(s_t)$.
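A minimal tabular TD(0) sketch of the state-value update above, assuming a hypothetical `env` with `reset()`/`step()` methods and a fixed `policy(s)`; these names are placeholders, not a specific library API:

```python
import numpy as np

def td0_state_values(env, policy, gamma=0.9, alpha=0.1, num_episodes=500):
    """Tabular TD(0) policy evaluation; the env/policy interface is assumed."""
    v = np.zeros(env.num_states)
    for _ in range(num_episodes):
        s = env.reset()                                       # initial state
        done = False
        while not done:
            a = policy(s)                                     # action from the given policy pi
            s_next, r, done = env.step(a)                     # observe r_{t+1}, s_{t+1}
            target = r + gamma * (0.0 if done else v[s_next]) # TD target
            delta = v[s] - target                             # TD error
            v[s] -= alpha * delta                             # update only v(s_t)
            s = s_next
    return v
```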

Properties

  • Given a policy, it estimates that policy's state values, i.e., it performs policy evaluation
    • It cannot estimate action values
    • It cannot produce an optimal policy
  • It solves the Bellman equation without a model

action values

The Sarsa algorithm

Requires data $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}$

\[\begin{cases} q_{t+1}(s_t,a_t) = q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[ q_t(s_t,a_t) - [r_{t+1} + \gamma q_t(s_{t+1},a_{t+1})]\Big]\\ q_{t+1}(s,a) = q_t(s,a), \quad \forall(s,a) \neq (s_t, a_t) \end{cases}\]
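A minimal tabular Sarsa sketch of this update, again assuming a hypothetical `env` interface and an ε-greedy behavior policy; the sizes and method names are assumptions for illustration:

```python
import numpy as np

def sarsa(env, gamma=0.9, alpha=0.1, epsilon=0.1, num_episodes=500, seed=0):
    """Tabular Sarsa with an epsilon-greedy policy; the env interface is assumed."""
    rng = np.random.default_rng(seed)
    q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(env.num_actions))
        return int(np.argmax(q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)                                     # a_t
        done = False
        while not done:
            s_next, r, done = env.step(a)                     # r_{t+1}, s_{t+1}
            a_next = eps_greedy(s_next)                       # a_{t+1} from the current policy
            target = r + gamma * (0.0 if done else q[s_next, a_next])
            # q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - alpha * [q_t(s_t, a_t) - target]
            q[s, a] -= alpha * (q[s, a] - target)
            s, a = s_next, a_next
    return q
```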