
Basic concepts

Markov decision process (MDP)

Sets

  1. State: $\mathcal{S}$
  2. Action: $\mathcal{A}(s)$, the set of actions that can be taken in state $s$, with $s \in \mathcal{S}$
  3. Reward: $\mathcal{R}(s,a)$, the set of rewards obtained by taking action $a$ in state $s$
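As a concrete illustration, the three sets can be written out for a tiny made-up MDP; the states, actions, and rewards below are hypothetical and exist only for this sketch.

```python
# Hypothetical two-state MDP, written out only to illustrate the three sets.
S = {"s1", "s2"}                    # state set S
A = {                               # A(s): actions available in each state s
    "s1": {"stay", "go"},
    "s2": {"stay"},
}
R = {                               # R(s, a): reward for taking action a in state s
    ("s1", "stay"): 0,
    ("s1", "go"): 1,
    ("s2", "stay"): 0,
}
```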

Probability distribution

  1. State transition probability: $p(s' \mid s,a)$
  2. Reward probability: $p(r \mid s,a)$
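A minimal sketch of how these two distributions might be stored and sampled; the transition and reward probabilities below are invented for illustration.

```python
import random

# p(s' | s, a): distribution over next states for each (s, a) -- made-up numbers
P_state = {("s1", "go"): {"s1": 0.2, "s2": 0.8}}
# p(r | s, a): distribution over rewards for each (s, a) -- made-up numbers
P_reward = {("s1", "go"): {0: 0.1, 1: 0.9}}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

s_next = sample(P_state[("s1", "go")])   # s' ~ p(. | s1, go)
r      = sample(P_reward[("s1", "go")])  # r  ~ p(. | s1, go)
```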

Policy

  • Policy: $\pi(a \mid s)$, the probability of taking action $a$ in state $s$
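A stochastic policy can be represented the same way: for each state, a distribution over actions from which an action is drawn. The probabilities below are arbitrary illustrative values.

```python
import random

# pi(a | s): probability of taking each action a in state s (illustrative values)
pi = {
    "s1": {"stay": 0.3, "go": 0.7},
    "s2": {"stay": 1.0},
}

def act(s):
    """Sample an action a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

a = act("s1")  # e.g. "go" with probability 0.7
```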

Markov property

  • Memorylessness: the probabilities of the next state and the reward depend only on the current state and action, not on the earlier history

    $p(s_{t+1} \mid a_{t+1},s_t,\dots,a_1,s_0) = p(s_{t+1} \mid a_{t+1},s_t)$
    $p(r_{t+1} \mid a_{t+1},s_t,\dots,a_1,s_0) = p(r_{t+1} \mid a_{t+1},s_t)$
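In code, the Markov property corresponds to an environment step whose signature takes only the current state and action, never the history. A self-contained sketch with made-up dynamics, in the same spirit as the distributions above:

```python
import random

# Made-up dynamics; keys are (s, a), values map an outcome to its probability.
P_state  = {("s1", "go"): {"s1": 0.2, "s2": 0.8}, ("s2", "stay"): {"s2": 1.0}}
P_reward = {("s1", "go"): {0: 0.1, 1: 0.9},       ("s2", "stay"): {0: 1.0}}

def step(s, a):
    """Sample (s_{t+1}, r_{t+1}): by the Markov property only the current (s, a)
    is needed, not the earlier states and actions of the trajectory."""
    states,  p_s = zip(*P_state[(s, a)].items())
    rewards, p_r = zip(*P_reward[(s, a)].items())
    return (random.choices(states,  weights=p_s, k=1)[0],
            random.choices(rewards, weights=p_r, k=1)[0])
```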

Return

The sum of all the rewards collected along a trajectory.

  1. return: for an example trajectory whose rewards are $1, 0, 0, 1$, $\text{return} = 1 + 0 + 0 + 1 = 2$

  2. discounted return: when infinitely many terms are summed, a discount rate $\gamma \in [0, 1)$ is introduced as a weight. For a trajectory whose rewards are $0, 0, 0, 1, 1, 1, \dots$: \(\begin{aligned} \text{return} &= 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + \cdots \\ &= \gamma^3 (1 + \gamma + \gamma^2 + \cdots) \\ &= \gamma^3 \cdot \frac{1}{1 - \gamma} \end{aligned}\)
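The closed form above can be checked numerically; the reward sequence below (three zeros, then reward 1 forever, truncated) and the value of $\gamma$ are chosen only for this check.

```python
# Check the discounted-return example numerically: rewards 0, 0, 0, 1, 1, 1, ...
gamma = 0.9                       # arbitrary discount rate in [0, 1)
rewards = [0, 0, 0] + [1] * 500   # truncate the infinite tail of 1s

discounted  = sum(gamma**t * r for t, r in enumerate(rewards))
closed_form = gamma**3 / (1 - gamma)
print(discounted, closed_form)    # both are approximately 7.29 for gamma = 0.9
```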

Episodic and continuing tasks

An episodic task is one that eventually stops, e.g., upon reaching a terminal (target) state.

Tasks that have no terminal states (the interaction with the environment never stops) are called continuing tasks.
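A rollout loop makes the difference concrete: an episodic task breaks out once a terminal state is reached, while a continuing task has no terminal set and would run forever. The helper below is a hypothetical sketch (the step cap is only there so it always returns).

```python
def rollout(s, step, policy, terminal_states=frozenset(), max_steps=1000):
    """Collect rewards from state s, following policy, using the step function."""
    rewards = []
    for _ in range(max_steps):
        if s in terminal_states:   # episodic task: stop once a terminal state is hit
            break
        a = policy(s)
        s, r = step(s, a)
        rewards.append(r)
    return rewards                 # a continuing task only stops at the step cap
```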