Reinforcement Learning Notes (1): Basic Concepts
Basic Concepts
Markov decision process (MDP)
Sets
- State: $\mathcal{S}$, the set of states
- Action: $\mathcal{A}(s)$, the set of actions available in state $s$, where $s \in \mathcal{S}$
- Reward: $\mathcal{R}(s,a)$, the set of rewards obtainable by taking action $a$ in state $s$
Probability distribution
- State transition probability: $p(s' \mid s, a)$, the probability of moving to state $s'$ after taking action $a$ in state $s$
- Reward probability: $p(r \mid s, a)$, the probability of receiving reward $r$ after taking action $a$ in state $s$
Policy
- Policy: $\pi(a \mid s)$, the probability of taking action $a$ in state $s$
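A minimal Python sketch (not part of the original notes) of the sets, probability distributions, and policy above, for a made-up two-state MDP; every state name, action name, and probability below is purely illustrative.

```python
import random

# Hypothetical two-state MDP, purely for illustration.
S = ["s1", "s2"]                               # state set S
A = {"s1": ["stay", "move"], "s2": ["stay"]}   # action sets A(s)

# State transition probability p(s'|s, a): (s, a) -> {s': prob}
P = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
}

# Reward probability p(r|s, a): (s, a) -> {r: prob}
R = {
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {0.0: 0.5, 1.0: 0.5},
    ("s2", "stay"): {1.0: 1.0},
}

# Stochastic policy pi(a|s): s -> {a: prob}
pi = {
    "s1": {"stay": 0.3, "move": 0.7},
    "s2": {"stay": 1.0},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dict."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]
```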
Markov property
- Memorylessness: the probabilities of the next state and the next reward depend only on the current state and action, not on the earlier history:
  $p(s_{t+1} \mid a_{t+1}, s_t, \dots, a_1, s_0) = p(s_{t+1} \mid a_{t+1}, s_t)$
  $p(r_{t+1} \mid a_{t+1}, s_t, \dots, a_1, s_0) = p(r_{t+1} \mid a_{t+1}, s_t)$
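Continuing the toy MDP sketch above (reusing `sample`, `P`, `R`, and `pi`), a one-step sampler only ever consults the current state and action, never the history, which is exactly the memorylessness stated above.

```python
def step(s, a):
    """Sample (s', r) from p(s'|s, a) and p(r|s, a); the history is never consulted."""
    s_next = sample(P[(s, a)])
    r = sample(R[(s, a)])
    return s_next, r

def rollout(s0, horizon):
    """Generate the reward sequence of a trajectory of fixed length."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = sample(pi[s])      # a ~ pi(.|s)
        s, r = step(s, a)      # (s', r) ~ p(.|s, a)
        rewards.append(r)
    return rewards
```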
Return
The return is the sum of all rewards collected along a trajectory.
- return: for example, a trajectory whose rewards are $1, 0, 0, 1$ has $\text{return} = 1 + 0 + 0 + 1 = 2$
- discounted return: when infinitely many terms are summed, a discount rate $\gamma \in [0, 1)$ is introduced as a weight. If the trajectory above stays in its final state and keeps receiving reward $1$, then \(\begin{aligned} \text{discounted return} &= 1 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + \cdots \\ &= 1 + \gamma^3 (1 + \gamma + \gamma^2 + \cdots) \\ &= 1 + \gamma^3 \cdot \frac{1}{1 - \gamma} \end{aligned}\)
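A quick numerical check (illustrative only) of the two definitions, using the example reward sequence $1, 0, 0, 1$ and, for the discounted case, a long truncated tail of reward $1$ to approximate the infinite sum:

```python
gamma = 0.9
rewards = [1, 0, 0, 1]

# Return of the finite 4-step trajectory
print(sum(rewards))  # 2

# Discounted return, approximating "reward 1 forever afterwards" with a long tail
tail = [1] * 200
approx = sum(gamma**t * r for t, r in enumerate(rewards + tail))
exact = 1 + gamma**3 / (1 - gamma)   # closed form 1 + gamma^3 * 1/(1 - gamma)
print(approx, exact)                 # the two values agree to many decimal places
```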
Episodic and continuing tasks
Episodic tasks are tasks that eventually terminate, e.g., when the agent reaches a target (terminal) state.
Tasks that have no terminal states (the interaction with the environment never stops) are called continuing tasks.
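Under the same toy MDP sketch (arbitrarily treating `"s2"` as a terminal state, an assumption made only for this example), the difference shows up in the rollout loop: an episodic rollout stops once a terminal state is reached, while a continuing task has no such stopping condition and relies on the discount factor to keep the return finite.

```python
TERMINAL = {"s2"}  # hypothetical terminal state for the episodic case

def episodic_rollout(s0, max_steps=1000):
    """Episodic task: interaction ends when a terminal state is reached."""
    s, rewards = s0, []
    for _ in range(max_steps):
        a = sample(pi[s])
        s, r = step(s, a)
        rewards.append(r)
        if s in TERMINAL:   # a continuing task would simply never break here
            break
    return rewards
```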