
Basic concepts

Markov decision process (MDP)

Sets

  1. State: $\mathcal{S}$
  2. Action: $\mathcal{A}(s)$, the set of actions that can be taken in state $s$, with $s \in \mathcal{S}$
  3. Reward: $\mathcal{R}(s,a)$, the set of rewards obtained by taking action $a$ in state $s$
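As a concrete illustration, the three sets can be written out for a tiny made-up MDP; the states, actions, and rewards below are hypothetical and exist only for this sketch.

```python
# Hypothetical two-state MDP, written out only to illustrate the three sets.
S = {"s1", "s2"}                    # state set S
A = {                               # A(s): actions available in each state s
    "s1": {"stay", "go"},
    "s2": {"stay"},
}
R = {                               # R(s, a): reward for taking action a in state s
    ("s1", "stay"): 0,
    ("s1", "go"): 1,
    ("s2", "stay"): 0,
}
```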

Probability distribution

  1. State transition probability: $p(s' \mid s,a)$
  2. Reward probability: $p(r \mid s,a)$
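A minimal sketch of how these two distributions might be stored and sampled; the transition and reward probabilities below are invented for illustration.

```python
import random

# p(s' | s, a): distribution over next states for each (s, a) -- made-up numbers
P_state = {("s1", "go"): {"s1": 0.2, "s2": 0.8}}
# p(r | s, a): distribution over rewards for each (s, a) -- made-up numbers
P_reward = {("s1", "go"): {0: 0.1, 1: 0.9}}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

s_next = sample(P_state[("s1", "go")])   # s' ~ p(. | s1, go)
r      = sample(P_reward[("s1", "go")])  # r  ~ p(. | s1, go)
```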

Policy

  • Policy: $\pi(a \mid s)$, the probability of taking action $a$ in state $s$
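A stochastic policy can be represented the same way: for each state, a distribution over actions from which an action is drawn. The probabilities below are arbitrary illustrative values.

```python
import random

# pi(a | s): probability of taking each action a in state s (illustrative values)
pi = {
    "s1": {"stay": 0.3, "go": 0.7},
    "s2": {"stay": 1.0},
}

def act(s):
    """Sample an action a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

a = act("s1")  # e.g. "go" with probability 0.7
```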

Markov property

  • Memorylessness: the probabilities of the next state and the reward depend only on the current state and action, not on the earlier history

    $p(s_{t+1} \mid a_{t+1},s_t,\dots,a_1,s_0) = p(s_{t+1} \mid a_{t+1},s_t)$
    $p(r_{t+1} \mid a_{t+1},s_t,\dots,a_1,s_0) = p(r_{t+1} \mid a_{t+1},s_t)$
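In code, the Markov property corresponds to an environment step whose signature takes only the current state and action, never the history. A self-contained sketch with made-up dynamics, in the same spirit as the distributions above:

```python
import random

# Made-up dynamics; keys are (s, a), values map an outcome to its probability.
P_state  = {("s1", "go"): {"s1": 0.2, "s2": 0.8}, ("s2", "stay"): {"s2": 1.0}}
P_reward = {("s1", "go"): {0: 0.1, 1: 0.9},       ("s2", "stay"): {0: 1.0}}

def step(s, a):
    """Sample (s_{t+1}, r_{t+1}): by the Markov property only the current (s, a)
    is needed, not the earlier states and actions of the trajectory."""
    states,  p_s = zip(*P_state[(s, a)].items())
    rewards, p_r = zip(*P_reward[(s, a)].items())
    return (random.choices(states,  weights=p_s, k=1)[0],
            random.choices(rewards, weights=p_r, k=1)[0])
```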

Return

The sum of all the rewards collected along a trajectory.

  1. return: for an example trajectory whose rewards are $1, 0, 0, 1$, $\text{return} = 1 + 0 + 0 + 1 = 2$

  2. discounted return: when infinitely many terms are summed, a discount rate $\gamma \in [0, 1)$ is introduced as a weight. For a trajectory whose rewards are $0, 0, 0, 1, 1, 1, \dots$: \(\begin{aligned} \text{return} &= 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + \cdots \\ &= \gamma^3 (1 + \gamma + \gamma^2 + \cdots) \\ &= \gamma^3 \cdot \frac{1}{1 - \gamma} \end{aligned}\)
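The closed form above can be checked numerically; the reward sequence below (three zeros, then reward 1 forever, truncated) and the value of $\gamma$ are chosen only for this check.

```python
# Check the discounted-return example numerically: rewards 0, 0, 0, 1, 1, 1, ...
gamma = 0.9                       # arbitrary discount rate in [0, 1)
rewards = [0, 0, 0] + [1] * 500   # truncate the infinite tail of 1s

discounted  = sum(gamma**t * r for t, r in enumerate(rewards))
closed_form = gamma**3 / (1 - gamma)
print(discounted, closed_form)    # both are approximately 7.29 for gamma = 0.9
```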

Episodic and continuing tasks

An episodic task is one that eventually stops, e.g., upon reaching a terminal (target) state.

Tasks that have no terminal states (the interaction with the environment never stops) are called continuing tasks.
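A rollout loop makes the difference concrete: an episodic task breaks out once a terminal state is reached, while a continuing task has no terminal set and would run forever. The helper below is a hypothetical sketch (the step cap is only there so it always returns).

```python
def rollout(s, step, policy, terminal_states=frozenset(), max_steps=1000):
    """Collect rewards from state s, following policy, using the step function."""
    rewards = []
    for _ in range(max_steps):
        if s in terminal_states:   # episodic task: stop once a terminal state is hit
            break
        a = policy(s)
        s, r = step(s, a)
        rewards.append(r)
    return rewards                 # a continuing task only stops at the step cap
```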