强化学习笔记（5）-蒙特卡洛方法

April 15, 2025 1 minute read

Monte Carlo Learning

Monte Carlo estimation

抛硬币的例子

Model-based

基于已知的概率计算期望值
Model-free

基于大数定理

先进行$N$ 次实验，估计出近似的概率，再计算近似期望值

MC Basic 算法

将policy iteration中基于model的部分修改为model-free的算法

策略迭代的两个步骤

\[\begin{cases} \text{Policy evaluation:} \quad v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k} \\ \text{Policy improvement:} \quad \pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_{\pi_k} \right) \end{cases}\]

The elementwise form of the policy improvement step is:

\[\begin{aligned} \pi_{k+1}(s) &= \arg\max_{\pi} \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_{\pi_k}(s') \right] \\ &= \arg\max_{\pi} \sum_a \pi(a \mid s) q_{\pi_k}(s, a), \quad s \in \mathcal{S} \end{aligned}\]

其中的关键是 $q_{\pi_k}(s, a) $

不依赖于模型的算法： $q_{\pi_k} = \mathbb{E} [ G_t | S_t = s, A_t = a]$

MC Expolring Starts

Visit

在MC Basic算法中

initial-visit method

更高效的方法（Data-efficient method）

first-visit method：只有首次访问到某个$(s,a)$时，才会使用对应的reward来更新$(s,a)$的价值
every-visit method：每次访问到某个$(s,a)$时，都会使用对应的reward来更新$(s,a)$的价值

更新policy的方法

Generalized policy iteration（GPI）

first method：从一个$(s,a)$出发收集完所有的episode之后，再计算average return来估计action value，最后改进policy。效率比较低
second method：得到一个episode之后使用return立即估计action value并改进policy

MC $\varepsilon$-Greedy algorithm

Soft Policy

\[\begin{cases} \text{deterministic} \leftarrow \text{greey policy} \\ \text{stochastic} \leftarrow \text{soft policy}, \varepsilon \text{-greedy} \end{cases}\]

$\varepsilon\text{-greedy}$

\[\pi(a|s) = \begin{cases} 1 - \frac{\varepsilon}{\mathcal{A}(s)|}(|\mathcal{A}(s)|-1), \quad \text{for the greedy action} \\ \frac{\varepsilon}{|\mathcal{A}(s)|}, \quad \text{for the other $|\mathcal{A}(s)| - 1$ action} \end{cases}\]

其中 $\varepsilon \in [0, 1]$ ，$

\mathcal{A}(s)

$ 为在状态 $s$ 下action的个数

可以证明采取 greedy action的概率比other action的概率要大
用于平衡exploitation（利用）和exploration（探索）

在求解$\pi_{k+1}(s)$时，在所有的$\varepsilon$-greedy的策略$\Pi_\varepsilon$中寻找 $\pi_{k+1}(s) = \arg \max_{\pi \in \Pi_\varepsilon} \sum_a \pi(a|s) q_|{\pi_k}(s,a)$ 得到的最优策略为 $\pi_{k+1}(a|s) = \begin{cases} 1 - \frac{|\mathcal{A}| - 1}{|\mathcal{A}|}\varepsilon, & a = a_k^* \\ \frac{1}{|\mathcal{A}|}\varepsilon, & a \neq a_k^* \end{cases}$ 其中的 $\varepsilon$ 决定了探索性的强度，$\varepsilon$ 越大，探索性越强，但是对应的最优性会降低。

Twitter Facebook LinkedIn

强化学习笔记（5）-蒙特卡洛方法

Monte Carlo Learning

Monte Carlo estimation

MC Basic 算法

MC Expolring Starts

Visit

更新policy的方法

MC $\varepsilon$-Greedy algorithm

Soft Policy

$\varepsilon\text{-greedy}$

You May Also Enjoy

强化学习笔记（7）-时序差分方法

强化学习笔记（6）-随机梯度下降

强化学习笔记（4）-值迭代与策略迭代

强化学习笔记（3）-贝尔曼最优公式