# Reinforcement Learning

## MDP

### Markov Process

$$
\langle S,P \rangle
$$

### Markov Reward Process

$$
\langle S,P,r,\gamma \rangle
$$

- Return

$$
G_t=R_t+\gamma R_{t+1}+\gamma^2R_{t+2}+\cdots=\sum^{\infty}_{k=0}\gamma^k R_{t+k}
$$

- Value function (a closed-form solution sketch follows the equation)

$$
\begin{aligned} V(s)&= \textrm{E}[G_t|{S}_t={s}] \\
&=\textrm{E}[R_t+\gamma V({S}_{t+1})|{S}_t={s}] \\
&\text{Bellman equation:}\\
&=r(s)+\gamma \sum_{s'\in S}P(s'|s)V(s')
\end{aligned}
$$
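
For a small MRP, the Bellman equation can be solved in closed form: in matrix form it reads $V=r+\gamma PV$, i.e. $V=(I-\gamma P)^{-1}r$. Below is a minimal sketch; the two-state transition matrix `P`, reward vector `r`, and discount `gamma` are made-up illustrative values, not from these notes.

```python
import numpy as np

P = np.array([[0.5, 0.5],   # P[s, s'] = P(s'|s)
              [0.2, 0.8]])
r = np.array([1.0, -1.0])   # r[s]: expected immediate reward in state s
gamma = 0.9

# V = r + gamma * P @ V  =>  (I - gamma * P) V = r
V = np.linalg.solve(np.eye(2) - gamma * P, r)
print(V)  # state values V(s)
```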

### Markov Decision Process

$$
\langle S,A,P(s'|s,a),r(s,a),\gamma \rangle
$$

- Policy

$$
\pi(a|s)=P(A_t=a|S_t=s)
$$

- State-value function

$$
V^{\pi}(s)=\textrm{E}_{\pi}[G_t|S_t=s]
$$

- Action-value function

$$
Q^{\pi}(s,a)=\textrm{E}_{\pi}[G_t|S_t=s,A_t=a]
$$

- Bellman expectation equations

$$
\begin{aligned} Q^{\pi}(s,a)&=\textrm{E}_{\pi}[R_t+\gamma Q^{\pi}(s',a')|S_t=s,A_t=a]\\
&=r(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')Q^{\pi}(s',a')\\
V^{\pi}(s)&=\textrm{E}_{\pi}[R_t+\gamma V^{\pi}(s')|S_t=s]\\
&=\sum_{a\in A}\pi(a|s)\left(r(s,a)+\gamma \sum_{s'\in S}P(s'|s,a)V^{\pi}(s')\right)
\end{aligned}
$$

- Monte Carlo methods

Repeatedly sample at random, then use probability and statistics to derive a numerical estimate of the target from the samples.

- MDP state values

Sample many trajectories on the MDP with the policy, compute the return from the state in question, and take the average:

$$
V^\pi(s)=\textrm{E}_\pi[G_t|S_t=s]\approx \frac{1}{N}\sum^N_{i=1}G_t^{(i)}
$$

- Each trajectory contributes one return per state: the cumulative reward is computed from the first time the state appears in the trajectory, and later occurrences of the same state are ignored (first-visit MC). A minimal code sketch follows the algorithm below.

$$
\begin{aligned} &(1)\ \text{Sample several trajectories with policy } \pi:\\
&s_0^{(i)} \xrightarrow{a_0^{(i)}} r_0^{(i)},s_1^{(i)}\xrightarrow{a_1^{(i)}} r_1^{(i)},s_2^{(i)}\xrightarrow{a_2^{(i)}} \cdots \xrightarrow{a_{T-1}^{(i)}} r_{T-1}^{(i)},s_T^{(i)}\\
&(2)\ \text{For the state } s \text{ at each time step } t \text{ of each trajectory:}\\
&\quad a.\ \text{update the visit counter } N(s)\leftarrow N(s)+1;\\
&\quad b.\ \text{update the total return } M(s)\leftarrow M(s)+G;\\
&(3)\ \text{Estimate each state's value as the expected return}\\
&V(s)=M(s)/N(s)\\
&\text{By the law of large numbers, } V(s)\rightarrow V^\pi(s) \text{ as } N(s)\rightarrow \infty,\\
&\text{which also gives an incremental update:}\\
&\quad a.\ N(s)\leftarrow N(s)+1;\\
&\quad b.\ V(s)\leftarrow V(s)+\frac{1}{N(s)}(G-V(s));\\
\end{aligned}
$$
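
A minimal sketch of the first-visit Monte Carlo procedure above, using the incremental update $V(s)\leftarrow V(s)+\frac{1}{N(s)}(G-V(s))$. The `episodes` argument is a hypothetical list of trajectories, each a list of `(state, reward)` pairs sampled with the policy.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    N = defaultdict(int)    # visit counter N(s)
    V = defaultdict(float)  # value estimate V(s)
    for episode in episodes:
        # compute the return G_t at every step, scanning backwards
        G, returns = 0.0, []
        for s, r in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()
        # only the first visit to each state in a trajectory counts
        seen = set()
        for s, G in returns:
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            V[s] += (G - V[s]) / N[s]  # incremental mean update
    return V
```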

- Bellman optimality equations

$$
\begin{aligned} V^*(s)&=\max_{\pi}V^{\pi}(s) \qquad \forall s\in S\\
&=\max_{a\in A}Q^*(s,a)\\
&=\max_{a \in A}\{r(s,a)+\gamma \sum_{s'\in S}P(s'|s,a)V^*(s')\}\\
Q^*(s,a)&=\max_\pi Q^\pi(s,a) \qquad \forall s\in S,a\in A\\
&=r(s,a)+\gamma \sum_{s'\in S}P(s'|s,a)V^*(s')\\
&=r(s,a)+\gamma \sum_{s'\in S}P(s'|s,a)\max_{a'\in A}Q^*(s',a')
\end{aligned}
$$

## value-based

### DP

- Policy iteration

Policy evaluation and policy improvement alternate in a loop until the optimal policy is obtained.

- Policy evaluation

$$
\begin{aligned} &V^{k+1}(s)=\sum_{a\in A}\pi(a|s)\left( r(s,a)+ \gamma\sum_{s'\in S}P(s'|s,a)V^k(s') \right)\\
&\text{As } k\rightarrow \infty \text{, the sequence } \{V^{k}\} \text{ converges to } V^\pi.\\
&\text{In practice, stop once } \max_{s\in S}|V^{k+1}(s)-V^k(s)|\leq\epsilon
\end{aligned}
$$

- Policy improvement (see the policy-iteration sketch below)

$$
\begin{aligned} &\textrm{if}\quad \exists \; \pi'\\
&\textrm{s.t.}\quad \forall s\in S,
\quad Q^\pi(s,\pi'(s))\geq V^\pi(s)\\
&\textrm{then}\quad V^{\pi'}(s)\geq V^\pi(s)\\
&\pi'(s)=\arg \max_a Q^\pi(s,a)=\arg\max_a\{r(s,a)+\gamma\sum_{s'}P(s'|s,a)V^\pi (s')\}
\end{aligned}
$$
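
A minimal tabular sketch of policy iteration, combining the evaluation and improvement steps above. It assumes hypothetical arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards) describing the MDP.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eps=1e-6):
    n_s, n_a = R.shape
    pi = np.zeros(n_s, dtype=int)  # deterministic policy pi(s)
    V = np.zeros(n_s)
    while True:
        # policy evaluation: iterate the Bellman expectation backup
        while True:
            P_pi = P[np.arange(n_s), pi]  # P(s'|s, pi(s))
            V_new = R[np.arange(n_s), pi] + gamma * P_pi @ V
            converged = np.max(np.abs(V_new - V)) <= eps
            V = V_new
            if converged:
                break
        # policy improvement: act greedily with respect to Q^pi
        Q = R + gamma * (P @ V)           # Q[s, a]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):    # policy is stable: done
            return pi, V
        pi = pi_new
```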

- Value iteration (see the sketch below)

$$
\begin{aligned} &V^{k+1}(s)=\max_{a\in A}\{r(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)V^k(s')\}\\
&\textrm{when}\quad V^{k+1}=V^k\\
&\pi(s)=\arg\max_a\{r(s,a)+\gamma\sum_{s'}P(s'|s,a)V^{k+1}(s')\}
\end{aligned}
$$
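
A minimal tabular sketch of value iteration under the same assumed `P[s, a, s']` and `R[s, a]` arrays as the policy-iteration sketch.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (P @ V)             # Q(s, a) from the current V^k
        V_new = Q.max(axis=1)               # Bellman optimality backup
        if np.max(np.abs(V_new - V)) <= eps:
            return Q.argmax(axis=1), V_new  # greedy policy and V*
        V = V_new
```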

### TD

- Monte Carlo update of the value function

$$
V(s_t)\leftarrow V(s_t)+\alpha[G-V(s_t)]
$$

- Temporal-difference update of the value function (a tabular sketch follows)

$$
\begin{aligned} &V(s_t)\leftarrow V(s_t)+\alpha[r_t+\gamma V(s_{t+1})-V(s_t)]\\
&\text{where } r_t+\gamma V(s_{t+1})-V(s_t) \text{ is called the TD error}
\end{aligned}
$$
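
A minimal tabular TD(0) sketch of the update above; `transitions` is a hypothetical stream of `(s, r, s_next, done)` tuples collected by the policy being evaluated.

```python
from collections import defaultdict

def td0(transitions, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        target = r + gamma * V[s_next] * (not done)  # bootstrapped target
        V[s] += alpha * (target - V[s])              # step along the TD error
    return V
```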

- Sarsa

$$
\begin{aligned} &\text{Initialize } Q(s,a)\\
&\textbf{for } \text{episode } e=1\rightarrow E\quad \textbf{do:}\\
&\quad \text{get the initial state } s\\
&\quad \text{choose action } a \text{ in state } s \text{ with the } \epsilon\text{-greedy policy derived from } Q\\
&\quad \textbf{for } \text{time step } t=1\rightarrow T\quad\textbf{do:}\\
&\qquad \text{observe the environment feedback } r,s'\\
&\qquad \text{choose action } a' \text{ in state } s' \text{ with the } \epsilon\text{-greedy policy derived from } Q\\
&\qquad Q(s,a)\leftarrow Q(s,a)+\alpha[r+\gamma Q(s',a')-Q(s,a)]\\
&\qquad s\leftarrow s',a\leftarrow a'\\
&\quad \textbf{end for}\\
&\textbf{end for}
\end{aligned}
$$

- On-policy

The behavior policy (the one that samples the data) and the target policy (the one updated with that data) are the same: both $a$ and $a'$ are sampled from the current $\epsilon\text{-Greedy}(Q)$ policy. A minimal tabular Sarsa sketch follows.
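
A minimal tabular Sarsa sketch of the pseudocode above. It assumes integer states and a hypothetical Gym-style environment whose `reset()` returns a state and whose `step(a)` returns `(s_next, r, done, info)`.

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))

    def act(s):  # epsilon-greedy policy derived from Q
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(Q[s].argmax())

    for _ in range(episodes):
        s = env.reset()
        a = act(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = act(s_next)  # a' from the same policy: on-policy
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done)
                                - Q[s, a])
            s, a = s_next, a_next
    return Q
```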

- Multi-step Sarsa

- Use the rewards of the next $n$ steps, then the value estimate of the state reached afterwards

$$
G_t=r_t+\gamma r_{t+1}+\cdots +\gamma^{n-1}r_{t+n-1}+\gamma^nQ(s_{t+n},a_{t+n})
$$

- Action-value update

$$
Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha[r_t+\gamma r_{t+1}+\cdots +\gamma^{n-1}r_{t+n-1}+\gamma^nQ(s_{t+n},a_{t+n})-Q(s_t,a_t)]
$$

- Q-Learning

$$
\begin{aligned} &\text{Initialize } Q(s,a)\\
&\textbf{for } \text{episode } e=1\rightarrow E\quad \textbf{do:}\\
&\quad \text{get the initial state } s\\
&\quad \textbf{for } \text{time step } t=1\rightarrow T\quad\textbf{do:}\\
&\qquad \text{choose action } a \text{ in state } s \text{ with the } \epsilon\text{-greedy policy derived from } Q\\
&\qquad \text{observe the environment feedback } r,s'\\
&\qquad Q(s,a)\leftarrow Q(s,a)+\alpha[r+\gamma \max_{a'}Q(s',a')-Q(s,a)]\\
&\qquad s\leftarrow s'\\
&\quad \textbf{end for}\\
&\textbf{end for}
\end{aligned}
$$

- Off-policy

The behavior policy (the one that samples the data) and the target policy (the one updated with that data) differ: $a$ is sampled by the behavior policy $\epsilon\text{-Greedy}(Q)$, while $a'$ is chosen by the current greedy policy $\max(Q)$. A minimal Q-Learning sketch follows.
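
A minimal tabular Q-Learning sketch with the same assumed Gym-style interface as the Sarsa sketch; note the `max` over $a'$ in the target, independent of the behavior policy.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            # target policy: greedy max over a' (off-policy)
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done)
                                - Q[s, a])
            s = s_next
    return Q
```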

## policy-based

### General method

- Objective function

$$
J(\theta)=E_{S}[V_{\pi}(S)]
$$

- Policy gradient theorem

$$
\frac{\partial J(\theta)}{\partial \theta}=\nabla_{\theta}J(\theta)=\mathbb{E}_{S}\left[\mathbb{E}_{A\sim \pi(\cdot | S;\theta)}\left[\frac{\partial \ln\pi(A|S;\theta)}{\partial \theta}\cdot Q_{\pi}(S,A)\right]\right]
$$

- Stochastic gradient: an unbiased estimate of the policy gradient

$$
{g}(s,a;\theta)\triangleq Q_\pi(s,a)\cdot \nabla_{\theta}\ln\pi(a|s;\theta)
$$

- Policy network update

$$
\theta\gets\theta+\beta\cdot g(s,a;\theta)
$$

### Policy gradient with a baseline

- Policy gradient theorem with a baseline, where $b$ is any function that does not depend on $A$

$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{S}[\mathbb{E}_{A\sim \pi(\cdot | S;\theta)}[(Q_{\pi}(S,A)-b)\cdot \nabla_{\theta}\ln\pi(A|S;\theta)]]
$$

- Stochastic gradient

$$
g_b(s,a;\theta)=[Q_{\pi}(s,a)-b]\cdot\nabla_{\theta}\ln\pi(a|s;\theta)
$$

### REINFORCE

- Discounted return

$$
U_t=\sum_{k=t}^{n}\gamma^{k-t}\cdot R_k
$$

- The action value is the expectation of the discounted return

$$
Q_\pi(s_t,a_t)=\mathbb{E}[U_t|S_t=s_t,A_t=a_t]
$$

- Monte Carlo approximation of the action value with an observed discounted return

$$
\tilde{g}(s_t,a_t;\theta)=u_t\cdot\nabla_{\theta}\ln\pi(a_t|s_t;\theta)
$$

- Policy network update (a minimal sketch follows)

$$
\theta_{\text{new}}\gets\theta_{\text{now}}+\beta\cdot\sum_{t=1}^{n}\gamma^{t-1}\cdot\tilde{g}(s_t,a_t;\theta_{\text{now}})
$$
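
A minimal PyTorch sketch of one REINFORCE episode: sample a trajectory, compute the observed discounted returns $u_t$, and take one gradient-ascent step. `policy_net` (mapping a state to action logits) and the Gym-style `env` are assumptions of the sketch, and it drops the extra $\gamma^{t-1}$ weighting, as is common in practice.

```python
import torch

def reinforce_episode(env, policy_net, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))  # ln pi(a_t | s_t; theta)
        s, r, done, _ = env.step(a.item())
        rewards.append(r)

    # discounted returns u_t, computed backwards
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.append(u)
    returns.reverse()

    # ascent on J(theta) == descent on -sum_t u_t * ln pi(a_t|s_t;theta)
    loss = -torch.stack([u_t * lp for u_t, lp in zip(returns, log_probs)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```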

### REINFORCE with a baseline

- Policy network

- Discounted return

$$
u_t=\sum_{k=t}^{n}\gamma^{k-t}\cdot r_k
$$

- Baseline: the value network's prediction

$$
\hat{v}_t=v(s_t;\omega)
$$

- Policy gradient with the baseline

$$
\tilde{g}(s_t,a_t;\theta)=(u_t-\hat{v}_t)\cdot\nabla_{\theta}\ln\pi(a_t|s_t;\theta)
$$

- Gradient ascent

$$
\theta\gets\theta+\beta\cdot\tilde{g}(s_t,a_t;\theta)
$$

- Value network

- Loss function

$$
L(\omega)=\frac{1}{2n}\sum_{t=1}^{n}[v(s_t;\omega)-u_t]^2
$$

- Gradient of the loss

$$
\nabla_{\omega}L(\omega)=\frac{1}{n}\sum_{t=1}^{n}[v(s_t;\omega)-u_t]\cdot\nabla_\omega v(s_t;\omega)
$$

- Gradient descent (a minimal sketch of the combined update follows)

$$
\omega\gets\omega-\alpha\cdot\nabla_{\omega}L(\omega)
$$
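
A minimal PyTorch sketch of the two updates above. `value_net` (a scalar network $v(s;\omega)$) and the rollout data (`states` stacked as a float tensor, the list `log_probs`, and the Monte Carlo `returns` tensor) are assumptions; a rollout like the one in the REINFORCE sketch would produce them.

```python
import torch
import torch.nn.functional as F

def update_with_baseline(states, log_probs, returns,
                         value_net, policy_opt, value_opt):
    v = value_net(states).squeeze(-1)  # baseline v(s_t; omega)
    advantage = returns - v.detach()   # u_t - v_hat_t, no grad into omega

    # gradient ascent on the policy via the baselined gradient
    policy_loss = -(advantage * torch.stack(log_probs)).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # value network regresses onto the Monte Carlo returns u_t
    value_loss = 0.5 * F.mse_loss(v, returns)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```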

### Actor-Critic

- Policy network

- Approximate the action value with a value network

$$
\hat{g}(s,a;\theta)\triangleq q(s,a;\omega)\cdot\nabla_{\theta}\ln\pi(a|s;\theta)
$$

- Policy network update

$$
\theta\gets\theta+\beta\cdot\hat{g}(s,a;\theta)
$$

- Value network

- TD target

$$
\hat{y}_t\triangleq r_t+\gamma\cdot q(s_{t+1},a_{t+1};\omega)
$$

- Loss function

$$
L(\omega)\triangleq \frac{1}{2}[q(s_t,a_t;\omega)-\hat{y}_t]^2
$$

- Gradient of the loss

$$
\nabla_{\omega}L(\omega)=[q(s_t,a_t;\omega)-\hat{y}_t]\cdot\nabla_{\omega}q(s_t,a_t;\omega)
$$

- Gradient descent (a minimal sketch of one step follows)

$$
\omega\gets\omega-\alpha\cdot\nabla_{\omega}L(\omega)
$$
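
A minimal PyTorch sketch of one actor-critic step with an action-value critic. `policy_net` (state to action logits) and `q_net` (state to a vector of per-action values $q(s,\cdot;\omega)$) are assumed networks; `(s, a, r, s_next, a_next, done)` is one on-policy transition with `s`, `s_next` as float tensors and `a`, `a_next` as long tensors.

```python
import torch

def actor_critic_step(s, a, r, s_next, a_next, done,
                      policy_net, q_net, policy_opt, q_opt, gamma=0.99):
    # actor: theta <- theta + beta * q(s,a;omega) * grad ln pi(a|s;theta)
    log_pi = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    actor_loss = -(q_net(s)[a].detach() * log_pi).sum()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()

    # critic: regress q(s,a;omega) onto the TD target y_t
    with torch.no_grad():
        y = r + gamma * q_net(s_next)[a_next] * (1.0 - done)
    critic_loss = (0.5 * (q_net(s)[a] - y) ** 2).sum()
    q_opt.zero_grad()
    critic_loss.backward()
    q_opt.step()
```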

### Advantage Actor-Critic (A2C)

- Policy network

- Bellman equation

$$
Q_\pi(s_t,a_t)=\mathbb{E}_{S_{t+1}\sim p(\cdot|s_t,a_t)}[R_t+\gamma\cdot V_\pi(S_{t+1})]
$$

- Advantage function

$$
Q_\pi(s,a)-V_\pi(s)
$$

- Approximate policy gradient

$$
\begin{aligned}g(s_t,a_t;\theta)&=[Q_\pi(s_t,a_t)-V_\pi(s_t)]\cdot\nabla_\theta\ln\pi(a_t|s_t;\theta)\\ &=[\mathbb{E}_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]-V_\pi(s_t)]\cdot\nabla_\theta\ln\pi(a_t|s_t;\theta)\\
&\text{Monte Carlo approximation:}\\
&\thickapprox[r_t+\gamma\cdot V_\pi(s_{t+1})-V_\pi(s_t)]\cdot\nabla_\theta\ln\pi(a_t|s_t;\theta)\\
&\text{Replace the state-value function } V_\pi(s) \text{ with the value network } v(s;\omega):\\
\tilde{g}(s_t,a_t;\theta)&\triangleq [r_t+\gamma\cdot v(s_{t+1};\omega)-v(s_t;\omega)]\cdot\nabla_\theta\ln\pi(a_t|s_t;\theta)
\end{aligned}
$$

- Policy network update

$$
\theta\gets\theta+\beta\cdot\tilde{g}(s,a;\theta)
$$

- Value network

- Bellman equation

$$
V_\pi(s_t)=\mathbb{E}_{A_t\sim\pi(\cdot|s_t;\theta)}[\mathbb{E}_{S_{t+1}\sim p(\cdot|s_t,A_t)}[R_t+\gamma\cdot V_\pi(S_{t+1})]]
$$

- TD target

$$
\hat{y}_t\triangleq r_t+\gamma\cdot v(s_{t+1};\omega)
$$

- Loss function

$$
L(\omega)\triangleq\frac{1}{2}[v(s_t;\omega)-\hat{y}_t]^2
$$

- Gradient of the loss

$$
\nabla_\omega L(\omega)=[v(s_t;\omega)-\hat{y}_t]\cdot\nabla_\omega v(s_t;\omega)
$$

- Gradient descent (a minimal sketch of one A2C step follows)

$$
\omega\gets\omega-\alpha\cdot\nabla_{\omega}L(\omega)
$$
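
A minimal PyTorch sketch of one A2C step: the TD error $r_t+\gamma\, v(s_{t+1};\omega)-v(s_t;\omega)$ serves both as the advantage estimate for the actor and as the critic's regression error. `policy_net` and `value_net` are assumed networks, as in the earlier sketches.

```python
import torch

def a2c_step(s, a, r, s_next, done,
             policy_net, value_net, policy_opt, value_opt, gamma=0.99):
    v_s = value_net(s)
    with torch.no_grad():
        td_target = r + gamma * value_net(s_next) * (1.0 - done)
    advantage = (td_target - v_s).detach()  # approximates Q - V

    # actor: ascend the advantage-weighted score function
    log_pi = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    actor_loss = -(advantage * log_pi).sum()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()

    # critic: descend the squared TD error
    critic_loss = (0.5 * (v_s - td_target) ** 2).sum()
    value_opt.zero_grad()
    critic_loss.backward()
    value_opt.step()
```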

### TRPO

- Policy objective

## Multi-agent

### Fully cooperative setting

- Multi-agent cooperative A2C (MAC-A2C)

- Policy network

$$
\pi(A^i|S;\theta^i)
$$

- Probability density function of the joint action $A$

$$
\pi(A|S;\theta^1,\cdots,\theta^m)\triangleq \pi(A^1 |S;\theta^1)\times\cdots\times \pi(A^m |S;\theta^m)
$$

- Policy gradient theorem for cooperative MARL

$$
\nabla_{\theta^i}J(\theta^1,\cdots,\theta^m)=\mathbb{E}_{S,A}\left[\left(Q_\pi(S,A)-b\right)\cdot \nabla_{\theta^i}\ln\pi(A^i|S;\theta^i)\right]
$$

- Stochastic gradient: an unbiased estimate of the policy gradient

$$
\begin{aligned} g^i(s,a;\theta^i)&\triangleq \left(Q_\pi(s,a)-V_\pi(s)\right)\cdot\nabla_{\theta^i}\ln\pi(a^i|s;\theta^i)\\
&\text{Approximate } Q_\pi(s_t,a_t) \text{ by } r_t+\gamma\cdot v(s_{t+1};\omega) \text{ and } V_\pi(s_t) \text{ by } v(s_t;\omega):\\
\tilde{g}^i(s_t,a_t^i;\theta^i)&\triangleq \left(r_t+\gamma\cdot v(s_{t+1};\omega)-v(s_t;\omega)\right)\cdot \nabla_{\theta^i}\ln\pi(a_t^i|s_t;\theta^i)
\end{aligned}
$$

- Policy network update

$$
\theta^i\gets\theta^i+\beta\cdot\tilde{g}^i(s_t,a_t^i;\theta^i)
$$

- Value network

$$
v(s_t;\omega)
$$

- Observations

$$
s=[o_1,\cdots,o_m]
$$

- TD target

$$
\hat{y}_t\triangleq r_t+\gamma\cdot v(s_{t+1};\omega)
$$

- Loss function

$$
L(\omega)\triangleq\frac{1}{2}[v(s_t;\omega)-\hat{y}_t]^2
$$

- Gradient of the loss

$$
\nabla_\omega L(\omega)=[v(s_t;\omega)-\hat{y}_t]\cdot\nabla_\omega v(s_t;\omega)
$$

- Gradient descent (a minimal sketch of one MAC-A2C step follows)

$$
\omega\gets\omega-\alpha\cdot\nabla_{\omega}L(\omega)
$$
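
A minimal PyTorch sketch of one MAC-A2C step: a centralized value network over the global state $s=[o_1,\cdots,o_m]$ and one baselined policy-gradient step per agent, all sharing the same TD error. `policies` (one network per agent), `value_net`, and the tensors in the transition are assumptions of the sketch.

```python
import torch

def mac_a2c_step(s, actions, r, s_next, done,
                 policies, policy_opts, value_net, value_opt, gamma=0.99):
    v_s = value_net(s)
    with torch.no_grad():
        td_target = r + gamma * value_net(s_next) * (1.0 - done)
    delta = (td_target - v_s).detach()  # shared advantage estimate

    # one policy-gradient step per agent i
    for pi_i, opt_i, a_i in zip(policies, policy_opts, actions):
        log_pi = torch.distributions.Categorical(
            logits=pi_i(s)).log_prob(a_i)  # ln pi(a^i | s; theta^i)
        loss_i = -(delta * log_pi).sum()
        opt_i.zero_grad()
        loss_i.backward()
        opt_i.step()

    # centralized critic: descend the squared TD error
    critic_loss = (0.5 * (v_s - td_target) ** 2).sum()
    value_opt.zero_grad()
    critic_loss.backward()
    value_opt.step()
```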