Fixing the Errors

The code lives in spinup/algos/pytorch/vpg/vpg.py. We try to run it, and it fails right away…

...
usage: ipykernel_launcher.py [-h] [--env_name ENV_NAME] [--render] [--lr LR]
ipykernel_launcher.py: error: unrecognized arguments: -f xxxx.json

An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

xxxx/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:xxxx: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

Hmm. This article explains how to fix that one. Run it again, and it fails again… what is it this time?!
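The linked article is not reproduced here, so the following is only a sketch of one common workaround, not necessarily the fix it describes. A Jupyter kernel is launched with `-f <connection file>`, which the script's parser does not know about; `parse_known_args` tolerates such extras instead of exiting. The two arguments below are a shortened stand-in for the real argument list.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--env', type=str, default='CartPole-v0')
parser.add_argument('--cpu', type=int, default=1)

# Simulate what a Jupyter kernel passes on the command line.
jupyter_argv = ['-f', 'xxxx.json']

# parse_known_args() returns (namespace, leftovers) and does not call
# sys.exit() when it meets an argument it does not recognize.
args, unknown = parser.parse_known_args(jupyter_argv)
print(args.env, args.cpu, unknown)
```

Inside a notebook you would call `parser.parse_known_args()` without an argument list; passing `args=[]` to `parse_args` is another option if you only want the defaults.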

...
--> 342     mpi_fork(args.cpu)  # run parallel code with mpi
...
CalledProcessError: Command '['mpirun', '-np', '4', 'xxxx/anaconda3/bin/python', 'xxxx/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py', '-f', 'xxxx/.local/share/jupyter/runtime/kernel-342bc725-7d2c-4cba-95f4-32c9b625dd61.json']' returned non-zero exit status 1.

After looking at the traceback, I simply deleted that line (the mpi_fork call), blunt as that is. Run it again, and it still fails???

... DependencyNotInstalled: No module named 'mujoco_py'. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)

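This one took me a long time. In the end it turned out to be a problem with the environment: HalfCheetah-v2 depends on mujoco_py, which is not installed. We change the default value of the env argument from HalfCheetah-v2 to CartPole-v0, that is, we change

parser.add_argument('--env', type=str, default='HalfCheetah-v2')

to

parser.add_argument('--env', type=str, default='CartPole-v0')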

Then we run it again. At last it runs successfully, and we can start digging into the code.

Vanilla Policy Gradient

Pseudocode

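VPG uses $\text{GAE-Lambda}$ (generalized advantage estimation) to estimate advantages. This requires fitting a value function $V^{\pi}(s_t)$, which is then used to compute the policy gradient for the update. An article on generalized advantage estimation is here.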
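For reference, the quantities named above are usually written as follows. This is the standard form from the GAE literature, added as a reading aid rather than taken from the code; $N$ (the number of sampled trajectories $\tau$) and $\theta$ (the policy parameters) are notation introduced for this sketch.

$$
\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}^V
$$

$$
\hat{g} = \frac{1}{N} \sum_{\tau} \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}
$$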

Code Walkthrough

VPGBuffer

"""
A buffer for storing trajectories experienced by a VPG agent interacting
with the environment, and using Generalized Advantage Estimation (GAE-Lambda)
for calculating the advantages of state-action pairs.
"""

From the docstring and the variable names we can see that VPGBuffer stores the information collected along the sampled trajectories.

store

Stores the variables of one step of the trajectory. A very simple function.
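A minimal sketch of what such a store method can look like, assuming preallocated buffers and a write pointer. `TinyBuffer` and its plain-list buffers are hypothetical stand-ins for illustration, not the actual VPGBuffer code.

```python
class TinyBuffer:
    """Hypothetical stand-in for VPGBuffer that keeps only the write logic."""

    def __init__(self, size):
        self.obs_buf = [None] * size
        self.act_buf = [None] * size
        self.rew_buf = [0.0] * size
        self.val_buf = [0.0] * size
        self.logp_buf = [0.0] * size
        self.ptr, self.max_size = 0, size

    def store(self, obs, act, rew, val, logp):
        # Refuse to write past the preallocated capacity.
        assert self.ptr < self.max_size
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.val_buf[self.ptr] = val
        self.logp_buf[self.ptr] = logp
        # Advance the pointer to the next free slot.
        self.ptr += 1


buf = TinyBuffer(size=2)
buf.store(obs=[0.1, 0.2], act=1, rew=1.0, val=0.5, logp=-0.69)
print(buf.ptr, buf.rew_buf)
```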

finish_path

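Called when a trajectory ends, or when the end of an epoch cuts one off. It uses the previously stored variables to compute adv_buf (the generalized advantage estimates) and ret_buf (the returns).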

self.adv_buf

Computes the generalized advantage estimates.

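last_val is there to make deltas computable: it supplies the value that follows the last stored step. deltas is the sequence $\{\delta_1^V,\delta_2^V,\delta_3^V,\dots\}$. (See the article on generalized advantage estimation.)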
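The role of last_val can be seen in a small sketch of the residual computation. This uses plain Python lists and a hypothetical helper name, `td_residuals`, and assumes last_val is appended to both the rewards and the values before the residuals are taken; the numbers are made up.

```python
def td_residuals(rews, vals, last_val, gamma):
    """Return delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every stored step."""
    # Append the bootstrap value so the final step has a "next" value.
    rews = list(rews) + [last_val]
    vals = list(vals) + [last_val]
    return [rews[t] + gamma * vals[t + 1] - vals[t]
            for t in range(len(rews) - 1)]


# Three steps of a finished episode, so the bootstrap value is 0.
deltas = td_residuals(rews=[1.0, 1.0, 1.0], vals=[0.5, 0.4, 0.3],
                      last_val=0.0, gamma=0.5)
print(deltas)
```

Without last_val there would be no $V(s_{t+1})$ for the final step, so the last residual could not be formed.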

Computing the advantages relies on an important function, core.discount_cumsum, which is defined in core.py. Its docstring reads:

"""
magic from rllab for computing discounted cumulative sums of vectors.

input: 
    vector x, 
    [x0, 
     x1, 
     x2]

output:
    [x0 + discount * x1 + discount^2 * x2,  
     x1 + discount * x2,
     x2]
"""

Magic indeed. Since the inputs are deltas and self.gamma * self.lam, the docstring tells us that what gets computed is exactly the generalized advantage estimate $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}$.
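To take some of the magic out, here is a plain-loop sketch that produces the output described in the docstring. It is an illustrative equivalent written for this walkthrough, not the implementation in core.py, and the sample deltas and hyperparameters are made up.

```python
def discount_cumsum(x, discount):
    """Discounted cumulative sum: out[t] = x[t] + discount * out[t + 1]."""
    out = [0.0] * len(x)
    running = 0.0
    # Walk backwards so each entry can reuse the sum already built to its right.
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out


# The docstring example with x = [1, 2, 3] and discount = 0.5.
print(discount_cumsum([1.0, 2.0, 3.0], 0.5))

# Feeding in the TD residuals with discount gamma * lam gives the GAE advantages.
deltas = [0.7, 0.75, 0.7]
gamma, lam = 0.99, 0.95
adv = discount_cumsum(deltas, gamma * lam)
print(adv)
```

The backward recursion is what makes this a single pass over the vector instead of a nested sum for every t.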

self.ret_buf

Computes the discounted rewards-to-go, which are sample estimates of the discounted value function $V^{\pi,\gamma}(s_t)$.
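Written out, and assuming the stored rewards of the trajectory are $r_t,\dots,r_{T-1}$ with last_val appended as a final entry before the discounted sum is taken (the indices $k$ and $T$ are notation introduced for this sketch), each entry of ret_buf would be:

$$
\hat{R}_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k + \gamma^{\,T-t} \cdot \texttt{last\_val}
$$

When last_val is 0, this reduces to the plain discounted reward-to-go.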

VPG

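The docstring already explains the meaning and purpose of each parameter in detail. There is a lot of code in between for saving variables and for parallel execution; we skip all of it and cover only the main body of the algorithm.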

ac

ac = actor_critic(env.observation_space, env.action_space, **ac_kwargs)

ac is created by calling actor_critic, which here is the class core.MLPActorCritic, defined in core.py. Since it inherits from torch.nn.Module, we know it is a neural network.

class MLPActorCritic(nn.Module):


    def __init__(self, observation_space, action_space, 
                 hidden_sizes=(64,64), activation=nn.Tanh):
        super().__init__()

        obs_dim = observation_space.shape[0]

        # policy builder depends on action space
        if isinstance(action_space, Box):
            self.pi = MLPGaussianActor(obs_dim, action_space.shape[0], hidden_sizes, activation)
        elif isinstance(action_space, Discrete):
            self.pi = MLPCategoricalActor(obs_dim, action_space.n, hidden_sizes, activation)

        # build value function
        self.v  = MLPCritic(obs_dim, hidden_sizes, activation)

    def step(self, obs):
        with torch.no_grad():
            pi = self.pi._distribution(obs)
            a = pi.sample()
            logp_a = self.pi._log_prob_from_distribution(pi, a)
            v = self.v(obs)
        return a.numpy(), v.numpy(), logp_a.numpy()

    def act(self, obs):
        return self.step(obs)[0]

self.pi and self.v are the policy (actor) and the value function (critic), respectively.

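The constructor checks whether action_space is of type Box or Discrete. Both are Space objects, which describe an action space or an observation space: Box represents a multi-dimensional continuous space, and Discrete represents a one-dimensional discrete space. MLPGaussianActor and MLPCategoricalActor each define a neural network. The network takes obs_dim inputs and produces action_space.shape[0] outputs (for Box) or action_space.n outputs (for Discrete), with the hidden layers specified by hidden_sizes.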
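The input and output sizes described above can be sketched as a list of layer widths. `layer_sizes` is a hypothetical helper for illustration and does not appear in core.py; the continuous example uses made-up dimensions.

```python
def layer_sizes(obs_dim, hidden_sizes, act_dim):
    """Widths of an MLP: input layer, hidden layers, output layer."""
    return [obs_dim] + list(hidden_sizes) + [act_dim]


# Discrete case, e.g. CartPole-v0: 4 observation values, action_space.n == 2,
# so the network outputs one logit per action.
print(layer_sizes(4, (64, 64), 2))

# Box case with made-up sizes: 8 observation values, action_space.shape[0] == 3,
# so the network outputs one value per action dimension.
print(layer_sizes(8, (64, 64), 3))
```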