
Week 3 - OpenAI Project

Posted on: June 22, 2018 at 08:00 AM

Imitation Learning and Mujoco

Walk 2d

This week I worked on Homework 1: Imitation Learning from the Fall 2017 CS294 course at Berkeley. Professor Levine is an amazing lecturer and the information he covers in one lecture is quite dense.

Imitation Learning is a form of Supervised machine learning for behavior. For this exercise, we were supplied with expert policies for six different OpenAI Gym Mujoco environments. Each environment has different observation and action spaces:


The task was to train a Neural Network on these expert policies (Behavioral Cloning), compare it to the expert results, and, lastly, enhance the Neural Network with an additional aggregation step (DAgger).

A simple Neural Network with non-linear activations is typically used, although an RNN can be deployed for non-Markovian tasks, where behavior depends on all previous observations instead of just the current one.

The input to the network is an observation and the output is an action. Loss is calculated using the mean squared error between the predicted actions and expert actions.
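As a small illustration of the loss described above, here is a NumPy sketch of the mean squared error between predicted and expert actions. The function name `mse_loss` and the sample numbers are made up for this example and are not taken from my training code.

```python
import numpy as np

def mse_loss(predicted_actions, expert_actions):
    """Mean squared error averaged over all samples and action dimensions."""
    predicted_actions = np.asarray(predicted_actions, dtype=np.float64)
    expert_actions = np.asarray(expert_actions, dtype=np.float64)
    return float(np.mean((predicted_actions - expert_actions) ** 2))

# Two samples with a 2-dimensional action space (made-up numbers).
predicted = [[0.5, -1.0], [0.0, 2.0]]
expert = [[1.0, -1.0], [0.0, 0.0]]
loss = mse_loss(predicted, expert)
print(loss)
```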

The DAgger algorithm adds an additional step, where observations are generated from the trained policy, then passed to the expert policy for labeling with actions. This new experience is then aggregated into the dataset.

DAgger algorithm

Above diagram from Sergey Levine’s CS294 Lecture 2: Supervised Learning of Behaviors
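The DAgger steps described above can be sketched roughly as follows. This is a toy illustration, not my actual implementation: the linear `expert_policy`, the two-dimensional dynamics in `run_policy`, and the least-squares `fit_policy` (standing in for Neural Network training) are all hypothetical stand-ins chosen so the loop runs without Mujoco.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "expert"; the real experts are the supplied policy files.
EXPERT_W = np.array([[-0.5, 0.1], [0.0, -0.3]])

def expert_policy(obs):
    return obs @ EXPERT_W

def run_policy(policy, n_steps=50):
    """Roll out a policy in toy dynamics and return the observations it visits."""
    obs = rng.normal(size=2)
    visited = []
    for _ in range(n_steps):
        visited.append(obs)
        obs = obs + policy(obs) + 0.01 * rng.normal(size=2)
    return np.array(visited)

def fit_policy(observations, actions):
    """Supervised step: a least-squares fit standing in for network training."""
    weights, *_ = np.linalg.lstsq(observations, actions, rcond=None)
    return lambda obs: obs @ weights

# Start with plain behavioral cloning on expert roll-outs.
data_obs = run_policy(expert_policy)
data_act = expert_policy(data_obs)
policy = fit_policy(data_obs, data_act)

for iteration in range(5):
    new_obs = run_policy(policy)                      # 1. run the trained policy
    new_act = expert_policy(new_obs)                  # 2. expert labels the observations
    data_obs = np.concatenate([data_obs, new_obs])    # 3. aggregate into the dataset
    data_act = np.concatenate([data_act, new_act])
    policy = fit_policy(data_obs, data_act)           # 4. retrain on the aggregated data
```

The key point is that the observations in step 1 come from the learner's own behavior, so the dataset grows to cover the states the trained policy actually visits.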


My code can be found here:

Here are the dependencies:

Generate Rollout Data

Before training, roll-out data must be generated from the expert policy files. A single roll-out is the result of one episode, executed until the environment reports that it is done or the maximum number of timesteps is reached.

import run_expert
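To make the roll-out definition concrete, here is a minimal sketch of collecting one episode. `ToyEnv`, `toy_expert`, and `collect_rollout` are hypothetical names for this example; `ToyEnv` only mimics the older Gym-style `reset()`/`step()` interface and is not a Mujoco environment, and this is not the contents of `run_expert`.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a Gym environment (old-style reset/step API)."""
    def __init__(self, horizon=20):
        self.horizon = horizon
        self.t = 0
        self.state = np.zeros(3)

    def reset(self):
        self.t = 0
        self.state = np.ones(3)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + 0.1 * action
        self.t += 1
        reward = -float(np.sum(self.state ** 2))
        done = self.t >= self.horizon
        return self.state.copy(), reward, done, {}

def collect_rollout(env, policy, max_timesteps):
    """Run one episode until done or max_timesteps, recording (obs, action) pairs."""
    observations, actions, total_reward = [], [], 0.0
    obs = env.reset()
    for _ in range(max_timesteps):
        action = policy(obs)
        observations.append(obs)
        actions.append(action)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return np.array(observations), np.array(actions), total_reward

def toy_expert(obs):
    return -obs  # hypothetical expert: push the state towards zero

obs_data, act_data, ret = collect_rollout(ToyEnv(horizon=20), toy_expert, max_timesteps=1000)
short_obs, _, _ = collect_rollout(ToyEnv(horizon=20), toy_expert, max_timesteps=5)
```

The first call ends because the environment reports done; the second is cut off by the timestep limit.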


The network has two fully-connected layers with 50 units per layer, followed by a ReLU non-linearity. The observation data is normalized before training. I used a batch size of 32 and a learning rate of 0.001. I trained for 100 epochs for behavioral cloning and 40 epochs for DAgger.
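One common way to normalize observations is to standardize each dimension to zero mean and unit variance using statistics from the training data. The sketch below assumes that scheme; the helper names `fit_normalizer` and `normalize` and the random data are hypothetical and only illustrate the idea.

```python
import numpy as np

def fit_normalizer(observations, eps=1e-6):
    """Compute per-dimension mean and std; eps guards against division by zero."""
    mean = observations.mean(axis=0)
    std = observations.std(axis=0) + eps
    return mean, std

def normalize(observations, mean, std):
    return (observations - mean) / std

rng = np.random.default_rng(0)
# Made-up observations whose dimensions have very different scales.
train_obs = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(1000, 3))

mean, std = fit_normalizer(train_obs)
normalized = normalize(train_obs, mean, std)
```

The same `mean` and `std` should be reused when the trained policy is run in the environment, so that it sees inputs on the scale it was trained on.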

TF Graph

To train the model, run:


Training Results

| Environment | Roll-outs | Expert Rewards | BC Rewards | DAgger Rewards |
| --- | --- | --- | --- | --- |
| Ant-v1 | 250 | 4747 (459) | 905 (1) | 896 (2) |
| HalfCheetah-v1 | 10 | 4161 (69) | 4197 (76) | 4139 (57) |
| Hopper-v1 | 10 | 3780 (1) | 3581 (598) | 3775 (2) |
| Humanoid-v1 | 250 | 10402 (107) | 354 (7) | 385 (24) |
| Reacher-v1 | 10 | -3.8 (1) | -14 (5) | -12 (3) |
| Walker2d-v1 | 250 | 5513 (49) | 4993 (1058) | 5460 (132) |

The standard deviation of the rewards is in parentheses. I was able to get good results, close to the expert rewards, on HalfCheetah, Hopper, and Walker2d. The chart below compares the validation loss of DAgger and Behavioral Cloning (BC) on Walker2d. DAgger trained faster and reached a better result in 40 epochs than BC did in 100 epochs.

Walker2D Validation Loss

Behavioral Cloning:


The Walker2d video of the DAgger version looks a little smoother than the BC version.


To view loss charts after training execution, run:

tensorboard --logdir=results

Next Week:

Policy Gradients!