Stanford University: Training Smarter Bots for the Real World

In the fall of 2021, dozens of computer scientists submitted their best work to an AI bot challenge hosted by the Conference on Neural Information Processing Systems (NeurIPS), an annual machine learning event for distinguished researchers. Participants spent months preparing their agents to outperform the competition on four “almost lifelike” tasks in the virtual world of Minecraft:

Find a cave
Make a waterfall
Create a village animal pen
Build a village house

To mimic the complexity of real-world situations, the organizers required each agent in the competition to learn the tasks by watching human demonstrations, without the use of rewards that typically reinforce a desired robot behavior. This was a significant change from prior contest rules, and it meant that most teams would have to cope with a slower and more complicated bot training process.

For Divyansh Garg and Edmund Mills, who entered the competition as Team Obsidian just weeks before the deadline, the requirement presented an opportunity to shine. With less time and fewer resources than other teams, they rose to the top of the leaderboard and placed first in the Imitation Learning category (designated for agents that interact with their environments to learn rewards or policies). To their surprise, Team Obsidian also placed second overall, a noteworthy achievement: unlike many competing agents, theirs did not use human feedback to boost its performance while playing the game.

The key to Team Obsidian’s remarkable success is a breakthrough approach to Imitation Learning called IQ-Learn. In the months leading up to the competition, officially known as the MineRL Benchmark for Agents that Solve Almost Lifelike Tasks (BASALT) challenge, Garg had been developing this new method in collaboration with Stefano Ermon, an associate professor in the Department of Computer Science at Stanford. IQ-Learn could already play classic Atari games better than a human expert, and it was fast becoming the state of the art for training AI agents that work in dynamic environments.

A Passion for Deep Learning
Today’s industrial robots are very good at learning to repeat an exact task through a process called behavioral cloning. But when something changes in the environment that the machine has not encountered before, it cannot adjust on the fly: the mistakes compound, and the machine never recovers. If we expect one day to have AI agents that can drive cars, wash the dishes, or do the laundry as well as or better than humans, we need different ways of teaching them.
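Behavioral cloning treats control as supervised learning: fit a policy to (state, action) pairs recorded from an expert. The sketch below is a minimal, hypothetical illustration of that idea (a one-dimensional task with a logistic-regression policy), not any system described in this article:

```python
import numpy as np

# Behavioral cloning: supervised learning on (state, action) pairs from an expert.
# Hypothetical 1-D task: the expert moves right (action 1) when position < 0, else left (0).
rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(200, 1))
actions = (states[:, 0] < 0).astype(float)  # expert-demonstrated labels

# A logistic-regression "policy" trained by gradient descent on cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(500):
    logits = w * states[:, 0] + b
    probs = 1 / (1 + np.exp(-logits))
    grad_logits = probs - actions           # dLoss/dlogits for cross-entropy
    w -= 0.1 * np.mean(grad_logits * states[:, 0])
    b -= 0.1 * np.mean(grad_logits)

def policy(s):
    """Cloned policy: pick the action the expert most likely took in state s."""
    return int(1 / (1 + np.exp(-(w * s + b))) > 0.5)

print(policy(-0.5), policy(0.5))  # mimics the expert near the demonstrated states
```

Because the policy is trained only on states the expert visited, a small mistake can drift the robot into unfamiliar states where it has no supervision, which is exactly how the compounding-error problem arises.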

Read the paper, IQ-Learn: Inverse soft-Q Learning for Imitation

As a student of computer science at Stanford with experience in robotic learning and generative modeling, Garg recognized that the next frontier for intelligent machines would involve building versatile agents that could learn to do complex tasks in ever-changing environments.

“What a human can learn in an hour, a robot would need 10 years,” he says. “I wanted to design an algorithm that could learn and transfer behavior as efficiently as humans.”

Imitation of an Expert
During an internship with machine learning researcher Ian Goodfellow at Apple, Garg had come to understand several key concepts that informed how scientists were training smarter agents:

Reinforcement Learning (RL) methods enabled an agent to interact with an environment, but researchers had to include a reward signal for the robot to learn a policy, or desired action.
A subfield of RL called Q-learning allowed the agent to start with a known reward and then learn what the Deep Learning community calls an energy-based model, or Q-function. Energy-based models, a concept borrowed from statistical physics, can find relationships within a small dataset and then generalize to a larger dataset that follows the same patterns. In this way, the Q-function can represent the intended policy for the robot to follow.
A related approach known as Imitation Learning held promise because it empowered an agent to learn the policy from watching visual demonstrations of an expert (human) doing the task.
Inverse Reinforcement Learning had been considered state-of-the-art for the past five years because, in theory, it took Imitation Learning a step further. In this case, instead of trying to learn a policy, the agent’s goal is to figure out a reward that explains the human example. The catch is that Inverse RL requires an adversarial training process, meaning the model must solve mathematically for two unknown variables: a reward and a policy. According to Garg, this process is difficult to stabilize and does not scale well to more complex situations.

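To make the Q-function above concrete: in its simplest tabular form, Q-learning repeatedly nudges Q(s, a) toward the observed reward plus the discounted value of the best next action. The toy chain environment below is purely illustrative, not the MineRL setup:

```python
import numpy as np

# Tabular Q-learning on a toy 5-state chain: start at state 0, reward 1 at state 4.
# Actions: 0 = left, 1 = right. Illustrative only -- not the competition environment.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.5
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for _ in range(500):                        # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit the current Q, sometimes explore
        a = rng.integers(n_actions) if rng.random() < 0.2 else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:-1].argmax(axis=1))  # learned policy: go right in every non-terminal state
```

Note that the reward signal `r` is given to the agent here; the BASALT rules removed exactly that crutch, which is what makes learning from demonstrations alone so much harder.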
With these concepts as the backdrop, Garg began thinking about how to achieve better results with a simpler approach to Imitation Learning. A nagging question began to keep him up at night: “What if you could solve for just one unknown variable instead of two?” If the two variables of reward and policy could be represented by a single, hidden Q-function, he reasoned, and if the agent learned this Q-function from watching human demonstrations, it could circumvent the need for problematic adversarial training.
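In the soft (maximum-entropy) formulation that IQ-Learn builds on, a single Q-function does determine both quantities in closed form: the policy as pi(a|s) = exp(Q(s,a) - V(s)) and an implied reward as r(s,a) = Q(s,a) - gamma * V(s'), where V(s) = log sum_a exp Q(s,a) is the soft value. The snippet below illustrates those identities on an arbitrary toy Q-table with hypothetical transitions; it is a simplification for intuition, not the IQ-Learn training objective itself:

```python
import numpy as np

# Given a soft Q-function, the policy and the reward both fall out in closed form:
# no separate reward model, hence no adversarial two-player training.
# Toy deterministic MDP: 3 states, 2 actions; action a in state s leads to T[s, a].
gamma = 0.9
Q = np.array([[1.0, 0.2],
              [0.5, 1.5],
              [0.0, 0.0]])     # arbitrary Q values for illustration
T = np.array([[1, 2],
              [2, 0],
              [2, 2]])         # hypothetical transition table

V = np.log(np.exp(Q).sum(axis=1))          # soft value: V(s) = log sum_a exp Q(s, a)
policy = np.exp(Q - V[:, None])            # soft policy: pi(a|s) = exp(Q(s,a) - V(s))
reward = Q - gamma * V[T]                  # implied reward: r(s,a) = Q(s,a) - gamma*V(s')

print(policy.round(3))
print(reward.round(3))
```

Because both objects are deterministic functions of Q, learning the single Q-function from demonstrations recovers everything the adversarial two-variable formulation was solving for.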

Garg spent his winter break working out an algorithm and coding it. He was surprised when it worked the first time around. After one month of development, the algorithm was beating every other existing method on simple tasks and had proved to be exceptionally stable.

He recalls, “Professor Ermon looked at the results and said, ‘This is great, but why does it work?’ We didn’t know of any theory that could explain it, so I took on the challenge to write a mathematical framework that could prove the algorithm was optimal.”

Expert-Level Performance
Fast-forward to the summer of 2021, and this new method of inverse soft-Q learning (IQ-Learn for short) had achieved three to seven times better performance than previous methods of learning from humans. Garg and his collaborators first tested the agent’s abilities with several control-based video games: Acrobot, CartPole, and LunarLander. In each game, the agent reached expert-level performance faster than any other method.

An agent plays Atari games including Pong and Space Invaders

Next, they tested the model on several classic Atari games, including Pong, Breakout, and Space Invaders, and discovered their innovation also scaled well in more complex gaming environments. “We exceeded previous bests by 5x while requiring three times fewer environment steps, reaching close to expert performance,” Garg recalls. (An environment step is a single interaction with the environment: the agent takes one action and observes the resulting state, so the step count measures how much experience the agent needed to reach this level of performance.)
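Sample efficiency is typically measured by counting those steps. A generic agent-environment loop, sketched here with a hypothetical stand-in environment, counts one environment step per action taken:

```python
# One "environment step" = one action taken and one resulting observation.
# ToyEnv is a hypothetical stand-in with a fixed 10-step episode horizon.
class ToyEnv:
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0                       # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= 10              # episode ends after 10 steps
        return float(self.t), 0.0, done  # observation, reward, done

env = ToyEnv()
obs, env_steps, done = env.reset(), 0, False
while not done:
    action = 0                           # a real agent would choose an action here
    obs, reward, done = env.step(action)
    env_steps += 1                       # sample-efficiency budgets count this

print(env_steps)  # → 10
```

Needing three times fewer of these steps means the agent extracted more learning signal from each interaction, which matters enormously when interactions are expensive, as they are for real robots.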

The resulting scientific paper received a Spotlight designation going into the 2021 NeurIPS Conference. It was with this level of confidence and momentum that Garg proposed trying IQ-Learn in the MineRL challenge.

Success Without a Human in the Loop
To be sure, some of the “almost lifelike” tasks in Minecraft were difficult for Team Obsidian. At one point in the challenge, their AI bot accidentally built a skyscraper by stacking fences into a tower. It also managed to cage a villager instead of an animal. But Garg is pleased with the results. Their AI bot learned to make walls, build columns, and mount torches successfully. The first-place team overall used 82,000 human-labeled images to help recognize scenes in the game and spent about five months coding domain expertise for each task. By comparison, Garg and Mills earned their place without adding any domain knowledge to the model and with only three weeks to prepare.

“IQ-Learn is performing beyond our own expectations,” Garg says. “It’s a new paradigm for scaling intelligent machines that will be able to do everything from autonomous driving to helping provide health care.”

Someday, Garg imagines, we’ll be able to teach robots how to grasp objects in any situation simply by showing them videos of humans picking up objects, or maybe even by responding to voice commands. If we want to train agents to perceive and act in a multidimensional world, we need faster models that perform well with limited data and time. Efficiency, it seems, will determine how useful robots are in real life.
