AI Makes Strides in Virtual Worlds More Like Our Own

That’s not to say the work is finished. “It’s much less real than the real world, even the best simulator,” said Daniel Yamins, a computer scientist at Stanford University. With colleagues at MIT and IBM, Yamins co-developed ThreeDWorld, which puts a strong focus on mimicking real-life physics in virtual worlds — things like how liquids behave and how some objects are rigid in one area and soft in others.

“This is really hard to do,” said Savva. “It’s a big research challenge.”

Still, it’s enough for AI agents to start learning in new ways.

Comparing Neural Networks

So far, one easy way to measure embodied AI’s progress is to compare embodied agents’ performance to that of algorithms trained on simpler, static image tasks. Researchers note these comparisons aren’t perfect, but early results do suggest that embodied AI agents learn differently — and at times better — than their forebears.

In one recent paper, researchers found an embodied AI agent was more accurate at detecting specified objects, improving on the traditional approach by nearly 12%. “It took the object detection community more than three years to achieve this level of improvement,” said Roozbeh Mottaghi, a co-author and a computer scientist at the Allen Institute for AI. “Simply just by interacting with the world, we managed to gain that much improvement,” he said.

Other papers have shown that object detection improves among traditionally trained algorithms when you put them into an embodied form and allow them to explore a virtual space just once, or when you let them move around to gather multiple views of objects.

Researchers are also finding that embodied and traditional algorithms learn fundamentally differently. For evidence, consider the neural network — the essential ingredient behind the learning abilities of every embodied and many nonembodied algorithms. A neural network is a type of algorithm built from many layers of interconnected artificial neurons, loosely modeled after the networks in human brains. In two separate papers, one led by Clay and the other by Grace Lindsay, an incoming professor at New York University, researchers found that the neural networks in embodied agents had fewer neurons active in response to visual information, meaning that each individual neuron was more selective about what it would respond to. Nonembodied networks were much less efficient, requiring many more neurons to be active most of the time. Lindsay’s group even compared the embodied and nonembodied neural networks to neuronal activity in a living brain — a mouse’s visual cortex — and found that the embodied versions were the closest match.
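To make the sparsity finding concrete, here is a minimal sketch, not code from either paper, of how activation sparsity can be measured: feed images through a network and count what fraction of its neurons fire. The toy network, layer sizes, and measurement below are hypothetical stand-ins; the studies described above used far larger vision models.

```python
# Hypothetical sketch, not code from either paper: one way to quantify how
# "selective" a network's neurons are is activation sparsity -- the fraction
# of units that fire for a given input.
import torch
import torch.nn as nn

class TinyVisionNet(nn.Module):
    """A toy stand-in for a vision network; the layer sizes are arbitrary."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64 * 3, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.features(x)

def active_fraction(model: nn.Module, images: torch.Tensor) -> float:
    """Mean fraction of the final layer's units that fire per image.

    Lower values mean sparser, more selective responses -- the pattern
    the studies above reported for embodied agents' networks.
    """
    with torch.no_grad():
        acts = model(images)                 # shape: (batch, units)
        return (acts > 0).float().mean().item()

images = torch.rand(32, 3, 64, 64)           # a fake batch of images
print(f"active fraction: {active_fraction(TinyVisionNet(), images):.2f}")
```

In this framing, an embodied agent’s network would be expected to report a lower active fraction than a nonembodied one on the same inputs.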

Lindsay is quick to point out that this doesn’t necessarily mean the embodied versions are better; they’re just different. Unlike the object detection papers, which compared performance on the same task, Clay’s and Lindsay’s work compared the underlying neural networks of agents doing completely different tasks, so the agents may simply need networks that work differently to accomplish their different goals.

But while comparing embodied neural networks to nonembodied ones is one measure of progress, researchers aren’t really interested in improving embodied agents’ performance on current tasks; that line of work will continue separately, using traditionally trained AI. The true goal is to learn more complicated, humanlike tasks, and that’s where researchers have been most excited to see signs of impressive progress, particularly in navigation tasks. Here, an agent must remember the long-term goal of its destination while forging a plan to get there without getting lost or walking into objects.

In just a few years, a team led by Dhruv Batra, a research director at Meta AI and a computer scientist at the Georgia Institute of Technology, rapidly improved performance on a specific type of navigation task called point-goal navigation. Here, an agent is dropped into a brand-new environment and must navigate to target coordinates relative to the starting position (“Go to the point that is 5 meters north and 10 meters east”) without a map. By giving the agent a GPS and a compass, and training it in Meta’s virtual world, called AI Habitat, “we were able to get greater than 99.9% accuracy on a standard data set,” said Batra. And this month, the team successfully expanded the results to a more difficult and realistic scenario where the agent doesn’t have GPS or a compass. The agent reached 94% accuracy purely by estimating its position based on the stream of pixels it sees while moving.
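For a sense of what the agent computes, here is an illustrative sketch, not AI Habitat’s actual API, of the GPS-and-compass version of the task: given its position and heading, the agent can recompute the distance and direction to the goal at every step. The coordinate convention is assumed.

```python
# Illustrative sketch, not AI Habitat's actual API: in point-goal navigation
# with GPS and compass, the goal is given relative to the start position, and
# the agent can recompute an egocentric goal vector at every step.
import math

def egocentric_goal(agent_xy, agent_heading, goal_xy):
    """Return (distance, angle) to the goal in the agent's own frame.

    agent_xy, goal_xy: (x, y) positions from the simulated GPS.
    agent_heading: heading in radians from the simulated compass.
    """
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    # Rotate the world-frame offset into the agent's frame.
    angle = math.atan2(dy, dx) - agent_heading
    # Wrap to [-pi, pi) so "turn left" vs. "turn right" stays well defined.
    angle = (angle + math.pi) % (2 * math.pi) - math.pi
    return distance, angle

# "Go to the point that is 5 meters north and 10 meters east"
# (assuming x = east, y = north, heading measured from the x-axis):
print(egocentric_goal(agent_xy=(0.0, 0.0), agent_heading=0.0, goal_xy=(10.0, 5.0)))
# -> about 11.2 meters away, 0.46 radians to the agent's left
```

In the harder scenario Batra’s team tackled this month, the agent gets no such readings and must effectively estimate its own position and heading from the pixels it sees.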

“This is fantastic progress,” said Mottaghi. “However, this does not mean that navigation is a solved task.” In part, that’s because many other types of navigation tasks that use more complex language instructions, such as “Go past the kitchen to retrieve the glasses on the nightstand in the bedroom,” remain at only around 30% to 40% accuracy.

But navigation still represents one of the simplest tasks in embodied AI, since the agents move through the environment without manipulating anything in it. So far, embodied AI agents are far from mastering any tasks with objects. Part of the challenge is that when the agent interacts with new objects, there are many ways it can go wrong, and mistakes can pile up. For now, most researchers get around this by choosing tasks with only a few steps, but most humanlike activities, like baking or doing the dishes, require long sequences of actions with multiple objects. To get there, AI agents will need a bigger push.

Here again, Li may be at the forefront, having developed a data set that she hopes will do for embodied AI what her ImageNet project did for AI object recognition. Where she once gave the AI community a huge data set of labeled images that standardized input data across labs, her team has now released a standardized simulated data set of 100 humanlike activities for agents to complete, which can be tested in any virtual world. By creating metrics that compare agents doing these tasks to real videos of humans doing the same tasks, Li’s new data set will allow the community to better evaluate the progress of virtual AI agents.

Once the agents are successful on these complicated tasks, Li sees the purpose of simulation as training for the ultimate maneuverable space: the real world.

“Simulation is one of the most, in my opinion, important and exciting areas of robotic research,” she said.

The New Robotic Frontier

Robots are, inherently, embodied intelligence agents. By inhabiting some type of physical body in the real world, they represent the most extreme form of embodied AI agents. But many researchers are now finding that even these agents can benefit from training in virtual worlds.

“State-of-the-art algorithms [in robotics], like reinforcement learning and those types of things, usually require millions of iterations to learn something meaningful,” said Mottaghi. As a result, training real robots on difficult tasks can take years.
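Some rough arithmetic, with assumed and purely illustrative numbers, shows why those millions of iterations push researchers toward simulators:

```python
# Back-of-the-envelope sketch (hypothetical numbers): if an RL agent needs
# ~10 million environment steps, a real robot taking ~3 seconds per step
# needs nearly a year of nonstop operation, while a simulator running at
# thousands of steps per second finishes in hours.
SECONDS_PER_REAL_STEP = 3.0    # assumed: physical action + reset time
SIM_STEPS_PER_SECOND = 2_000   # assumed: throughput of a fast simulator
TOTAL_STEPS = 10_000_000       # typical order of magnitude for deep RL

real_days = TOTAL_STEPS * SECONDS_PER_REAL_STEP / 86_400
sim_hours = TOTAL_STEPS / SIM_STEPS_PER_SECOND / 3_600

print(f"real robot: ~{real_days:,.0f} days")   # ~347 days
print(f"simulator:  ~{sim_hours:,.1f} hours")  # ~1.4 hours
```

And that is for a single training run; tuning an algorithm typically requires many such runs, which compounds the gap.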