DeepMind’s Gato is the Swiss Army Knife of AI models

The arrival of deep neural networks has been a watershed moment in artificial intelligence history. We have made huge strides in natural language understanding and object recognition in a short period. However, we don’t have AI models that do both.

Enter Gato. 

DeepMind has leveraged the advances in large-scale language modelling to build a single generalist agent beyond the scope of text outputs. Gato is a multi-modal, multi-task, multi-embodiment generalist policy: The same network with the same weights can play Atari, caption images, chat and stack blocks with a real robot arm.

How does Gato work?

To train Gato, the researchers collected data from different tasks and modalities. The data was serialized into a flat sequence of tokens, batched, and processed by a transformer neural network. “While any general sequence model can work for next token prediction, we chose a transformer for simplicity and scalability,” the researchers stated in the paper. The researchers used a 1.2-billion-parameter decoder-only transformer with 24 layers and an embedding size of 2048.
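As a rough illustration of what "serialized into a flat sequence of tokens" can mean, here is a minimal Python sketch. It is not DeepMind's code: the function names, the vocabulary size, the bin count and the mu-law constants are all assumptions chosen for the example. It maps continuous values such as joint angles into integer bins placed after the text vocabulary, then concatenates everything into one list.

```python
import math

# All constants below are illustrative assumptions, not values from the article.
TEXT_VOCAB_SIZE = 32_000  # assumed size of the text token vocabulary
NUM_BINS = 1_024          # assumed number of bins for continuous values


def mu_law(x, mu=100.0, m=256.0):
    """Compress a real value so that large magnitudes occupy less of the range."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)


def tokenize_continuous(values):
    """Map real numbers to integer tokens that sit above the text vocabulary."""
    tokens = []
    for v in values:
        squashed = max(-1.0, min(1.0, mu_law(v)))  # clip to [-1, 1]
        bin_index = min(NUM_BINS - 1, int((squashed + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def serialize_timestep(text_token_ids, observation, action):
    """Flatten one timestep (text, observation, action) into a single token list."""
    return (
        list(text_token_ids)
        + tokenize_continuous(observation)
        + tokenize_continuous(action)
    )


if __name__ == "__main__":
    print(serialize_timestep([5, 7], [0.0, 0.25], [-0.5]))
```

Because every modality ends up as integers in one shared sequence, a single next-token predictor can be trained on all of it, which is the point the paragraph above makes.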

Gato is trained on many datasets with information about agent experience in simulated and real-world environments. Natural language and image datasets were also used. 

During deployment, a prompt is tokenized to form the initial sequence. The environment then yields the first observation, which is tokenized and appended to the sequence. Next, Gato samples the action vector autoregressively, one token at a time. Once all the tokens that make up the action have been sampled, Gato decodes the action and sends it to the environment. The environment then yields a new observation, and the process repeats in a loop. “The model always sees all previous observations and actions within its context window of 1024 tokens,” the researchers said. 
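The deployment loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Gato itself: `ToyEnv`, `toy_next_token` and `run_episode` are hypothetical names, the "model" just samples random tokens, and each action is assumed to be two tokens long. Only the 1024-token context window comes from the article.

```python
import random
from collections import deque

CONTEXT_WINDOW = 1024  # context length quoted by the researchers


class ToyEnv:
    """Hypothetical environment whose observations are short token lists."""

    def __init__(self, episode_length=3):
        self.steps_left = episode_length

    def reset(self):
        return [1, 2]  # first observation, already tokenized

    def step(self, action_tokens):
        self.steps_left -= 1
        observation = [sum(action_tokens) % 10, 0]
        return observation, self.steps_left == 0


def toy_next_token(context, rng):
    """Stand-in for the transformer: returns one sampled token."""
    return rng.randrange(10)


def run_episode(env, prompt_tokens, action_length=2, seed=0):
    rng = random.Random(seed)
    # A bounded deque drops the oldest tokens once the window is full.
    context = deque(prompt_tokens, maxlen=CONTEXT_WINDOW)
    context.extend(env.reset())
    actions, done = [], False
    while not done:
        action = []
        for _ in range(action_length):  # sample the action one token at a time
            token = toy_next_token(list(context), rng)
            action.append(token)
            context.append(token)       # each sampled token conditions the next
        actions.append(action)
        observation, done = env.step(action)  # send the decoded action
        context.extend(observation)           # append the new observation
    return actions, list(context)


if __name__ == "__main__":
    actions, context = run_episode(ToyEnv(), prompt_tokens=[9, 9])
    print(len(actions), len(context))
```

Swapping `toy_next_token` for a real sequence model would leave the loop unchanged, which is why the same control code can drive very different environments.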

How does Gato stack up against other models?

The success stories of GPT-3, Gopher and Flamingo inspired the DeepMind researchers to push the limits of generalist language models and generalist visual language models.

Early this year, Google introduced the Pathways Language Model (PaLM), building on the Pathways system it had announced earlier. PaLM is a 540-billion-parameter, dense decoder-only Transformer model, and the Pathways system allowed it to be trained efficiently across multiple TPU v4 Pods. With Pathways, Google Research’s end game is to build a single model that can generalize across domains and tasks while being highly efficient. PaLM achieved state-of-the-art few-shot performance across hundreds of language understanding and generation tasks, in many cases by significant margins.

In January, Meta AI released data2vec, the first high-performance self-supervised algorithm for multiple modalities. data2vec outperformed the previous best single-purpose algorithms for computer vision and speech and was competitive on NLP tasks. The algorithm marks a paradigm shift in holistic self-supervised learning and brings us closer to building machines that can make sense of the world. 

DeepMind’s Gopher is a 280-billion-parameter NLP model based on the Transformer architecture and trained on 10.5TB of MassiveText. Gopher surpassed the current state-of-the-art on 100 evaluation tasks. The model was also tested on NLP benchmarks, including the Massive Multitask Language Understanding (MMLU) and BIG-bench, and the performance was compared to other baseline models. Gopher showed steady improvement on knowledge-intensive tasks but not so much on reasoning-heavy tasks.

In the same league as Gopher, Google’s Generalist Language Model (GLaM) is a trillion-weight model that achieves competitive performance on multiple few-shot learning tasks. GLaM is a mixture-of-experts model, with different submodels specialized for different inputs. GLaM was on par on seven tasks while using 5x less computation during inference. The tasks included open domain question answering, commonsense reading, in-context reading comprehension, the SuperGLUE tasks and natural language inference.
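To make the mixture-of-experts idea concrete, here is a small Python sketch of a gated layer. It is a generic illustration and not GLaM's implementation: the top-2 routing, the function names and the toy experts are assumptions for the example. A gate scores every expert for a given input, only the best-scoring few are actually run, and their outputs are blended by the renormalized gate weights.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, gate_scores, experts, top_k=2):
    """Run only the top_k experts for input x and mix their outputs."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


if __name__ == "__main__":
    # Three toy "experts"; in a real model each would be a neural sub-network.
    experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x]
    output, chosen = moe_layer(3.0, gate_scores=[2.0, 1.0, -5.0], experts=experts)
    print(chosen, output)
```

Because the unselected experts are never evaluated, the total parameter count can grow much faster than the computation spent on any one input.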