A 2021 NLP Retrospective

’Tis the season to review some of the year’s highlights in Natural Language Processing

Image by Lotus Head; licensed under Creative Commons Attribution 3.0

Much has happened in the field of Natural Language Processing (NLP) in the past year and I wanted to take some time and reflect on some of my personal highlights.

Let’s start off with a fun one: Can a Fruit Fly Learn Word Embeddings? This paper investigates the relationship between biology and neural networks. While taking high-level inspiration from biology, the current generation of deep learning methods is not necessarily biologically realistic. This raises the question of whether biological systems can further inform the development of new network architectures and learning algorithms that lead to competitive performance on machine learning tasks or offer additional insights into intelligent behaviour.

To do so, the researchers use a simulated brain of a fruit fly, one of the best-studied networks in neuroscience. And they were able to show that, surprisingly, this network can indeed learn the correlations between words and their context and produce high-quality word embeddings.

Tracking progress in Natural Language Generation (NLG) is tricky, because by their very nature, NLG tasks don’t have a fixed definition of correct vs incorrect. To overcome this challenge and track progress in NLG models, a global project involving 55 researchers from 44 institutions proposed GEM (Generation, Evaluation, and Metrics), a living benchmark environment for NLG with a focus on evaluation.

The GEM project’s ultimate goal is to enable in-depth analysis of data and models rather than focusing on a single leaderboard score. By measuring NLG progress across 13 datasets spanning many NLG tasks and languages, it’s hoped the GEM benchmark could also provide standards for future evaluation of generated text using both automated and human metrics.

The researchers have opened the project to the NLG research community, and senior members will be available to help newcomers contribute. The GEM benchmark is at gem-benchmark.com and more information can also be found on the Dataset Hub on Hugging Face.
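
If you want to explore the data yourself, the GEM tasks can be loaded through the Hugging Face datasets library. Here is a minimal sketch, assuming the library is installed and using "common_gen" as one example configuration:

```python
# Minimal sketch: loading one GEM task via the Hugging Face `datasets` library.
# "common_gen" is just one example configuration; see the GEM page on the
# Hugging Face Hub for the full list of tasks.
from datasets import load_dataset

common_gen = load_dataset("gem", "common_gen", split="validation")

print(common_gen.features)  # inspect the fields of this task
print(common_gen[0])        # one input/reference example
```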

Disclaimer: I’m 100% biased on this next one as I work for AWS, but I honestly think it’s very cool 🙂

The partnership between Hugging Face and AWS has literally changed my work. I’m sure that no one reading this blog needs an introduction to Hugging Face. The partnership that was announced in March of this year introduced new Hugging Face Deep Learning Containers (DLCs) that make it easier than ever to train and deploy Hugging Face Transformer models in Amazon SageMaker.

This amazing GitHub repository by Philipp Schmid lets you try out all the new features, from distributed training to model deployment & auto-scaling.
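
To give a flavour of what working with the DLCs looks like, here is a minimal, hypothetical training sketch with the SageMaker Python SDK. The entry-point script, instance type, framework versions and hyperparameters are placeholders; the repository above contains complete, tested examples.

```python
# Minimal, hypothetical sketch of launching a Hugging Face training job on
# Amazon SageMaker. Script name, role, framework versions and hyperparameters
# are placeholders; see Philipp Schmid's repository for complete notebooks.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # assumes you are running inside SageMaker

huggingface_estimator = HuggingFace(
    entry_point="train.py",            # your own training script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.12",       # pick versions supported by the DLCs
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters={"epochs": 3, "model_name": "distilbert-base-uncased"},
)

# Start training on data previously uploaded to S3.
huggingface_estimator.fit({"train": "s3://my-bucket/path/to/train"})
```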

The departments of chemistry and physics at Cambridge University published an extraordinary paper in April in which they describe how they trained a language model of a different kind.

The researchers have used sequence embedding, a well-known NLP technique, to convert protein sequences into 200-dimensional embedding vectors. And in case you were wondering, 200 dimensions is indeed considered a low-dimensional representation of such complex information! This technique allowed the teams to train a language model that outperformed several existing machine learning methods for predicting protein liquid–liquid phase separation (LLPS) using publicly available datasets.
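
The paper’s exact pipeline is beyond the scope of this post, but to illustrate the general idea of sequence embeddings, here is a rough sketch (not the authors’ method) that treats overlapping k-mers of a protein sequence as "words" and trains a 200-dimensional word2vec model on them with gensim:

```python
# Rough illustration of the general idea of sequence embeddings (NOT the
# authors' method): treat overlapping k-mers of a protein sequence as "words"
# and train a 200-dimensional word2vec model on them with gensim.
from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    """Split an amino-acid sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy, made-up sequences; real work would use the public datasets the paper cites.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"]
corpus = [to_kmers(seq) for seq in sequences]

model = Word2Vec(sentences=corpus, vector_size=200, window=5, min_count=1, epochs=50)
print(model.wv["MKT"].shape)  # (200,) -- one embedding per k-mer
```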

Now, I’m not going to pretend that I understand what LLPSs are, but from my understanding they are fundamental to understanding the molecular grammar of proteins and spotting potential mistakes. It could be the first step in a breakthrough for research on cancer and neurodegenerative diseases like Alzheimer’s, Parkinson’s and Huntington’s.

I’m pretty sure you have, at some point, tried starting a proper conversation with one of your smart home assistants. I know I have and it never carried on for that long. The assistant usually wasn’t able to carry the context of the conversation beyond one or two exchanges and the attempt usually ended with a frustrating “I’m not sure I understand this one”.

At Google I/O in May this year, the company announced its latest advancement in the area of conversational AI, LaMDA (Language Model for Dialogue Applications). It is a conversational language model that seems to be able to carry on conversations for much longer. The demos in which they talked to Pluto and a paper plane certainly were impressive. Kudos also to the fact that they disclosed that it is still early days and pointed out some of the limitations of the model. I do hope that at some point Google releases a version to play around with.

If you like stories in which the underdog takes on the powerful incumbent, this one might be for you:

Image by author

This is an exchange between Connor Leahy and Leo Gao that started EleutherAI, a decentralized grass-roots collective of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open-source AI research. Founded in July 2020, their flagship project is the GPT-Neo family of models, designed to replicate OpenAI’s GPT-3. Their Discord server is open and welcomes contributors.

In June they released their latest model, GPT-J, which has 6 billion parameters, compared to GPT-3’s 175 billion. Despite being much smaller, GPT-J outperforms its big cousin in specialised tasks such as writing code.
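
The weights are openly available on the Hugging Face Hub, so you can try GPT-J yourself. Here is a minimal sketch with the transformers library (be warned: the checkpoint is large and needs a lot of memory):

```python
# Minimal sketch: running GPT-J-6B through the transformers library.
# The checkpoint is a large download and needs plenty of RAM (or a big GPU).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```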

I find this trend highly encouraging and am excited to see what comes next from EleutherAI.

In July the New Yorker published an article about biases in language models. This wasn’t a new topic amongst the NLP community. However, for a magazine like the New Yorker to pick up a topic like this highlights the importance of, and concerns about, modern NLP models. It reminded me of the Guardian article about GPT-3 in 2020, another moment when a niche topic was picked up by mainstream media.

The New Yorker article focuses on how language models are a reflection of our language and, ultimately, of ourselves. This line in particular stuck with me: “We are being forced to confront fundamental mysteries of humanity as technical issues: how little we know about the darkness in our hearts, and how faint our control over that darkness is.”

This next story strikes a chord similar to the New Yorker article, because in August Margaret Mitchell joined Hugging Face. Mitchell was a Google researcher on ethical AI until she was let go in February 2021. She had co-authored (under the pseudonym Shmargaret Shmitchell) a paper on the costs and risks associated with large NLP models:

We have identified a wide variety of costs and risks associated with the rush for ever larger LMs, including: environmental costs (borne typically by those not benefiting from the resulting technology); financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques; opportunity cost, as researchers pour effort away from directions requiring less resources; and the risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should humans encounter seemingly coherent LM output and take it for the words of some person or organization who has accountability for what is said.

I was very happy to see Mitchell join Hugging Face, a company that drives open-source machine learning and fosters a thriving community. If you would like to know more about her work at Hugging Face, check out her video about the values to keep in mind when developing a Machine Learning project.

Speaking of open-source NLP, Explosion has had a great year, too. This is the company behind spaCy, one of the most popular NLP libraries. And in September they raised $6 million in a Series A funding round at a $120 million valuation.

I have to admit, I haven’t kept up-to-date with spaCy in 2021. I was mainly focused on upskilling myself on the Transformers library. And so, I was quite surprised to see all the new features that spaCy released earlier this year with spaCy 3.0. I will definitely turn my attention to spaCy again in 2022.
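
As a small taste of spaCy 3.0, here is a minimal sketch using one of its new transformer-based pipelines (assuming you have downloaded the en_core_web_trf model first):

```python
# Minimal sketch: spaCy 3.0's transformer-based English pipeline.
# Assumes the model has been downloaded beforehand:
#   python -m spacy download en_core_web_trf
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Explosion raised a Series A round in September 2021.")

for ent in doc.ents:
    print(ent.text, ent.label_)
```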

And Explosion not only provides one of the most popular NLP libraries, it also created Prodigy, a modern annotation tool. This is significant because one way to create better models is to create better training data in the first place, and this is where data annotation tools come in handy.

And seeing a woman (Ines Montani) being the CEO of an AI company is a nice change of pace 🙂

The NLP Summit 2021 took place in October. This conference showcases NLP best practices, real-world case studies, challenges in applying deep learning and transfer learning in practice, and the latest open-source libraries, models and transformers you can use today.

Many speakers well known in NLP circles presented at this conference, and some of the highlights were:

  • Why & how to care about NLP ethics?
  • Extreme Summarization of Scientific Documents
  • Leveraging AI for Recruitment towards Economic Recovery

You can still access all the talks on-demand on their website.

Hugging Face had quite the year, and I have to mention them one more time. In November the company published the second part of their course that helps you get started quickly with state-of-the-art NLP models. The course takes you on a journey, starting with the high-level Pipeline API that lets you leverage NLP technology with just two lines of code. It then gradually dives deeper into the Transformers stack, and before you realise it you will have created your own causal language model from scratch.
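
Those two lines look roughly like this, using the sentiment-analysis task as an example:

```python
# The high-level pipeline API the course starts with, shown here for the
# sentiment-analysis task; the input sentence is just an example.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I thoroughly enjoyed the second part of the course!"))
```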

The launch of the second part in November was accompanied by a series of lectures and talks, which you can find here.

Very fittingly, the last item in this list also provides an outlook of things to come in the NLP space. In December, Louisa Xu published her article about the Golden Age For Natural Language in Forbes.

It’s a great piece, featuring three of the currently most influential NLP companies. And her summary and outlook are so well put that I will just let her speak for herself:

Every company that derives value from language stands to benefit from NLP, the branch of machine learning that has the most transformative potential. Language is the lowest common denominator in almost all of our interactions, and the ways in which we can capture value from language has changed dramatically over the last three years. Recent advancements in NLP have outsized potential to accelerate business performance. It even has the promise of bringing trust and integrity back to our online interactions. Large incumbents have been the first to jump onboard, but the real promise lies in the next wave of NLP applications and tools that will translate the hype around artificial intelligence from ideology into reality.

So, there you have it: these are my personal highlights of 2021 in NLP. I hope you enjoyed this summary, and it would be great to hear about your personal highlights from the past 12 months in NLP. Please comment on this blog post or reach out directly.