Project Spock at Tubi: Understanding Content Using Deep Learning for NLP

Authors: John Trenkle, Jaya Kawale and the Tubi ML Team

Understanding the nuanced relationships between movies, TV shows, viewers, actors, and genres is like trying to understand a complex tapestry of fractals: it’s tricky, but there is structure there. (Illustration 118225042 © Keilaneokow | Dreamstime.com)

In this blog series, we aim to highlight the nuances of Machine Learning in the Ad-based Video on Demand (AVOD) space as practiced at Tubi. Machine Learning helps solve myriad problems involving recommendations, content understanding, and ads. We use PyTorch extensively for several of these use cases because it gives us the flexibility, computational speed, and ease of implementation needed to train large-scale deep neural networks on GPUs.

With 33 million active monthly users and over 2.5 billion hours of content watched last year, Tubi is one of the leading platforms providing free high-quality streaming movies and TV shows to a worldwide audience. We have curated the largest catalog of premium content in the streaming industry including popular titles, great horror, and nostalgic favorites. To maintain and grow our enthusiastic audience and expanding catalog, we leverage information from our platform combined with a selection of trusted publicly-available sources in order to understand not only what our current audience wants to watch now, but also what our expanding audience wants to watch next. Viewers can watch Tubi on dozens of devices, sign in, and have a seamless viewing experience with relevant ads presented at half the load of cable.

In our last Medium post, we discussed how Tubi uses Machine Learning. To recap, Tubi embraces a data-driven approach and is on a constant mission to explore the ever-growing universe of Machine Learning, Deep Learning, Natural Language Processing (NLP), and Computer Vision.

Tubi uses an Advertising-based Video on Demand (AVOD) business model. It’s similar to subscription-based Video on Demand (SVOD) services, except it’s free because viewers see minimal commercials. This makes a big difference in the problems we tackle and how we use Machine Learning (ML) to solve them.

In our previous post, we also talked in-depth about the three pillars of AVOD that guide our work. Today, we’ll talk about how we use ML to improve our content — Tubi’s library of TV shows and movies. Our main focus is to better understand our content, and feed those insights back into X teams.

Content Understanding revolves around digesting data on movies and TV shows hosted on Tubi. This data comes in structured (metadata) and unstructured (text) forms. Once we collect that data, we develop representations that capture the essence of those movie and TV show titles. To borrow an analogy from linear algebra, we are attempting to project title vectors from the universe of all titles onto our Tubi library with as much fidelity as possible, in order to ascertain the potential value of each title for each target use case.

Let’s look at what information we have for an individual movie or TV show title. Generally, for every piece of content on Tubi, we have text, video, and image data from various partner sources like IMDb, Gracenote, and Rotten Tomatoes. We also have complementary data from sources such as Wikipedia.

Industry-Standard Data Sources for Media

The data we collect includes information like genre, language, year of release, plot synopsis and summary, ratings, Rotten Tomatoes scores, user reviews, box office information, and more. This vast amount of data helps us understand not only the content, but also users’ preferences for different movie or TV show titles. One of the major goals of any VOD service is to have relevant titles appear on the user’s home page. If the title suggestions are relevant, viewers will be eager to watch them, and they’ll come back for more.

Illustrating the rich information that can be leveraged per Title

There are a few main ways we use this data to create better content for our users.

Deciding which movies or TV shows to buy

If we better understand how our users engage with TV or film titles, we can help Tubi’s content team determine which titles to buy next. We pose the task of assessing the potential value of all titles in the universe with respect to how they might perform on Tubi as one of lead discovery. The process is as follows:

  1. Determine a relative measure of the historical performance of all titles that have ever played on Tubi. Threshold this at the desired level to give the dependent variable.
  2. Leverage all metadata and embeddings for the shows as the independent variables.
  3. Use binary classification modeling to predict which titles would be high-performers.
  4. Evaluate all titles in our database using this model.
  5. Rank them by the confidence that they are high performers.
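The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Tubi’s pipeline: the two features per title, the synthetic performance scores, the median threshold, the candidate names, and the hand-rolled logistic regression are all hypothetical stand-ins for the real metadata, embeddings, and classifier.

```python
import math
import random

def predict_proba(w, b, x):
    """Probability that a title with features x is a high performer."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression with batch gradient descent; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

random.seed(0)

# Step 2: independent variables (hypothetical 2-d features per played title).
played_features = [[random.random(), random.random()] for _ in range(200)]

# Step 1: a synthetic historical performance measure, thresholded at the median
# to produce the binary dependent variable.
performance = [2.0 * f[0] - f[1] + random.gauss(0.0, 0.1) for f in played_features]
threshold = sorted(performance)[len(performance) // 2]
labels = [1.0 if p > threshold else 0.0 for p in performance]

# Step 3: binary classification model.
w, b = train_logistic(played_features, labels)

# Step 4: evaluate every title in the (hypothetical) database.
candidates = {
    "title_A": [0.9, 0.1],
    "title_B": [0.1, 0.9],
    "title_C": [0.5, 0.5],
}
scores = {title: predict_proba(w, b, feats) for title, feats in candidates.items()}

# Step 5: rank by the model's confidence that a title is a high performer.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In practice the threshold in step 1 is a business choice; moving it changes how selective the resulting list of leads is.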

Not all titles can be blockbusters, and these ML-based suggestions tend to surface lesser-known titles so that users always have enough shows they enjoy to keep watching. A good content library can lure viewers who come for a blockbuster movie but stick around to check out a new TV show or film they may not have discovered otherwise.

We also use our content understanding to facilitate the cold starting of titles that have never played on Tubi. To do this, we combine measures of popularity and performance from our huge bank of metadata for universal titles to rank new titles starting at the beginning of a month. We need to guarantee that highly-ranked shows get the attention they deserve. We then use PyTorch to build a fully-connected network that allows us to map from a high-dimensional embedding space that captures relationships from metadata and text narratives to the collaborative filtering model in Tubi’s recommendation system. We call this process beaming. We use beamed embeddings to understand which viewers may be interested in one of the new titles and to place that title adjacent to an existing title that is very similar in the beamed space.
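To make the idea of beaming more concrete, here is a minimal PyTorch sketch of a fully-connected network that maps content embeddings into a collaborative filtering space. Everything specific in it is an assumption for illustration: the dimensions, the synthetic tensors standing in for real embeddings, the layer sizes, the mean-squared-error objective, and the cosine-similarity lookup are not taken from Tubi’s implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)

CONTENT_DIM, CF_DIM = 64, 16  # hypothetical embedding sizes

# Synthetic stand-ins for titles that have already played on the platform:
# their content embeddings and the collaborative-filtering (CF) embeddings
# learned for them by the recommender.
content = torch.randn(512, CONTENT_DIM)
hidden_map = torch.randn(CONTENT_DIM, CF_DIM) / CONTENT_DIM ** 0.5
cf_target = torch.tanh(content @ hidden_map)

# Fully-connected "beamer": content space -> CF space.
beamer = nn.Sequential(
    nn.Linear(CONTENT_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, CF_DIM),
)
optimizer = torch.optim.Adam(beamer.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(beamer(content), cf_target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Beam a title that has never played, then find the most similar existing
# title in the beamed (CF) space.
with torch.no_grad():
    new_title = torch.randn(1, CONTENT_DIM)
    beamed = beamer(new_title)
    similarity = nn.functional.cosine_similarity(beamed, cf_target)  # one score per existing title
    nearest = int(similarity.argmax())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}, nearest title index {nearest}")
```

The nearest existing title found this way is one plausible way to decide where a new title could be placed alongside similar content.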

Several of the applications tackled by Project Spock can be seen in this figure:

Use cases for Content Understanding in Tubi’s Ecosystem

Project Spock is our code name for the umbrella program that tackles the myriad challenges that crop up in the content understanding ecosystem. One of the main goals of the project is to understand the textual data using Natural Language Processing (NLP) techniques. NLP helps computers understand natural language in order to perform useful tasks such as answering questions. It spans many tasks, from simple keyword search and review classification to topic extraction, embedding generation, semantic analysis, machine translation, and question answering. Needless to say, NLP is hard because there is a lot of ambiguity in representation and learning.

Project Spock supports a platform for data ingestion, preprocessing and cleaning. The data is ingested via first and third-party sources. Once the data is ingested, it goes through several cleaning steps. Data preprocessing is an important step in machine learning, especially for NLP. Apart from cleaning the data via typical preprocessing algorithms such as stemming and removing stop words, punctuation, HTML tags, emojis and numbers, we also have to clean user-generated content like reviews. One of the problems with reviews is that they may contain grammatical errors and oftentimes may reflect only the sentiment of the users such as “I hated this movie” with no narrative aspects. For some use cases such as determining if a particular audience segment may be interested in a title, the signal of sentiment can be leveraged. For other applications such as search, this type of text has no value.
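As a rough illustration of the cleaning steps mentioned above, the sketch below strips HTML tags, punctuation, emojis, and numbers, removes stop words, and applies a very crude stemmer. The stop-word list and the suffix-stripping rule are tiny hypothetical stand-ins; a production pipeline would typically rely on an established stemmer and a fuller stop-word list.

```python
import html
import re

# Deliberately tiny, illustrative stop-word list.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "this", "it", "i", "me"}

TAG_RE = re.compile(r"<[^>]+>")           # HTML tags
NON_LETTER_RE = re.compile(r"[^a-z\s]")   # punctuation, digits, emojis, symbols

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean_text(raw):
    """Return a list of cleaned, stemmed tokens for one piece of text."""
    text = html.unescape(raw)             # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

review = "<p>Watched it twice!!! 😍 The ending &amp; the 2 twists shocked me.</p>"
tokens = clean_text(review)
print(tokens)  # ['watch', 'twice', 'end', 'twist', 'shock']
```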

Tubi’s ML Team believes that there is no single algorithm or representation, or combination thereof, that is the best solution for all tasks, which is a high-level interpretation of the No Free Lunch Theorem. In that vein, Project Spock aspires to support a wide range of composable features (families of embeddings as well as raw metadata and value metrics) to increase the likelihood of finding robust solutions to our myriad problems through a process of discovery. The platform maintains a variety of embeddings that power different use cases across the product. Embeddings are the lifeblood of modern NLP and deserve their own section.

Deeper Dive on Embeddings

Let’s focus on the task of embedding generation using mathematical and NLP techniques. There are many eloquent and beautifully illustrated descriptions of each of the algorithms that we’ll highlight here — a cursory search will reveal thousands. We’ll attempt to succinctly describe the algorithms, the rapid evolution of embedding approaches over the last decade, and how these serve as the building blocks for robust solutions to AVOD use cases.

Word vectors are the most common representations seen in NLP tasks. One of the simplest is the one-hot vector: given a vocabulary, each word is a long vector with one slot per vocabulary word, holding a ‘1’ in that word’s designated slot and ‘0’ everywhere else. A sentence is then a vector with a ‘1’ in the slot of every word it contains. This is a sparse representation with many limitations. One big limitation is the curse of dimensionality: the length of the vector scales linearly with the size of the vocabulary. Another problem is that the semantics are lost in the representation: words that are close in meaning are not close in vector space.
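A toy example makes both limitations visible. The five-word vocabulary below is hypothetical; the point is that the vectors grow with the vocabulary and that any two distinct words have a dot product of zero, no matter how related they are.

```python
vocabulary = ["alien", "cowboy", "detective", "ranch", "spaceship"]
slot = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector with a 1 in the word's slot and 0 elsewhere."""
    vec = [0] * len(vocabulary)
    vec[slot[word]] = 1
    return vec

def sentence_vector(words):
    """Vector with a 1 in the slot of every word present in the sentence."""
    vec = [0] * len(vocabulary)
    for word in words:
        vec[slot[word]] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cowboy = one_hot("cowboy")                       # [0, 1, 0, 0, 0]
ranch = one_hot("ranch")                         # [0, 0, 0, 1, 0]
sentence = sentence_vector(["cowboy", "ranch"])  # [0, 1, 0, 1, 0]

# "cowboy" and "ranch" are related in meaning, yet their one-hot vectors
# are exactly as dissimilar as "cowboy" and "spaceship".
print(dot(cowboy, ranch), dot(cowboy, one_hot("spaceship")))  # 0 0
```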

An alternative approach is to use dense representations, with a real number in each slot, which are useful for capturing the semantics of rich text. The word2vec algorithm was revolutionary for word embedding. It is a simple two-layer neural network trained to reconstruct linguistic contexts of words that yields powerful feature vectors. In a similar vein, the GloVe algorithm produces word vectors by training on word co-occurrence statistics. These seminal techniques have fueled a great deal of NLP work over the past several years and are still the de facto first choice when starting a project.

Doc2Vec is a step up from word-level models. It expands and builds upon word2vec by adding another layer in the shallow network. It is then tasked with aggregating word vectors into document vectors so as to holistically capture the essence of a sentence, paragraph, or document nonlinearly without resorting to word vector averaging. This method can be used to characterize movies based on text summaries and narratives or for representing the salient features of a collection of shows.

As an aside, the geometric properties of embedding spaces that allow them to complete analogies such as the famous “Man is to Woman as King is to ______” are well known. Let’s look at something similar in our Doc2Vec-based embedding, where the vectors represent movies. The examples below show the addition of embeddings and the closest title after the addition. We see very interesting patterns from these additions, which leads us to believe that the embeddings capture the semantic relationships between titles.

Example 1: Combining Contextual Embeddings on the left; the most similar titles on the right
Example 2: Combining Contextual Embeddings on the left; the most similar titles on the right
Example 3: Combining Contextual Embeddings on the left; the most similar titles on the right
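The mechanics behind examples like these can be sketched as follows: add two title vectors, then look for the remaining title with the highest cosine similarity to the sum. The title names and three-dimensional vectors are made up for illustration; real Doc2Vec embeddings have far more dimensions and are learned from text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical title embeddings; the axes loosely mean "space", "western", "comedy".
titles = {
    "space_drama": [0.9, 0.1, 0.0],
    "classic_western": [0.1, 0.9, 0.0],
    "slapstick_comedy": [0.0, 0.1, 0.9],
    "space_western": [0.7, 0.7, 0.1],
    "western_comedy": [0.1, 0.7, 0.7],
}

def closest_to_sum(first, second):
    """Add two title embeddings and return the most similar other title."""
    query = [a + b for a, b in zip(titles[first], titles[second])]
    others = {name: vec for name, vec in titles.items() if name not in (first, second)}
    return max(others, key=lambda name: cosine(query, others[name]))

print(closest_to_sum("space_drama", "classic_western"))       # space_western
print(closest_to_sum("classic_western", "slapstick_comedy"))  # western_comedy
```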

The recent wave of significantly more powerful approaches to text embedding generation leverages variants of transformers. Bidirectional Encoder Representations from Transformers (BERT) and its numerous variants are the most commonly deployed models of this type. These are true Deep Learning models that create deep bidirectional representations from unlabeled text by jointly conditioning on both the left and right context. A key differentiator between these and earlier models is that there is no dictionary of word vectors underlying the models: vectors are specific to the context in which they are seen, and there are no global word vectors. This group of embedding generators can capture the spirit of titles with appreciably more precision for downstream jobs than shallow techniques.

The evolutionary path from simple embedding techniques to more sophisticated and powerful language models

While many of these approaches were originally implemented and developed in straight C code, today modules exist in other languages and frameworks, including PyTorch. Thus, one can work in a single consistent ecosystem while exploring the breadth and depth of the algorithmic terrain. That is not to say each of these modules is the best in class (the No Free Lunch Theorem applies here too), but many are, and the remainder are usually good enough to support experimentation.

In addition to these mostly text-centric methods, we also utilize Neural Collaborative Filtering, Autoencoders, and Single- and Multi-task Deep Neural Networks to create families of highly descriptive embeddings. We have christened this collection of embeddings the Embedding Menagerie. Members of this group can be seen in the following figure:

Embedding Menagerie: The collection of embeddings that support Project Spock

PyTorch supports the construction of many of these types of embeddings along with pre-trained models, which is extremely useful since it is always necessary to do some level of fine-tuning or customization to one’s domain. We also use Hugging Face, which provides access to state-of-the-art NLP models via PyTorch.

Project Spock leverages embeddings created using virtually all of these approaches to tackle different problems. Many of these representations use the superset of available data and can thus create embeddings for the approximately one million titles that we track. The textual data poses many challenges, and we learned several lessons while incorporating it into embeddings. For example, not all text is the same: reviews are very different from subtitles, which are very different from plot summaries. Different tasks therefore require us to focus on different texts. For example, we might extract sentiment from reviews and combine it with Rotten Tomatoes scores to yield high-coverage metrics that help determine which titles that have never played on Tubi are most likely to be well received by one of our existing audience segments. We also learned that averaging embeddings, while widely used to summarize information, may not always be the right solution. For example, when attempting to capture the essence of a single film or series based on a large corpus of text describing that show, one is better served by embeddings extracted from single- or multi-task deep neural networks that integrate the information non-linearly to deliver more robust comparisons at this level.

Furthermore, there is “No Free Lunch” in terms of the choice of algorithm and representation. That is to say, no one method for creating embeddings can single-handedly capture all of the rich information inherently contained in the vast collection of metadata and text we have for each show. Most of the current generation of sophisticated transformer-based text embedding algorithms rely on the sequential nature of words in a language. These expectations are not met for metadata such as cast, genre, or year; however, we can still use approaches such as doc2vec, GloVe, or LightFM to combine text and metadata as bags of words with no inherent ordering. In this fashion, the co-occurrence of the metadata and text can be leveraged to infer similarity relationships between these items. Furthermore, Autoencoders and specific task-oriented neural networks are the most flexible for building powerful embedding spaces that can leverage any kind of input (binary, numeric, categorical, and even other embeddings) to yield robust models.

Let’s take a look at a more concrete example of how we generate specialized embeddings using PyTorch. A good example of creating a deep neural network using PyTorch that fulfills two goals for an AVOD use case is what we call the Genre Spectrum. In a nutshell, Tubi’s Genre Spectrum Model is designed to digest the potentially numerous and subjective assignment of genres to titles and yield a couple of useful products:

  1. A map of up to four genres that capture the essence of the title, each ranked with a weight that indicates how much of the title is explained by that genre. These weights are proportions and add up to 1. For example, {action:0.45, western:0.35, adventure:0.2}
  2. An embedding space that can be used to capture the holistic essence of genre for downstream modeling tasks
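The first product can be illustrated with a short sketch. How the Genre Spectrum model actually derives its weights is not described here, so the approach below is an assumption: start from hypothetical non-negative per-genre scores for a title, keep the top four, and renormalize them into proportions.

```python
def genre_spectrum_map(genre_scores, max_genres=4):
    """Keep the highest-scoring genres and renormalize their scores to sum to 1.

    genre_scores: dict of genre name -> non-negative score (hypothetical model output).
    """
    top = sorted(genre_scores.items(), key=lambda item: item[1], reverse=True)[:max_genres]
    total = sum(score for _, score in top)
    if total == 0:
        return {}  # nothing to explain the title with
    return {genre: score / total for genre, score in top}

# Hypothetical raw scores for one title.
raw_scores = {
    "action": 0.90,
    "western": 0.70,
    "adventure": 0.40,
    "comedy": 0.05,
    "romance": 0.02,
    "horror": 0.01,
}

spectrum = genre_spectrum_map(raw_scores)
for genre, weight in spectrum.items():
    print(f"{genre}: {weight:.2f}")
```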

The premise is that genres are continuous rather than discrete, and it behooves us to maintain multiple representations that can fit the various use cases we encounter. The following plot illustrates the Genre Spectrum embedding projected to 2D using UMAP.

UMAP projection of Genre Spectrum is very geographical and demonstrates that similar genres are close together

To build the Genre Spectrum model, we tap into our Embedding Menagerie, which is a veritable treasure trove of independent variables. The learning task for genre prediction is a multi-class, multi-label problem. That is, each title is assigned one or more genres from a small set and the goal of the task is to predict which classes describe a title. The dependent variable looks like this:

Multi-hot targets representing genre truth
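As a rough illustration of such targets, the sketch below turns genre assignments into multi-hot rows, one column per genre. The genre list and the title names are hypothetical.

```python
GENRES = ["action", "adventure", "comedy", "drama", "horror", "western"]
GENRE_COLUMN = {genre: i for i, genre in enumerate(GENRES)}

def multi_hot_target(assigned_genres):
    """One slot per genre: 1.0 if the title carries that genre, else 0.0."""
    target = [0.0] * len(GENRES)
    for genre in assigned_genres:
        target[GENRE_COLUMN[genre]] = 1.0
    return target

# Hypothetical genre assignments for three titles.
title_genres = {
    "title_1": ["action", "western"],
    "title_2": ["comedy"],
    "title_3": ["adventure", "drama", "horror"],
}

targets = {title: multi_hot_target(genres) for title, genres in title_genres.items()}
for title, row in targets.items():
    print(title, row)
```

Unlike a one-hot target, each row can contain several ones, which is what makes the task multi-label.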

The independent variables are selected from the Embedding Menagerie and focus on features that should carry a semblance of the genres of a title. Contextual vectors of various types from Doc2Vec and BERT are powerful on this front as are the cast and other metadata that capture the essence of a title. Thus, the records for training look like this:

Records for Training Genre Spectrum

Even though we have hundreds of thousands of exemplars to use to train this model, we always ensure that we have sufficient regularization to eliminate over-training so that our models are as powerful as they can be and generalize well to newly added titles. To this end, we leverage the DataLoaders in PyTorch to do data augmentation, dynamic rebalancing of mini-batches to handle imbalanced distributions, and other techniques to squeeze as much out of the precious truthed data as possible. Be aware that the coverage of metadata across so many titles can be sparse considering the domain: over 100 years of movies, a wide variety of quality, limited measurements before the advent of the Internet, and other factors. This means that values must be imputed or predicted for many titles.

For our purposes, the problem is posed as multi-label classification and uses the MultiLabelSoftMarginLoss loss function. The most interesting thing we do is to use the Dataset object to localize several regularization methods in mini-batch construction to maximize the utility of our data. To focus on one aspect of our training, we assume that linear combinations of members of the same class are likely to also be members of that class. So, we define a variable that indicates what fraction of the records in a mini-batch should be artificial combinations rather than actual records for an individual title. In practice, this allows us to have an effectively infinite number of records to train on. Thus, we minimize the potential for memorizing or over-fitting the dataset while also enabling a dynamic balance of class distributions. The only downside is that we cannot extrapolate effectively and hypothesize about things we haven’t seen. In practice, though, we have found that training on 100% combos yields some of the best results. Furthermore, we can also add noise, drop out dimensions, and randomly perturb the dependent variables (randomly removing or adding genres), each of which can make network predictions more robust.
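Two of these ideas can be sketched in plain Python. The first function computes, for a single record, the quantity that MultiLabelSoftMarginLoss averages over classes: the binary cross-entropy of the sigmoid of each logit. The class that follows is a hypothetical map-style dataset (it exposes `__len__` and `__getitem__`, the protocol a PyTorch Dataset follows) that returns artificial combinations of same-genre records for a chosen fraction of lookups. The mixing weight, the partner selection, and the use of a convex combination for the target are assumptions made for illustration, similar in spirit to mixup; they are not taken from Tubi’s code.

```python
import math
import random

def multilabel_soft_margin_loss(logits, targets):
    """Mean over classes of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]."""
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(logits)

class CombinationDataset:
    """Returns real records or convex combinations of two same-genre records."""

    def __init__(self, features, targets, combo_fraction=1.0, seed=0):
        self.features = features
        self.targets = targets
        self.combo_fraction = combo_fraction  # share of lookups that are artificial
        self.rng = random.Random(seed)
        # Index records by genre so a partner sharing a genre can be drawn.
        self.by_genre = {}
        for i, target in enumerate(targets):
            for genre, flag in enumerate(target):
                if flag:
                    self.by_genre.setdefault(genre, []).append(i)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x, y = self.features[idx], self.targets[idx]
        genres = [g for g, flag in enumerate(y) if flag]
        if not genres or self.rng.random() >= self.combo_fraction:
            return list(x), list(y)  # actual record
        partner = self.rng.choice(self.by_genre[self.rng.choice(genres)])
        px, py = self.features[partner], self.targets[partner]
        lam = self.rng.random()  # mixing weight in [0, 1)
        mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x, px)]
        mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y, py)]
        return mixed_x, mixed_y

# Hypothetical toy data: three titles, two features, three genres.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
targets = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

all_combos = CombinationDataset(features, targets, combo_fraction=1.0)
mixed_x, mixed_y = all_combos[0]
print(mixed_x, mixed_y)
print(multilabel_soft_margin_loss([0.0, 0.0, 0.0], targets[0]))  # log(2) for all-zero logits
```

Because each lookup draws a fresh partner and mixing weight, repeated passes over the same titles rarely produce the same record twice, which is the sense in which the supply of training records becomes effectively unlimited.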

The Genre Spectrum models developed using PyTorch are powerful for characterizing all inventory we track consistently and systematically, while the embeddings serve as great features for enhancing personalization.

In conclusion, we’ve shown how Tubi leverages PyTorch to support Content Understanding, especially for building the collection of features that we call the Embedding Menagerie. The road ahead is leading us to knowledge graphs to facilitate a more structured representation of facts consisting of entities, relationships, and semantic descriptions. Knowledge graphs allow one to model a large amount of information intuitively and expressively. As NLP continues to advance, it can help us with future applications, such as unifying objects of interest (movies, series, stars, characters, genres, and other nouns) in the world of movies and TV shows and allowing us to compare them in a direct way.

Simple Knowledge Graph

In future posts, we’ll take a deeper look at other aspects of ML at Tubi. Stay tuned! If you’re interested in learning more, follow PyTorch on Medium.