7 Data Science Libraries That Will Make Your Life Easier in 2022


These Python libraries will save you a lot of time this year

When doing Data Science, you can end up wasting a lot of time coding and waiting for the computer to run something. I have selected a few Python libraries that can save you time in both situations. Even if you incorporate just one of them into your arsenal, you can still save precious time the next time you work on a project.

Optuna is an open source hyperparameter optimization framework. That means it helps you find the best hyperparameters for your machine learning models.

The most basic (and probably well-known) alternative is sklearn’s GridSearchCV, which will try multiple combinations of hyperparameters and select the best one, based on cross-validation.

GridSearchCV will try combinations within a space previously defined by you. For a Random Forest Classifier, for instance, you might want to test a few different values for the number of estimators and the maximum depth of a tree. So you would give GridSearchCV all the possible values for each of these hyperparameters, and it would look at all the combinations.
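The exhaustive approach just described can be sketched with scikit-learn (a toy grid for illustration; real searches usually cover more values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy dataset standing in for real data
X, y = make_classification(n_samples=300, random_state=0)

# GridSearchCV evaluates every combination in the grid with cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

With 2 x 3 = 6 combinations and 3 folds, that is already 18 model fits, which is why exhaustive grids get expensive fast.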

With Optuna, on the other hand, you start by defining a search space, where it will begin looking. It then uses the history of its own attempts to decide which values to try next. The method it uses for this is a Bayesian optimization algorithm called the “Tree-structured Parzen Estimator”.

That different approach means that, instead of naively trying out arbitrary values, it looks for the best candidates before trying them, which saves time that would otherwise be spent trying unpromising alternatives (and possibly yields better results too).

Finally, it’s framework-agnostic, meaning that you can use it with TensorFlow, Keras, PyTorch or any other ML framework.

ITMO_FS is a feature selection library, meaning it helps you select features for your ML model. The fewer observations you have, the more cautious you need to be about having too many features, in order to avoid overfitting. By being “cautious”, I mean you should regularize your model. It is also generally better to have a simpler model (fewer features), since it’s easier to understand and explain.

ITMO_FS can help you with that, with algorithms split into 6 different categories: supervised filters, unsupervised filters, wrappers, hybrid, embedded, ensembles (although it focuses mostly on supervised filters).

A simple example of a “supervised filter” algorithm would be selecting features according to their correlation with the target variable. A well-known example of a “wrapper” is Eminem. Just kidding 🙂 I meant “backward selection”, where you try removing features one by one to see how that affects your model’s predictive power.
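ITMO_FS ships its own wrapper implementations, but the idea of backward selection is easy to illustrate with scikit-learn’s SequentialFeatureSelector as a stand-in (the estimator and feature counts below are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=2, random_state=0)

# direction="backward" starts from all 10 features and repeatedly drops the
# one whose removal hurts cross-validated performance the least
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="backward",
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the 3 surviving features
```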

Here’s a vanilla example of how to use ITMO_FS and the impact it can have on model scores:

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS
>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)
>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)

ITMO_FS is a relatively new library, so it’s still a bit unstable and its documentation could be a bit better, but I still suggest you give it a try.

So far, we have seen libraries for feature selection and hyperparameter tuning, but why not both at the same time? That’s the promise of shap-hypetune.

Let’s start by understanding what’s “SHAP”:

“SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.”

SHAP is one of the most widely used libraries for the interpretation of models, and it works by yielding the importance of each feature on the model’s final predictions.

shap-hypetune, on the other hand, benefits from that approach to select the best features while also selecting the best hyperparameters. Why would you want that? Selecting features and tuning hyperparameters independently can lead to suboptimal choices, because the interactions between them are not taken into account. Doing both at the same time not only takes that into consideration, but also saves you some coding time (although it might increase run time, due to the larger search space).

The search can be done in 3 ways: grid-search, random-search, or bayesian-search (plus, it can be parallelized).

One important caveat, though: shap-hypetune only works with gradient boosting models!

PyCaret is an open-source, low-code machine learning library that automates machine learning workflows. It covers exploratory data analysis, preprocessing, modelling (including explainability) and MLOps.

Let’s look at some practical examples from their website to see how it works:

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
Source: PyCaret website

In just a few lines of code, you have tried multiple models and compared them across the main classification metrics.

It also allows you to create a basic app to interact with your model:

from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice, target = 'Purchase')
lr = create_model('lr')
create_app(lr)  # launches a basic Gradio app for the model

Finally, you can easily create an API and Docker files for your model:

from pycaret.datasets import get_data
juice = get_data('juice')
from pycaret.classification import *
exp_name = setup(data = juice, target = 'Purchase')
lr = create_model('lr')
create_api(lr, 'lr_api')

It can’t get much easier than that, right?

It is such a complete library that it’s hard to cover it all here, so I’ll probably dedicate a full article to it in the near future. In the meantime, I suggest you download it and start playing with it to get a sense of its capabilities in practice.

floWeaver generates Sankey diagrams from a dataset of flows. If you don’t know what a Sankey diagram is, here’s an example:

[Sankey diagram example by SevenandForty, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons]

They can be really helpful when displaying data about conversion funnels, marketing journeys, or budget distributions in a company or government (as in the example above). The input data should be in the format “source x target x value”, and it takes just one line of code to create this type of plot (which is quite specific, but also very intuitive).
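As a sketch of that input shape, here is a hypothetical conversion-funnel dataset in pandas (the numbers and column names are made up for illustration; floWeaver then weaves such flow records into the diagram):

```python
import pandas as pd

# Hypothetical funnel flows in "source x target x value" form
flows = pd.DataFrame(
    [
        ("visitors", "signed_up", 120),
        ("visitors", "bounced", 380),
        ("signed_up", "purchased", 45),
        ("signed_up", "churned", 75),
    ],
    columns=["source", "target", "value"],
)
print(flows)
```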

If you’ve read Agile Data Science, you know how helpful it can be to have a front-end interface that lets your end user interact with the data from the start of a project. It also helps you get acquainted with the data and spot inconsistencies. One of the most common tools for this is Flask, but it is not very beginner-friendly: it requires multiple files and some knowledge of HTML, CSS, and so on.

Gradio allows you to create simple interfaces by setting your types of input (text, checkbox, etc.), your function and your outputs. Although it seems to be less customizable than Flask, it is much more intuitive.

Plus, since Gradio has now joined Hugging Face, they provide the infrastructure to permanently host your Gradio model on the internet, for free!

The best way to understand Terality is to think of it as “Pandas, but faster”. That doesn’t mean replacing pandas altogether and having to re-learn how to work with dataframes: Terality has the exact same syntax as Pandas. In fact, they even suggest you “import terality as pd”, and keep on coding the same way you are used to.

How much faster is it? Their website sometimes says it’s 30x faster, sometimes 10–100x faster.

Another big feature is that Terality allows for parallelization and it doesn’t run locally, which means your 8GB RAM laptop will stop throwing MemoryErrors!

But how does it work behind the scenes? A good way to understand Terality is to think of it as a Pandas front-end, which you use locally, attached to a Spark-like back-end that runs on their infrastructure.

Basically, instead of running things on your computer you would be using theirs, in a full serverless way (meaning no infrastructure setup is needed).

What’s the catch, then? Well, you can only process up to 1TB of data per month for free. If you need more than that, you’ll have to pay at least $49 per month. 1TB/month may be more than enough for testing the tool and for personal projects, but for actual company usage you will probably need to pay.