Introduction to Logistic Regression in PySpark | by Gustavo Santos

Tutorial to run your first classification model in Databricks

Towards Data Science
Photo by Ibrahim Rifath on Unsplash

Big Data. Large datasets. Cloud…

Those words are everywhere: following us around, on the minds of clients, interviewers, managers, and directors. As data gets more and more abundant, datasets keep growing to the point where, sometimes, it is no longer possible to run a machine learning model in a local environment, that is, on a single machine.

This requires us to adapt and find other solutions, such as modeling with Spark, one of the most widely used Big Data technologies. Spark supports languages such as SQL, Python, Scala, and R, and it has its own methods and attributes, including its own Machine Learning library, MLlib. When you work with Python on Spark, for example, it is called PySpark.

Furthermore, there's a platform called Databricks that wraps Spark in a well-designed layer that lets data scientists work with it much like they would in Anaconda.

When creating an ML model in Databricks, we could also use Scikit-Learn models, but since we're more interested in Big Data, this tutorial is built entirely with Spark's MLlib, which is better suited for large datasets. Along the way, we also add a new tool to our skill set.

Let’s go.

The dataset for this exercise is already available inside Databricks. It's the Adult dataset from the UCI repository, an extract from a Census in which each individual is labeled as earning less or more than $50k per year. The data is publicly available at this address:

Our goal in this tutorial is to build a binary classifier that tells whether a person makes less or more than $50k of income in a year.

In this section, let’s go over each step of our model.

Here are the modules we need to import.

from pyspark.sql.functions import col
from pyspark.ml.feature import UnivariateFeatureSelector
from pyspark.ml.feature import RFormula
from pyspark.ml.feature import StringIndexer, VectorAssembler
from …
