Apache Spark with PySpark

Zachary Greenberg
4 min read · Oct 28, 2021

In Data Science, we are expected to work with extremely large quantities of data, so much so that a typical, run-of-the-mill computer will not have the power to handle it all. One of the tools we can use to work with this data (often called Big Data) is Spark.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. — as defined by Wikipedia

Spark is particularly useful to Data Scientists because of its capabilities for data analytics. Spark works by distributing our data and computations across a cluster of machines, often virtual machines in the cloud. With many machines processing pieces of the data in parallel, we can get our results much faster.

Spark was originally written in Scala, a programming language, but since Python has become such a favorite, a Python API for Spark has been developed: PySpark. Through PySpark we can work with Spark and Big Data in a programming language we Data Scientists know and love.
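If you want to follow along, PySpark can typically be installed with pip (pip install pyspark). This is just a minimal local setup sketch; a real cluster deployment would need its own Spark and Java installation. Once installed, we can import it and confirm the version:

# after installing with: pip install pyspark
import pyspark

# print the installed Spark version to confirm the setup works
print(pyspark.__version__)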

For the code notebook I have provided, I am not actually connected to the cloud; I am simply using the notebook to demonstrate some simple procedures in Spark. They are similar to our beloved Pandas, but because of some differences in syntax and rules, I thought it would be helpful (for myself as well) to run through some of the basics:

The dataset I am using is the one I created from this blog. It is a small dataset showing information on the top 25 Broadway shows. While the dataset is not exactly a large one, the syntax will be the same regardless. In order to connect to Spark, we have to create a session, which will allow us to use Spark and all its capabilities. We establish a connection like this:

# connect to Spark by creating (or reusing) a SparkSession
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()

The spark variable will be our connection to Apache Spark and allow us to do everything we need to do. Next thing we’ll do is read in the data.

broadway = spark.read.csv('broadway.csv', header=True)
broadway.show(5)

The above is the equivalent of df.head() in Pandas; the default number of rows for show() is 20. The next thing we would do is check the datatypes. In PySpark we do that with this method:

broadway.printSchema()

We can now see the datatypes and whether or not null values are allowed in each column (nullable = true means yes). The only column that would give us trouble here is ‘Current Ticket Cost’. It should be a float, and we can change it very easily:

broadway = broadway.withColumn('Current Ticket Cost', broadway['Current Ticket Cost'].cast('double'))

Double is PySpark’s double-precision version of a float. And just like in Pandas, we can reassign the DataFrame to keep the change; we are essentially overwriting that column so its values have the datatype ‘double’. Now, let’s say we want to do some string manipulation. We don’t want the word ‘Theatre’ in any of the values of the Theater column. This is how we do it in PySpark:

from pyspark.sql import functions as F

broadway = broadway.withColumn('Split', F.split(broadway.Theater, 'Theatre'))
broadway = broadway.withColumn('Theater', broadway.Split.getItem(0))
broadway = broadway.drop('Split')

From the pyspark.sql module we can import functions, which gives us access to a multitude of column functions that can be used as needed. A lot of these functions, like split, are ones we already know. A list can be found here. So, first we create a column, ‘Split’, to do the manipulation, then we overwrite the ‘Theater’ column with the piece of the split we want to keep. When we are done, we delete the helper column we no longer need. The syntax is a little different from Pandas and requires more steps, but it can be done.
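As a side note, the same cleanup could likely be done in a single step with regexp_replace and trim from the same functions module. This is just an alternative sketch, assuming the unwanted text is literally the word ‘Theatre’ inside each value:

# alternative sketch: remove the word 'Theatre' and trim leftover whitespace in one step
broadway = broadway.withColumn(
    'Theater',
    F.trim(F.regexp_replace(broadway.Theater, 'Theatre', ''))
)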

Next, with the DataFrame in good shape, we are going to save it as a .parquet file. Parquet is a columnar file format from the Apache project that Spark can read and write very efficiently, making for faster processing, and I would recommend doing this. We can convert the DataFrame like this:

broadway.write.parquet('broadway.parquet', mode='overwrite')

# reading the parquet back in under a new variable name to differentiate it from the original dataframe
shows = spark.read.parquet('broadway.parquet')

Now that we have it in this format, we can do many things. Just like Pandas, we can get summary statistics of numerical columns:

shows.describe().show()

We can also filter the DataFrame to find rows that meet specific conditions:

#looking for the shows that cost less than $60
shows['Shows', 'Cost'].filter(shows.Cost < 60).show()

We can even register the DataFrame as a temporary SQL table and write SQL queries against it if desired:

#creating a temporary table
shows.createOrReplaceTempView('top25')
#looking for the top 3 most expensive shows
spark.sql('SELECT Shows, Cost FROM top25 WHERE Cost > 65.00 ORDER BY Cost DESC LIMIT 3').collect()
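One thing to note: collect() brings the results back to the driver as a list of Row objects. If a table-style printout is preferred, the same query can end in .show() instead, and for a small result like this one it could even be pulled back into Pandas. A hedged sketch, assuming pandas is installed:

# same query, printed as a table instead of returned as Row objects
spark.sql('SELECT Shows, Cost FROM top25 WHERE Cost > 65.00 ORDER BY Cost DESC LIMIT 3').show()

# or convert the small result set into a Pandas DataFrame
top3 = spark.sql('SELECT Shows, Cost FROM top25 WHERE Cost > 65.00 ORDER BY Cost DESC LIMIT 3').toPandas()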

In conclusion, learning how to work with Big Data can be as easy as learning how to work with Spark, and thanks to the Python API for Spark, it can be even easier. It is important to understand the differences in syntax from what we are used to in Pandas. It is also important to know about the Parquet file format, which can help make our analysis process even faster.

References:

Apache Spark — https://en.wikipedia.org/wiki/Apache_Spark

Big Data — https://en.wikipedia.org/wiki/Big_data

Parquet — https://parquet.apache.org/

PySpark — http://spark.apache.org/docs/latest/api/python/

PySpark Functions — https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html
