Data Cleaning/Analysis: Python (Pandas) vs. SQL
In data science, every dataset needs to be analyzed, whether it lives in a csv, tsv, Excel file, or even a SQL database. For Python, I believe the easiest way to analyze data is with Pandas. And since data is increasingly stored in databases, it is also important to know how to do some of these things in SQL. For the purposes of my demonstration, I am going to show ways to do things in both Pandas and SQL. The dataset I will be using is an automobile dataset showing miles per gallon and other details of cars. It can be found here. I have put all of this code in a Jupyter Notebook to demonstrate everything. For the SQL component, I have chosen to use PostgreSQL directly in my notebook so I can easily switch back and forth between the two. On that note, keep in mind that SQL is a programming language that has many different 'dialects', if you will, and the syntax can vary depending on which type of SQL you are using.
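To follow along, here is a minimal setup sketch for loading the csv into Pandas and connecting the notebook to PostgreSQL. The file name, database name, and credentials below are placeholders of mine, not part of the dataset, so swap in your own:
#Pandas -- read the csv into a DataFrame (file name is a placeholder)
import pandas as pd
df = pd.read_csv('auto-mpg.csv')

#SQL -- load the sql magic and connect to a local PostgreSQL database
#(the connection string is a placeholder)
%load_ext sql
%sql postgresql://username:password@localhost:5432/cars_db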
Checking Your Columns & DataTypes:
Once you successfully load your data into your environment, whether that is Pandas or SQL, it is important to check the names of your columns and your datatypes. The column names will give you insight into what information is being presented, and the datatypes will tell you the kind of information it is, whether it is numerical, categorical, textual, or something else.
Checking your datatypes in Pandas is very simple; it can be done with one line of code:
df.dtypes
The result will look something like this:
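If you want the column names on their own, or a combined overview of columns, datatypes, and non-null counts, Pandas has one-liners for that too:
#just the column names
df.columns

#column names, datatypes, and non-null counts in one view
df.info()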
In SQL, the syntax for this will vary depending on the type of SQL you are using. I have loaded this csv file into a database with a table called 'cars'. Once you connect to your database, if you use the %load_ext sql magic command (which I highly recommend), you can type SQL commands directly into the notebook. Here is how to get the column names and datatypes in SQL:
%%sql
-- if you use this magic command, you can type SQL queries
-- directly in your Jupyter Notebook
SELECT
column_name,
data_type
FROM
information_schema.columns
WHERE
table_name = 'cars';
The output will look like this:
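As an aside, if you are wondering how the csv ended up in the database in the first place, one convenient route is Pandas' .to_sql() method with a SQLAlchemy engine. This is just a sketch; the connection string is a placeholder:
#create a SQLAlchemy engine pointing at the same PostgreSQL database
from sqlalchemy import create_engine
engine = create_engine('postgresql://username:password@localhost:5432/cars_db')

#write the DataFrame to a table called 'cars', replacing it if it already exists
df.to_sql('cars', engine, index=False, if_exists='replace')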
Changing Your DataType:
One of the most important things to know is how to change your datatype so that your data is in the correct format. In this dataset, mpg is a float. Let's say we would rather have it as an integer, so we would have to change the datatype. Here is how to do this:
#Pandas
df['mpg'] = df['mpg'].astype('int')

#SQL
%%sql
ALTER TABLE cars
ALTER COLUMN mpg TYPE integer;
For Pandas, you simply use the .astype() method, and for SQL you use the ALTER TABLE / ALTER COLUMN command. Both of these snippets modify the data in place, which means that the stored data is permanently affected by the change.
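If you would rather not touch the stored data, you can keep the original column and work with a converted copy instead. Here is one way to do that in Pandas (the new column name is just an example):
#keep the original float column and add an integer copy alongside it
df['mpg_int'] = df['mpg'].astype('int')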
Looking at your Summary Statistics:
With the numerical data, it is important to look at the summary statistics so that we understand the information in front of us. We can look at things like the mean, standard deviation, percentiles, and the counts. This is very easy to do for both Pandas and SQL:
#Pandas
df.describe()

#SQL
SELECT COUNT(mpg), AVG(mpg) AS mean, STDDEV(mpg), MIN(mpg),
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY mpg) AS TWENTYFIVE,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY mpg) AS FIFTY,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY mpg) AS SEVENTYFIVE,
       MAX(mpg)
FROM cars;
With the Pandas .describe() method, you get a table covering ALL of the numeric columns, with the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (the median), 75th percentile, and maximum value. These descriptive statistics give you a preliminary look at the distribution of your data.
For the SQL version, you would need to repeat this query for each numeric column; in the example above, I calculated these statistics for a single column of data.
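On the Pandas side, you can also narrow .describe() down to specific columns if you do not want the full table; the column names here are just examples from this dataset:
#summary statistics for a single column
df['mpg'].describe()

#summary statistics for a selected subset of columns
df[['mpg', 'weight']].describe()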
Checking for Missing Values:
Normally, in a dataset, there will be some missing values that need to be accounted for. It is important to check for them. We can do this like so:
#Pandas
df.isnull().sum()

#SQL
SELECT COUNT(*) AS missing
FROM cars
WHERE cylinders IS NULL;
When both of these snippets are run, the output will show that there is 1 missing value. In Pandas, we get a table of ALL the columns and the number of missing values in each. In SQL, we would have to run this count individually for each column of interest; in the example above, I did this for the cylinders column.
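If you also want to see which rows are affected, or the share of missing values rather than the raw count, Pandas makes that a one-liner as well:
#rows where cylinders is missing
df[df['cylinders'].isnull()]

#percentage of missing values in each column
df.isnull().mean() * 100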
Filling in Missing Values:
Depending on the situation, missing values should either be removed or imputed. Normally if the dataset is large, it would be acceptable to remove a few rows. If not, or if other circumstances are present, we should impute these values which means to fill them with another value. If the distribution of these values is normal, the mean is a great way to fill them in. It is very easy to do this:
#Pandas
df['cylinders'].fillna(df['cylinders'].mean(), inplace = True)

#SQL
%%sql
UPDATE cars
SET cylinders = (SELECT AVG(cylinders) FROM cars)
WHERE cylinders IS NULL;
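And if you decide the rows should simply be removed instead of imputed (the other option mentioned above), that is just as short in Pandas; the SQL counterpart would be a DELETE with the same WHERE cylinders IS NULL condition:
#Pandas -- drop any row that has at least one missing value
df.dropna(inplace=True)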
So, you see, data cleaning and data analysis are routine parts of investigating a dataset. Seeing this from both a Python and a SQL perspective, we know that there are multiple ways to go about it. You can clean a DataFrame using Python's Pandas module and/or you can clean a database using SQL. Either way, the cleaning and analysis of data is imperative and will set you up for success in your next steps.
Also, for a great blog on how to use SQL in a Jupyter Notebook, please check out Andrei Teleron's blog. It really helped me out!
References:
SQL in Jupyter — https://medium.com/analytics-vidhya/postgresql-integration-with-jupyter-notebook-deb97579a38d
Pandas documentation — https://pandas.pydata.org/docs/
PostgreSQL documentation — https://www.postgresql.org/docs/
Data — https://www.kaggle.com/uciml/autompg-dataset?select=auto-mpg.csv