In beginner SQL, most queries return a single piece of information, or a single piece of information per group. As we get more advanced, it becomes clear that collapsing rows this way loses some of the information, and that there is much more to be seen. The way to access that extra information in SQL is through window functions. We must open a window to get this data, no pun intended.

‘A window function performs a calculation across a set of table rows that are somehow related to the current…
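To make the difference concrete, here is a minimal sketch that runs SQL through Python's built-in sqlite3 module. The sales table, its columns, and its values are made up for illustration, and I am assuming a SQLite build of 3.25 or newer, since that is when window functions were added.

```python
import sqlite3

# Hypothetical table and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Ann", 100), ("East", "Bob", 300), ("West", "Cy", 200), ("West", "Di", 400)],
)

# GROUP BY collapses each region into a single row.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# A window function keeps every row and attaches the regional total to it.
windowed = conn.execute(
    """
    SELECT rep, amount, SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY rep
    """
).fetchall()

print(grouped)
print(windowed)
```

The grouped query returns one row per region, while the windowed query still returns all four reps, each with its region's total alongside its own amount.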


SQL to a relational database is like a translator for a person speaking another language. This ‘other language’ is data, of course. It is crucial that data scientists learn a good amount of SQL so that we can do our job of making the data talk. Much like a large SQL database, any language can be vast. There are a few concepts that data scientists should be aware of that will make it easier to communicate on both ends. What I am referring to specifically are common table expressions and views.
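As a rough sketch of how the two differ, the snippet below runs both through Python's built-in sqlite3 module. The employees table, the view name, and the salary cutoff are all hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical table and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 40000), ("Bob", 65000), ("Cy", 90000)],
)

# A common table expression names a subquery for the duration of ONE statement.
cte_rows = conn.execute(
    """
    WITH high_earners AS (
        SELECT name, salary FROM employees WHERE salary > 50000
    )
    SELECT name FROM high_earners ORDER BY name
    """
).fetchall()

# A view stores the same query under a name that later statements can reuse.
conn.execute(
    "CREATE VIEW high_earners_view AS "
    "SELECT name, salary FROM employees WHERE salary > 50000"
)
view_rows = conn.execute("SELECT name FROM high_earners_view ORDER BY name").fetchall()

print(cte_rows)
print(view_rows)
```

Both queries return the same rows here. The practical difference is scope: the CTE disappears when its statement finishes, while the view stays in the database until it is dropped.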

These two concepts are much like the WHERE…



Big O Notation is an important concept for Data Scientists to consider. Our job deals with large amounts of data on a daily basis. The larger the data, presumably the longer our code and programs will take to run. When we are mindful of Big O Notation, we can often find more efficient solutions that make better use of time.

What is Big O Notation?

If you could not guess from the paragraph above, here is the technical definition:

Big O notation is a mathematical notation that describes the limiting behavior of a function when…
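A small sketch can make this less abstract. The two functions below are my own illustrative examples (the names are hypothetical) that answer the same question, whether a list contains a duplicate, with very different growth rates.

```python
def has_duplicates_quadratic(items):
    """Compare every pair of items: about n * n / 2 comparisons, so O(n^2)."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Track seen items in a set: one pass with O(1) average lookups, so O(n)."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


print(has_duplicates_quadratic([1, 2, 3, 2]), has_duplicates_linear([1, 2, 3, 2]))
```

Both give the same answers, but doubling the input roughly quadruples the work for the first version and only doubles it for the second. The linear version pays for that speed with extra memory for the set.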



In data science, every dataset needs to be analyzed, whether it lives in a CSV, a TSV, an Excel file, or even a SQL database. For Python, I believe the easiest way to analyze data is with Pandas. And because data is commonly stored in a database, it is also important to know how to do some of these things in SQL. For the purposes of my demonstration, I am going to show ways to do things in both Pandas and SQL. The dataset I will be using is an automobile dataset showing miles per gallon and other details of cars…



‘Teamwork makes the dream work’ is not only the title of a great book by John Maxwell, it is a saying that proves to be true when it comes to machine learning models. What is being referred to here is ensemble methods. Ensemble methods refer to a specific modeling process in which multiple models come together to generate a singular outcome. There is a method to this madness of multiple models. Together, these models act much like cross validation, aiming to reduce overfitting and overall model error. …
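One simple way multiple models can produce a singular outcome is a majority vote. The sketch below is only an illustration of that idea, not a full ensemble library: the three "models" are hypothetical threshold rules I made up, standing in for trained classifiers.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical "models", each a simple rule on a single number.
models = [
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 7 else "low",
    lambda x: "high" if x > 3 else "low",
]


def ensemble_predict(x):
    """Ask every model for a label, then let the majority decide."""
    return majority_vote([model(x) for model in models])


print(ensemble_predict(6))  # two of the three rules say "high"
```

With an odd number of voters and two classes there is never a tie, which is one reason small voting ensembles often use three or five models.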



To ensure success for the day, nutritionists will recommend people eat a balanced breakfast. To ensure a successful classification model, data scientists will recommend that your dataset is balanced. What this means is that your dataset should have approximately equal representation of the categories/classes present. If you train a classification algorithm on data that is imbalanced, the model may be compromised when introduced to new data. The model would believe that a typical dataset looks like the imbalanced one you are training on and would most likely make tons of errors. Not to worry, there are methods we…
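One common rebalancing method is random oversampling, where minority-class rows are duplicated until the classes are even. I am assuming this is among the methods meant here, since the text above is cut off. The function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Rows of this class that are available for duplication.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))
            new_labels.append(cls)
    return new_rows, new_labels


# Toy data: 8 rows of one class and 2 of the other.
labels = ["no"] * 8 + ["yes"] * 2
rows = list(range(10))
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling like this is normally applied to the training split only, so that the validation and test data still reflect the class mix the model will see in practice.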


In my last post, I wrote about the metrics typically considered for continuous data. Now, I will continue the conversation and move on to part two, which covers the metrics used for categorical data.

What is categorical data?

‘Categorical data is a collection of information that is divided into groups. I.e, if an organisation or agency is trying to get a biodata of its employees, the resulting data is referred to as categorical.’ — as defined by FormPlus.

When we think of categorical data in data science, we typically think of classification algorithms. …
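As a sketch of what evaluating a classifier can look like, the function below computes four widely used metrics from true and predicted labels. Treating accuracy, precision, recall, and F1 as the metrics in question is my assumption, since the text above is cut off. The function name and the sample labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.75 while precision and recall are both two thirds, a small reminder that accuracy alone can look rosier than the positive-class performance really is.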


In data science, we have the capability to predict on both continuous and categorical data. These are the two general camps of data today — notice I did not say categories purposefully to avoid any confusion — and for the types of models we create around them we have various metrics we use to evaluate them. It is of paramount importance that we choose the best one. It is also important to understand that the best one for the model should be heavily dependent on the context of the problem we are trying to solve. …
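For continuous targets, three metrics that come up often are mean absolute error, mean squared error, and root mean squared error. The sketch below computes them by hand. Picking these three is my assumption, and the function name and sample numbers are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE and RMSE from paired true and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Made-up values for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

Because MSE and RMSE square each error, the single miss of 2 counts for more than the two misses of 1 combined. That is the kind of context-dependent trade-off to weigh when choosing between them and MAE.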


The image above is a great visualization of the need for regularization. The rightmost graph shows overfitting, where the model is fit too closely to the particular dataset. The leftmost is more generalized, which is what we are looking for. Through the use of regularization, the model can look more like the leftmost graph.

‘Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients. In simple terms, it reduces parameters and shrinks (simplifies) the model.’ — as defined by StatisticHowTo

Regularization is a great technique data scientists can use to avoid overfitting. Overfitting is when the model is so well trained on the training data that it is unable to pick up on trends of the validation and/or test data set. With overfitting, the model is unable to generalize, creating more complexity and high variance. …
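To see the penalty at work, here is a minimal sketch of ridge (L2) regularization on the simplest possible model, a single slope with no intercept. For that model the penalized least-squares slope has the closed form w = Σxy / (Σx² + λ). The function name and the data are hypothetical, and real projects would normally rely on a library implementation instead.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w * x) ** 2) + lam * w ** 2 for the model y ≈ w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Made-up data that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalized = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares
penalized = ridge_slope(xs, ys, lam=14.0)    # a strong penalty shrinks the slope
print(unpenalized, penalized)
```

With no penalty the slope is exactly 2, and as lam grows the coefficient is pulled toward zero. This is the shrinking of high-valued coefficients that the definition above describes.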


What is object oriented programming?

‘Object-oriented programming (OOP) is a computer programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data field that has unique attributes and behavior.’ — Alexander S. Gillis & Sarah Lewis

In Python, everything is an object. Each object contains its own attributes and methods. How do we know this? Check this out:

x = 'string '
x.capitalize() #will be 'String '
x.isalpha() #will be False, because the trailing space is not a letter
x.rstrip() #will be 'string'

In the example above we created a string, known by the…

Zachary Greenberg

Data Scientist / Singer
