More Data Please!
Need more data? No problem. Let’s say you have a very small dataset for your classification model and you want to give it more data to train on. One thing you can do is generate synthetic data. I am going to show you how to do this using a Python library called faker.
I have downloaded an advertising dataset from Kaggle for demonstration purposes. This is a preview of it:
This classification dataset is used to predict whether or not a person clicked on an ad. To create the mock data you start by importing the faker library and creating an instance of the Faker class.
# import the necessary libraries
from faker import Faker

# create an instance of the Faker class
create = Faker()
Before I continue, I want to point out how versatile this class is. It has many built-in methods that let you create all sorts of fake data for your model. You can even call Faker.seed() to get the same results each time! Now we will check the dtypes of our dataset so we understand how to generate each column appropriately:
For the float and integer types, it is very simple. In the cell below I will show you how to generate this information. I set the min and max values to demonstrate some of the parameters that can be specified. The values I chose mimic the ones in the existing dataset:
# You can use the pyfloat or pyint methods to get numerical data.
# Daily Time Spent on Site
round(create.pyfloat(min_value=32, max_value=91), 2)
# Age
create.pyint(min_value=19, max_value=61)
# Area Income
round(create.pyfloat(min_value=13996, max_value=79484), 2)
# Daily Internet Usage
round(create.pyfloat(min_value=104, max_value=269), 2)
# Clicked on Ad
create.pyint(min_value=0, max_value=1)
This library is not just for numeric data; there are many built-in options for your object data as well. It can generate things such as addresses, credit card numbers, email addresses and even boolean values. This can come in handy for anonymizing personal information that may be stored in your datasets. In the cell below, I am going to show you how to get data that is specific to the advertising dataset:
# You can create random text and specify characters, generate cities, and even countries
# Ad Topic Line
' '.join(create.text().split(' ')[:3])
# City
create.city()
# Country
create.country()
There are options for datetime objects as well. Below I have generated random dates within the boundaries of the original dataset:
# You can generate all sorts of date information with their many methods.
import datetime

start = datetime.date(2016, 1, 1)  # Min start date of dataset
end = datetime.date(2016, 7, 24)  # Max start date of dataset

# Timestamp
create.date_time_between_dates(
    datetime_start=start,
    datetime_end=end
)
You may have noticed I left out the Sex column. Sex is not a defined method in this class, but that is not a problem. You can write your own provider with the method you need and add it to your Faker instance, as I did below:
from faker.providers import BaseProvider
import numpy as np

# create a new provider class to define your method
class MyProvider(BaseProvider):
    def sex(self):
        sex = ['M', 'F']
        return np.random.choice(sex)

# use the add_provider method to add your creation(s)
create.add_provider(MyProvider)

# access your new method like so:
# Sex
create.sex()
I took this code and wrapped it in a function that outputs the new fake dataset. Here is a sample of the finished product:
Conclusion:
The faker library is a quick way to generate artificial data to train your models. It has a wide variety of built-in methods to fill various datatypes, and it even allows you to create custom methods for specific data you may need. I encourage you to explore this library further to see everything it offers.