How to Manipulate Datetimes:

4 min readNov 11, 2021

As a Data Scientist, it is important to understand how to manipulate date and time data, as these factors can be contributing variables to what is being studied or measured or predicted. Time series literally calls for the use of time data as it requires time to be the index of your dataset. Thankfully, Pandas makes it easy for us to work with this type of data.

For the purpose of this demonstration, I have downloaded a Taxi Trip Record dataset for the month of November of 2020 in New York City. You can find all of my code here. NOTE: I have also condensed it as it was rather large. The original dataset can be found on NYC.gov. November is a particularly interesting time for taxi cabs. It is the start of the holiday season, and the weather gets colder so one can assume that people do not walk as often as they would in the summer months.

The dataset we are looking at has 3 columns, the only ones we are truly interested in are pickup times and drop off times. I have trimmed the original dataset to only contain trips in November, as there were a few misprints in the data. Here is a sample of the dataset:

We should always check the datatypes when dealing with datetimes, as Pandas may not always catch that when we upload the data. We can change the columns appropriately like this:

df[['tpep_dropoff_datetime', 'tpep_pickup_datetime']] = df[['tpep_dropoff_datetime', 'tpep_pickup_datetime']].apply(pd.to_datetime, format = '%Y%m%d %H:%M:%S')

I am using the apply method here because I am changing multiple columns to datetime format. The format parameter is optional, and can be utilized to format the layout of the datetime column to your liking. Typically, the format expressed above is the standard.

Datetimes are pretty remarkable, we can very easily extract information from them using attributes and methods. For example, we can easily extract the month and the hour like this:

sample_datetime = datetime(2020,11,1,0,51,45)sample_datetime.monthsample_datetime.hour

We can also grab the year, day, minutes, and seconds very easily. Something even more special is that we can grab the day of the week. This involves the weekday method:

sample_datetime.weekday()

For our sample datetime: November 1st 2020 12:51:45 AM (keep in mind the hour is expressed out of 24 hours), the output of the code above is 6. What this means is that the weekday is a Sunday. The weekday method outputs a number between 0–6. 0 is the indicator of a Monday all the way up to 6 being the indicator of a Sunday.

Now, a great piece of information we can often use when working with datetime data is elapsed time or the difference between two datetime objects. This is commonly referred to as a timedelta. In Pandas, performing these differences is truly as easy as this:

df['delta'] =  df.tpep_dropoff_datetime-df.tpep_pickup_datetime

And now we have a new ‘delta’ column that looks like this:

We can do so many things with this information. For example we can look at the average trip duration for the entire month of November:

df.delta.mean()#the output is Timedelta('0 days 00:14:22.929241522')

To keep things more consistent in terms of a measurement, it’s good practice to convert these deltas into seconds. We can do that with the total_seconds method:

#I am overwriting the delta column converting these deltas into seconds
df['delta'] = df['delta'].dt.total_seconds()

From here we can do many things. Because we have the deltas and the pickup times column, we can look at the average deltas over a specified unit of time. We can use the resample method to change the measure of time for different comparisons. In the code below, I am looking at the average duration of trips by day:

df.resample('D', on = 'tpep_pickup_datetime')['delta'].mean().head()

We know that anytime a pickup is happening, there is a trip. By essentially grouping the dates by day, we can look at the duration of trips for each day in the month of November.

Finally, we know that Thanksgiving is a large travel holiday, if we isolate the data by that particular day, we can assess which hours were the most popular for taxi trips. This information can possibly be used to tell whether or not there may be traffic at a certain time.

First, we isolate the data for the 26th of November, which is Thanksgiving:

thanksgiving = df[df['tpep_pickup_datetime'].dt.day == 26]

Next, we look at the hour marker for each of pickup times, these are really the trip numbers. Using the value_counts method we can look at the frequency of trips by hour. We can even put it into a nice graph to visualize this information:

thanksgiving['tpep_pickup_datetime'].dt.hour.value_counts().head().plot.bar(color = 'red')
plt.title('Popular Trips by Hour on Thanksgiving')
plt.xlabel('Hour (in 24 hours)')
plt.ylabel('Frequency');

In summary, datetimes are important tidbits of information that can be easily manipulated and studied for great use. Knowing how to work with this data can be tricky, but thankfully Pandas makes it quite easy to do. We can look at things like extracting units of time, looking at deltas, and even isolating particular parts of a datetime object. When we know how to do this, we can really dig into a dataset for better understanding of the data in front of us.

References:

Data — https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Python datetime — https://docs.python.org/3/library/datetime.html

Datetime in Pandas — https://www.geeksforgeeks.org/python-working-with-date-and-time-using-pandas/

How to Manipulate Datetimes:

Written by Zachary Greenberg