An Introduction to Grouping in Pandas
By Ross Kippenbrock
This post was published on the yhat blog.
Grouping data is an integral part of many data analysis projects. The functionality for grouping in pandas is vast, but can be tough to grasp initially. Have no fear...we will get through a short introduction together using some data from NYC's beloved bike share program, Citi Bike.
Why would I want to group data?
Most of the time, you have a set of data that lends itself to being categorized or grouped. As a general example, let's say we have data on a wide variety of people. We may perform an analysis where we compare groups in the data based on age, gender, birth month, shoe size, or birth city; the options are as numerous as the data points!
The pandas groupby functionality draws from the Split-Apply-Combine method as described by Hadley Wickham from the land of R. It's a great approach to solving data analysis problems, and his paper on the subject is worth a read (it's linked in the resources section).
To summarize, he states that a common methodology for analyzing data comes from splitting the data into categories or groups based on some criteria, applying some aggregation function to each group (sum, mean, count), then combining the results for analysis, visualization or other means of better understanding.
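To make those three steps concrete before we touch any real data, here's a minimal sketch using a made-up toy frame (the frame and its column names are purely for illustration):

import pandas as pd

# a tiny made-up frame: one row per person
people = pd.DataFrame({'city': ['NYC', 'NYC', 'Boston'],
                       'age':  [25, 31, 40]})

# split on 'city', apply a mean to each group, combine into a single result
people.groupby('city')['age'].mean()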
Here's a graphic I came across illustrating the process:
(Figure: the split-apply-combine process, from Python for Data Analysis by Wes McKinney)
Sounds handy, but how do I do it in pandas?
Codes and Stuff
Just so we're on the same page, I'm running pandas 0.18.1.
I was looking around for an intriguing dataset and came across this data from Citi Bike, which is the NYC bike share program. It's pretty medium data at ~250MB CSV for one month's worth of data, and there was potential for some compelling findings with data grouping. Let's start down the rabbit hole...
import pandas as pd

df = pd.read_csv('data.csv')
df.head()
tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
538 | 5/1/2016 00:00:03 | 5/1/2016 00:09:02 | 536 | 1 Ave & E 30 St | 40.741444 | -73.975361 | 497 | E 17 St & Broadway | 40.737050 | -73.990093 | 23097 | Subscriber | 1986.0 | 2 |
224 | 5/1/2016 00:00:04 | 5/1/2016 00:03:49 | 361 | Allen St & Hester St | 40.716059 | -73.991908 | 340 | Madison St & Clinton St | 40.712690 | -73.987763 | 23631 | Subscriber | 1977.0 | 1 |
328 | 5/1/2016 00:00:14 | 5/1/2016 00:05:43 | 301 | E 2 St & Avenue B | 40.722174 | -73.983688 | 311 | Norfolk St & Broome St | 40.717227 | -73.988021 | 23049 | Subscriber | 1980.0 | 1 |
1196 | 5/1/2016 00:00:20 | 5/1/2016 00:20:17 | 3141 | 1 Ave & E 68 St | 40.765005 | -73.958185 | 237 | E 11 St & 2 Ave | 40.730473 | -73.986724 | 19019 | Customer | NaN | 0 |
753 | 5/1/2016 00:00:26 | 5/1/2016 00:13:00 | 492 | W 33 St & 7 Ave | 40.750200 | -73.990931 | 228 | E 48 St & 3 Ave | 40.754601 | -73.971879 | 16437 | Subscriber | 1981.0 | 1 |
It looks like there is a good opportunity to break the data down into groups to look for some interesting trends. Some ideas are:
- Group on the gender column and see if there are more male or female riders.
- Do specific stations get used more than others? We can group on the station start or finish id.
- Group the data on the day of the week to see if there is more utilization for a particular day, on average (a short prep sketch for this one follows the list).
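That last idea needs the timestamps as real datetimes rather than the strings read_csv gives us by default. Here's a hedged sketch of just the prep step, assuming the same data.csv file; the actual day-of-week grouping is left as the challenge it deserves to be:

# re-read the file with the timestamp columns parsed as datetimes
df = pd.read_csv('data.csv', parse_dates=['starttime', 'stoptime'])

# a weekday label we could group on later (0 = Monday ... 6 = Sunday)
df['weekday'] = df['starttime'].dt.weekday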
How about a few examples?
If we want to group by just the gender, we pass this key (column name) to the groupby function as the sole argument. This example is the simplest form of grouping, so please check out the docs to get all the options!
groupedGender = df.groupby('gender')
print(groupedGender)
<pandas.core.groupby.DataFrameGroupBy object at 0x1133854d0>
The output shows that groupby returns a pandas DataFrameGroupBy object. Pandas has just made some internal calculations about the new gender groups and is ready to apply some operation on each of these groups.
We can take a look at the available methods with the docstring/tab complete functionality of Rodeo!
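If you're not in Rodeo, here are a couple of quick ways to poke at the object directly (just a sketch):

# the group labels pandas found (0, 1 and 2 here)
print(groupedGender.groups.keys())

# a GroupBy is also iterable: each element is a (label, sub-DataFrame) pair
for label, frame in groupedGender:
    print(label, frame.shape)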
Counts of groups
Getting back to the data, if we use the size method, we can see the total number of entries for each gender group. For reference, here's what the website says for the gender codes: "Gender (Zero=unknown; 1=male; 2=female)".
groupedGender.size()
gender
0 178710
1 783723
2 249847
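A quick note, since size and count are easy to mix up: size gives the number of rows in each group, while count gives the number of non-null values per column in each group (which matters here, because birth year has missing values). A sketch of the difference:

# size: one number per group (rows in the group)
groupedGender.size()

# count: non-null values per column per group - note the lower numbers for 'birth year'
groupedGender.count()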
# look at each group's size as a percentage of all trips
total = df.gender.count()
groupedGender.size() / total * 100
gender
0 14.741644
1 64.648679
2 20.609678
It looks like males make up the majority of Citi Bike riders (~65%). I was pretty surprised to see that male riders outnumbered female riders 3 to 1. I wonder if that's true of commuters in general, or if there's some other factor, like females tending to own their own bikes. A question for another post...
Mean and Std Dev of groups
We can use a single column from the DataFrameGroupBy object and apply some aggregation function on it - how about the mean and standard deviation of the trip durations for all three groups?
groupedGender['tripduration'].mean() / 60.
gender
0 35.923658
1 13.778720
2 16.198230
# Don't have to use the bracket notation
groupedGender.tripduration.std() / 60.
gender
0 193.417686
1 94.884313
2 91.675397
Although males make up the majority of Citi Bike riders, male and female trip durations aren't all that different. Interestingly, riders of unknown gender take trips that are more than twice as long on average. These riders are likely single-use customers (when you purchase a one-time pass at a Citi Bike kiosk, you are not asked for your gender).
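If you'd rather see both statistics side by side, agg will apply a list of functions in a single call (a quick sketch):

# mean and standard deviation of trip duration (in minutes) per gender, in one shot
groupedGender['tripduration'].agg(['mean', 'std']) / 60.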
More summary statistics
So there are some summary statistics for these groups (as an aside, you can use the describe function to get these statistics and more in one call). That's a whole lot of spread around the mean, which probably means there are some outliers in the data (maybe people who kept a bike for days). Let's take a brief look even though it's outside the scope, because I'm sure you were all interested.
df[df.tripduration > 10000].tripduration.count()
5110
Our suspicions are confirmed: over 5,000 rentals ran longer than 10,000 seconds (roughly 2 hours and 45 minutes), even though the "max" is supposed to be 30 minutes (or 45 if you're a Citi Bike member).
Brief aside / public service announcement, it costs up to $1200 to replace a Citi Bike. Don't be an outlier!
A quick (gg)plot
Okay, back on track...the plot, just because we can (using ggpy of course):
from ggplot import *  # the ggpy package imports as ggplot

df_short = df[df.tripduration < 10000].copy()  # drop the extreme outliers
df_short['tripduration'] = df_short['tripduration'] / 60.  # seconds -> minutes
ggplot(df_short, aes(x='tripduration')) + geom_histogram(bins=30) + xlab("Trip Duration (mins)") + ylab("Count")
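If you don't have ggpy handy, pandas' built-in matplotlib plotting gets you essentially the same picture (a sketch):

import matplotlib.pyplot as plt

# same histogram via the Series' built-in plotting
df_short['tripduration'].hist(bins=30)
plt.xlabel("Trip Duration (mins)")
plt.ylabel("Count")
plt.show()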
Grouping by station name
One last example: which five start and end stations are the most popular? We'll group the data on the start and end station names, apply the count function, and sort the values in descending order. Here's the code for that:
groupedStart = df.groupby('start station name')
groupedStart['start station name'].count().sort_values(ascending=False)[:5]
start station name
Pershing Square North 12775
West St & Chambers St 10128
Lafayette St & E 8 St 9246
W 21 St & 6 Ave 9220
E 17 St & Broadway 9036
groupedEnd = df.groupby('end station name')
groupedEnd['end station name'].count().sort_values(ascending=False)[:5]
end station name
Pershing Square North 12511
West St & Chambers St 10189
Lafayette St & E 8 St 9459
E 17 St & Broadway 9273
W 21 St & 6 Ave 9268
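As a side note, value_counts gives you the same ranking without an explicit groupby, since it counts the labels and sorts in descending order for you (a sketch):

# equivalent shortcut: count each station label and keep the top five
print(df['start station name'].value_counts().head(5))
print(df['end station name'].value_counts().head(5))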
Wrapping Up
Hopefully the examples above introduced some basic uses of grouping in pandas, enhanced your analysis toolbox, and whetted your appetite for more! What ideas do you have for further analysis on this dataset? Can you conquer the last idea of looking at the days of the week? Please take a look at the resources linked below for further investigation!
More Resources
- Pandas Docs
- Hadley Wickham - The Split-Apply-Combine Strategy for Data Analysis
- Python for Data Analysis by Wes McKinney
- An excellent pandas series from Greg Reda
- An excellent intro-to-pandas series by Tom Augspurger