Grouping and aggregation are some of the most frequently used operations in data analysis, especially while performing exploratory data analysis (EDA), where comparing summary statistics across groups of data is common.
As an example, in the retail sales data that we are working with, you may want to compare the average sales of various regions or compare the total profits of two customer segments.
Grouping analysis can be thought of as having three parts, namely:
Let us now learn how to perform grouping over the Pandas dataframes using the same dataset as before. Download the notebook provided below for this segment.
The groupby()
function returns a Pandas object, which can be used further to perform the desired aggregate functions.
Let’s take a look at another aggregate function being applied on the groupby object created above.
One point to note here is that if you use the groupby
command on an index, then you will not face any error while executing the grouping and aggregation commands together. However, when grouping on columns, you should first store the dataframe and then run the aggregate functions on the new dataframe.
Using grouping, you can easily summarise the data stored in different columns. Let’s take a look at another method to summarise the dataframes, namely, Pivots.
A pivot table is a very useful tool to represent a dataframe in a structured and simplified manner. It acts as an alternative to the groupby() function in Pandas. Pivot tables provide Excel-like functionalities to create aggregate tables.
Let’s watch the following video and learn how to create pivot tables in Python using Pandas.
You can use the following command to create pivot tables in Pandas:
df.pivot(columns='grouping_variable_col', values='value_to_aggregate', index='grouping_variable_row')
Let’s take a look at another example of pivoting.
Using the pivot_table()
function, you can also specify the aggregate function that you would want Pandas to execute over the columns provided. It could be the same or different for each column in the dataframe.
df.pivot_table(values, index, aggfunc={'value_1': np.mean,'value_2': [min, max, np.mean]})
The function above, when substituted with proper values, will result in a mean value of value_1 and three values (minimum, maximum and a mean of value_2) for each row.
In the next segment, you will learn how to deal with multiple dataframes.