# Group Data

In pandas **group by** involves one or more of the following steps:

**splitting**the data into groups based on some criteria**applying**a function to each group independently (i.e aggregation, transformation, filtration)**combining**the results into a data structure (based on apply operation type)

## Group by one column, one column aggregated, one aggregation

```
# returns series because we aggregate one column
df.groupby('col_grp_1')['col_agg_1']\
.min()
```

## Group by two columns, one column aggregated, one aggregation

```
# returns series because we aggregate one column
df.groupby(['col_grp_1', 'col_grp_2'])['col_agg_1']\
.min()
```

## Group by two columns, multiple columns aggregated, one aggregation

```
# returns data frame because we aggregate multiple columns, one aggregation
df.groupby(['col_grp_1', 'col_grp_2'])['col_agg_1', 'col_agg_2']\
.min()
```

## Group by one column, one column aggregated, multiple aggregations

```
# returns data frame because of multiple aggregations
df.groupby('col_grp_1')['col_agg_1'].agg(['min', 'max', 'mean'])
```

## Group by one column, multiple columns aggregated, multiple aggregations

```
# returns data frame with hierarchical index in columns
df1 = df.groupby('col_grp_1')['col_agg_1', 'col_agg_2']\
.agg(['min', 'max', 'mean', 'count'])
# flatten the hierarchical index in columns
df1.columns = ['_'.join(col).strip() for col in df1.columns.values]
```

## Group by named aggregation: group by one column, multiple columns aggregated, multiple aggregations

```
# from pandas version >= 0.25.0
df.groupby('col_grp_1')\
.agg(new_col1_name=('col_agg_1', 'mean'),
new_col2_name=('col_agg_2', 'count'))
```

## Group by one column, one column aggregated, one aggregation

```
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns series because we aggregate one column
df.groupby('kind')['weight'].min()
kind
cat 7.9
dog 5.5
Name: weight, dtype: float64
```

## Group by two columns, one column aggregated, one aggregation

```
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns series because we aggregate one column
df.groupby(['kind', 'breed'])['weight'].min()
kind breed
cat persian 9.9
siamese 7.9
dog chihuahua 5.5
labrador 198.0
Name: weight, dtype: float64
```

## Group by two columns, multiple columns aggregated, one aggregation

```
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns data frame because we aggregate multiple columns, one aggregation
df.groupby(['kind', 'breed'])['height', 'weight'].min()
height weight
kind breed
cat persian 9.5 9.9
siamese 9.1 7.9
dog chihuahua 3.5 5.5
labrador 34.0 198.0
```

## Group by one column, one column aggregated, multiple aggregations

```
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
df.groupby('kind')['weight'].agg(['min', 'max', 'mean'])
# returns data frame because of multiple aggregations
min max mean
kind
cat 7.9 9.9 8.900000
dog 5.5 198.0 70.333333
```

## Group by one column, multiple columns aggregated, multiple aggregations

```
# giving DataFrame df
>>> df
kind breed height weight
0 cat siamese 9.1 7.9
1 dog chihuahua 6.0 7.5
2 cat persian 9.5 9.9
3 dog labrador 34.0 198.0
4 dog chihuahua 3.5 5.5
# returns data frame with hierarchical index in columns
df1 = df.groupby('kind')['weight', 'height']\
.agg(['min', 'max', 'count'])
# flatten the hierarchical index in columns
df1.columns = ['_'.join(col).strip() for col in df1.columns.values]
print(df1)
weight_min weight_max weight_count height_min height_max height_count
kind
cat 7.9 9.9 2 9.1 9.5 2
dog 5.5 198.0 3 3.5 34.0 3
```

## Filter (within a GroupBy), filters the rows on a property of the group they belong to

```
# will return a subset DataFrame of rows belonging to the groups satisfying condition
df1 = df.groupby('col_grp').filter(lambda x : x['col'].mean() > 20)
```

## Transform (within a GroupBy), calculates a new value for each row based on a property of the group

```
df['grp_mean'] = df.groupby('col_grp')['col'].transform(lambda x : x.mean())
```

## Center the data by subtracting the group-wise mean

```
# will return a new DataFrame with same size as the original
df_new = df.groupby('col_grp').transform(lambda x: x - x.mean())
```

## Apply (within a GroupBy), applies an arbitrary function to the group results

```
def func_normalize(x):
# x is a DataFrame of group values
x['col1'] = x['col1'] / x['col2'].sum()
return x
# normalizes the first column by the sum of the second:
df = df.groupby('col_grp').apply(func_normalize)
```

## Get top n rows for each group of columns in a sorted dataframe

```
# sort DataFrame first by some columns
df = df.sort_values(['col_x','col_y'])
df_top_5 = df.groupby('col_x').head(5)
```

## Get bottom n rows for each group of columns in a sorted dataframe

```
# sort DataFrame first by some columns
df = df.sort_values(['col_x','col_y'])
df_bottom_5 = df.groupby('col_x').tail(5)
```