In this note, I use
df
for DataFrame
, s
for Series
.1import pandas as pd # import pandas package
2import numpy as np
1dataquest_aio = '<https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/>'
2dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
3df = pd.read_csv(dataset_url) # read the data set
4df.head()
Group
df
by column Region
and then selct the column Western Europe
1df.groupby('Region').get_group('Western Europe') # returns a df
Select just the
Happiness Score
column and then find the mean
1df.groupby('Region')['Happiness Score'].mean()
2# other methods: size, max, min, count
1Region
2Australia and New Zealand 7.285000
3Central and Eastern Europe 5.332931
4Eastern Asia 5.626167
5Latin America and Caribbean 6.144682
6Middle East and Northern Africa 5.406900
7North America 7.273000
8Southeastern Asia 5.317444
9Southern Asia 4.580857
10Sub-Saharan Africa 4.202800
11Western Europe 6.689619
12Name: Happiness Score, dtype: float64
Apply multiple/custom functions,
1def max_min(group):
2 return group.max() - group.min()
3
4df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()
If you wanna apply different functions on different columns,
1df.groupby(['Country', 'Region']).agg({
2 'Happiness Rank': max_min,
3 'Happiness Score': ['min', 'max'],
4 'Standard Error': 'count'
5}).head(3)
Or using
apply
and lambda
function,1orders.groupby('shoes').price.apply(lambda x: np.min(x, 25)).reset_index()
1df.pivot_table(values=['GDP', 'City'], index='Region') # returns df
Apply some functions,
1df.pivot_table(['GDP', 'City'], 'Region', aggfunc=[np.mean, np.max], margins=True)
2# margins shows the "All" row
Make values in one columns be columns in a new "pivot" table (ref)
1df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
2 'two'],
3 'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
4 'baz': [1, 2, 3, 4, 5, 6],
5 'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
6
7pivot_1 = df.pivot(index='foo', columns='bar', values='baz')
8pivot_2 = df.pivot(index='foo', columns='bar')['baz']
9pivot_3 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
10
11display_side_by_side(df, pivot_1, pivot_2, pivot_3)
For one who wanna know
display_side_by_side
, check Jupyter notebook.Contrary to
pivot
, we now want to transform several columns into values of a single column (ref).1df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
2 'B': {0: 1, 1: 3, 2: 5},
3 'C': {0: 2, 1: 4, 2: 6}})
4
5df1 = pd.melt(df, id_vars=['A'], value_vars=['B'])
6df2 = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
7
8display_side_by_side(df, df1, df2)
- Data Cleaning and Analysis on Dataquest.
- Transforming data with pandas on Dataquest.
- pandas official -- Group By: split-apply-combine