Data Overview

Anh-Thi Dinh
In this note, I use df for a DataFrame and s for a Series.

Libraries

import pandas as pd # import pandas package
import numpy as np
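Throughout this note, a small toy DataFrame is handy for trying the snippets; the column names and data below are made up for illustration:

# a hypothetical toy DataFrame (made-up data)
df = pd.DataFrame({
    'col': ['a', 'b', 'b', np.nan],
    'value': [1.0, 2.5, 2.5, np.nan],
})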

Import and have a look

df = pd.read_csv('filename.csv', na_values=['none']) # "none" is treated as missing data
df.head() # first 5 rows
df.tail() # last 5 rows
df.head(10) # first 10 rows
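A quick sketch of what na_values does, using an inline CSV made up for illustration:

import io

csv = io.StringIO('name,score\nalice,none\nbob,3')
pd.read_csv(csv, na_values=['none'])['score'] # 'none' is read as NaN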

Get general info

df.info() # dtypes, non-null counts, memory usage
df.describe() # numerical features
df.describe(include=['O']) # categorical features
df.describe(include='all') # all types

df.shape # dataframe's shape
df.dtypes # type of each column

df.dtypes.value_counts() # count the number of columns of each dtype (get_dtype_counts() was removed in pandas 1.0)
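For instance, on a hypothetical frame with two float columns and one object column:

tmp = pd.DataFrame({'x': [1.0], 'y': [2.0], 'z': ['a']}) # made-up frame
tmp.dtypes.value_counts() # float64: 2, object: 1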
Check the distribution of values using KDE (Kernel Density Estimation):

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 5))
df['value'].plot.kde()

Get columns' info

# LIST OF COLUMNS
df.columns
len(df.columns) # number of columns

# UNIQUE VALUES IN A COLUMN
df['col'].unique()
df['col'].unique().size # number of unique values (nan included if present)
df['col'].nunique() # number of unique values (nan excluded by default)
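One subtlety: unique() keeps nan while nunique() drops it by default, so the two counts can differ; a minimal check on a toy series:

s = pd.Series([1, 2, 2, np.nan]) # toy series
s.unique().size # 3 (nan included)
s.nunique() # 2 (nan excluded by default)
s.nunique(dropna=False) # 3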

Counting

# count the number of elements in each class of df
df.Classes.value_counts() # counts for each class, e.g. 0 and 1

# count the elements of each unique value in a column/series
df['col'].value_counts()
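Note that value_counts() ignores nans unless asked; a minimal sketch:

s = pd.Series(['a', 'b', 'b', np.nan]) # toy series
s.value_counts() # b: 2, a: 1
s.value_counts(dropna=False) # b: 2, a: 1, NaN: 1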

Missing values

👉 Check section "Deal with missing values" in Data Processing & Cleaning.
# total number of nans in df
df.isnull().sum().sum()

# number of nans in each column (columns with 0 nans included)
df.isnull().sum()

# number of non-nan values in each column
df.count()

# number of non-nan values in each row
df.count(axis=1)
# columns containing nulls (any nan)
null_columns = df.columns[df.isna().any()].tolist()

# how many nans in each of them?
df[null_columns].isnull().sum()
# number of rows where ALL values are nan
df.isna().all(axis=1).sum()

# number of columns where ALL values are nan
df.isna().all(axis=0).sum()

# indexes of rows where ALL values are nan
df.index[df.isna().all(axis=1)].to_list()
# number of nans in each column, sorted descending
df.isnull().sum().sort_values(ascending=False)

# percentage of null values in each column
(df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
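Since isnull() returns booleans, the same percentages can be written more compactly with mean() (an equivalent form, not from the original note):

# mean of a boolean column = fraction of True values
(df.isnull().mean() * 100).sort_values(ascending=False)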
# Visualize the locations of missing values
import seaborn as sns
df = df.set_index('YEAR') # y-axis is YEAR
sns.heatmap(df.isnull(), cbar=False) # x-axis is the columns' names

# Plot the percentage of nans w.r.t. each column (feature)
df_tmp = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).to_frame(name='percentage')
df_tmp.reset_index().plot(kind='bar', x='index', y='percentage', figsize=(20, 5))
plt.xlabel('features', fontsize=14)
plt.ylabel('% of nans', fontsize=14)
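To see these checks in action, a minimal self-contained sketch on made-up data:

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan] * 3}) # toy data

tmp.isnull().sum().sum() # 4 nans in total
tmp.isnull().sum() # a: 1, b: 3
tmp.isna().all(axis=0).sum() # 1 (column 'b' is all-nan)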

Duplicates

👉 Check section "Drop duplicates" in Data Processing & Cleaning.
# Check if there are duplicated values
df['col'].duplicated().any() # returns True/False

# How many duplicates? (first occurrences are not counted)
df['col'].duplicated().sum()

# How many duplicated values in total? (all occurrences counted)
df['col'].duplicated(keep=False).sum()

# List all duplicated values (slow on large dataframes!)
pd.concat(g for _, g in df.groupby('col') if len(g) > 1)
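A usually faster alternative to the groupby approach (my suggestion, not from the original note) is a boolean mask:

# rows whose 'col' value appears more than once, grouped together
df[df['col'].duplicated(keep=False)].sort_values('col')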