Pandas Profiling - One Line of Code for Your EDA

Posted by OMEGA MARKOS on April 28, 2020

This blog is also available on Analytics Vidhya:
https://medium.com/analytics-vidhya/pandas-profiling-one-line-code-for-your-eda-ee35d2020ea1?source=friends_link&sk=a8a00511ce597ad0d35e616574baf5b2

Exploratory data analysis (EDA) is the most important part of the data science pipeline. It helps us fully understand our data, discover patterns and anomalies, and test underlying assumptions. It also helps us define the insights we want to get from our data. There are lots of ways to perform EDA, and they all involve repeating the same steps for every variable involved. This task becomes more complicated and time-consuming, especially when we are dealing with a high-dimensional dataset.

The simplest and most commonly used method is the pandas describe() function, shown in the sketch below. It gives us a few important statistics, but we still need to perform other tedious tasks to complete our EDA.
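For comparison, that baseline might look like the short sketch below; the CSV file name is just a placeholder for whatever data you are working with:

import pandas as pd

df = pd.read_csv('your_data.csv')  # placeholder path - any DataFrame works here
df.describe()                      # count, mean, std, min, quartiles and max for each numeric column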
A more powerful alternative is pandas profiling, an open-source Python package that generates an EDA profile report with descriptive statistics, quantile statistics, the most frequent values, histograms, correlations, and missing values. All of this is done with one line of code.
To get started, we first need to install pandas-profiling:
pip install pandas-profiling
Or
conda install -c conda-forge pandas-profiling
The next step is to import pandas profiling and write the one-line code. For this blog, I will be using the Forest Fires data set from the UCI Machine Learning Repository.
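A minimal sketch of that step is shown below. The CSV file name is an assumption (the UCI file is commonly distributed as forestfires.csv), and depending on your pandas-profiling version the report can also be created with df.profile_report():

import pandas as pd
import pandas_profiling

df = pd.read_csv('forestfires.csv')           # Forest Fires data downloaded from the UCI repository

profile = pandas_profiling.ProfileReport(df)  # the one-line EDA report
profile                                       # in a Jupyter notebook this renders the report inline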

This code displays a one-page report. To keep things simple, I will walk through the four major sections of the report: Overview, Variables, Correlations, and Sample.
1. Overview
Dataset info: This section shows general information about our data, such as the number of variables/columns, the number of missing values, and the number of observations. It is very similar to the pandas .info() function.
Variable types: In addition to the common numeric and categorical types, other variable types such as Boolean and date are also recognized.

Warnings:
This shows valuable information such as zero values, duplicates, and rejected variables. To show more in the Warnings section, I will rerun the report with the correlation threshold lowered to 0.4 (the default is 0.9), as in the sketch below.
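A sketch of that rerun, assuming a 1.x release of pandas-profiling, which exposes correlation_threshold as a keyword argument (newer versions move correlation settings into a configuration object):

profile = pandas_profiling.ProfileReport(df, correlation_threshold=0.4)  # lower the rejection threshold
profile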

Now the Warnings section shows the correlation coefficients and the variables rejected due to high correlation.

2. Variables
The following statistics are generated for each column:
Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.
Histogram: The frequency distribution for continuous variables and the counts for the categorical variables.
Common Values: The most common values and their frequencies.
Extreme Values: The five minimum and maximum values and their frequencies.


Highly correlated variables are rejected and excluded from the report. In our example, 'DC' is rejected and ignored because of its high correlation with 'DMC'. But you can include such variables by passing correlation_overrides a list of the rejected columns you want to show, as in the sketch below.
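A sketch of that override, again assuming a 1.x release where correlation_overrides is a keyword argument:

# keep 'DC' in the report even though it is highly correlated with 'DMC'
profile = pandas_profiling.ProfileReport(df,
                                         correlation_threshold=0.4,
                                         correlation_overrides=['DC'])
profile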

3. Correlations
This section shows heatmaps of the Pearson, Spearman, and Kendall correlation matrices, highlighting the highly correlated variables.

4. Sample
This is just like the pandas head() function: it shows the first five samples of the data.
Finally, you can export the report in HTML format by adding one more line to your code, shown in the sketch below.
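A minimal sketch, assuming profile is the ProfileReport object created above; the output file name is arbitrary:

profile.to_file('forest_fires_report.html')  # writes the full report as a standalone HTML file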
I hope this helps to speed up your EDA.
Thanks for reading!
References:
https://pypi.org/project/pandas-profiling/#types
https://medium.com/@InDataLabs/why-start-a-data-science-project-with-exploratory-data-analysis-f90c0efcbe49