What is Exploratory Data Analysis? | Part 1

April 07, 2021

What is EDA (Exploratory Data Analysis)?

In layman's terms, EDA means going through the dataset, identifying its features, seeing how they relate to each other, and working out how those features will help you predict the target values.

Now let us look at the textbook definition of EDA in data science:

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques to maximize insight into a data set.

Techniques used in EDA

Let us step back for a moment and think about how we decide whether a mobile phone is good and whether its features will satisfy our needs.

We check the RAM, memory, camera, display size, and much more. For each of these, we dig deeper into the specifications and judge whether they are sufficient for our usage.

EDA works the same way: there are many parameters and techniques we can apply to a dataset to see whether it is in good shape or whether it needs further improvement.

Let us go through those techniques one by one.

Exploring EDA techniques with Wine Dataset

To explain these techniques, I am using the wine dataset from sklearn.
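
As a minimal sketch of how the dataset might be loaded (the variable names `wine` and `df` are my own choice, and `as_frame=True` assumes scikit-learn 0.23 or newer), the features and target can be combined into one pandas DataFrame:

```python
import pandas as pd
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects instead of NumPy arrays
wine = load_wine(as_frame=True)

# wine.data holds the feature columns, wine.target the class labels
df = wine.data.copy()
df["target"] = wine.target

# Peek at the first few rows
print(df.head())
```

The later snippets reload the features on their own so that each one can be run independently.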

shape

It is always good to check how many rows and columns the dataset has. The shape attribute gives you those details.

For the DataFrame used in this article, shape reports 178 rows and 11 columns.
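
A short sketch of the check, assuming the features are loaded straight from scikit-learn as a DataFrame named `df` (my naming). The column count depends on which columns are kept: the copy shipped with scikit-learn exposes 13 feature columns, so a different figure would reflect a different selection of columns.

```python
from sklearn.datasets import load_wine

# Feature columns only, as a pandas DataFrame
df = load_wine(as_frame=True).data

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape
print(f"rows: {n_rows}, columns: {n_cols}")
```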

info

info shows the data type of every column and the number of non-null values in each, which tells us whether any column contains nulls.

From the info output, we can infer that every column in our dataset is numerical and that there is not a single null value.
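
The sketch below calls `info()` and then repeats the same two checks programmatically, which can be handy when the dataset has too many columns to read by eye. The names `df`, `null_counts` and `all_numeric` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).data

# Prints column names, non-null counts and dtypes
df.info()

# Nulls per column, then whether every column has a numeric dtype
null_counts = df.isnull().sum()
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)

print("total nulls:", null_counts.sum())
print("all numeric:", all_numeric)
```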

describe

describe gives summary statistics for each numerical column: count, mean, standard deviation, min, the quartiles (25%, 50%, 75%), and max. The 50% value is the median.

Looking at the values for each column, we can see a large gap between the 75% quartile and the max value for some columns, which suggests that those variables may contain outliers.
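
One way to turn that observation into a number is sketched below. It uses the common 1.5 × IQR rule (Tukey's fences) to count unusually high values per column; that rule is my assumption here, not something the describe output prescribes, and the names `summary`, `upper_fence` and `high_outliers` are illustrative.

```python
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).data

# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max
summary = df.describe()
print(summary)

# Upper fence per column: Q3 + 1.5 * IQR
q1 = summary.loc["25%"]
q3 = summary.loc["75%"]
upper_fence = q3 + 1.5 * (q3 - q1)

# Count the values above the fence in each column
high_outliers = (df > upper_fence).sum()
print(high_outliers[high_outliers > 0])
```

Columns that show up in the last printout are the ones worth inspecting more closely before choosing how to treat them.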

Handling NA or Missing Values

There are multiple ways to handle null or missing values in a dataset. Let us see how we can handle some common scenarios.

Calculate the missing-value ratio for every column and set a threshold; a commonly used threshold is around 60-70%. If a column's missing-value ratio is above the threshold, we can simply drop that column from the dataset.
For numerical columns, we can replace the NA values with the mean or the median.
The mean is used when the column has no outliers, and the median is used when it has many, because the median is not pulled towards extreme values.
For non-numerical columns, we can replace the NA values with the mode, that is, the most frequent value.
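
The steps above can be sketched as follows. Because the wine dataset has no missing values, this uses a small made-up DataFrame; the column names, the values and the 0.6 threshold are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "alcohol": [13.2, np.nan, 12.8, 14.1, 13.5],
    "price": [10.0, 12.0, np.nan, 11.0, 500.0],  # 500.0 is an outlier
    "colour": ["red", "red", None, "white", "red"],
})

# 1. Missing-value ratio per column; drop columns above the threshold
threshold = 0.6
missing_ratio = df.isnull().mean()
df = df.drop(columns=missing_ratio[missing_ratio > threshold].index)

# 2. Numerical columns: mean without outliers, median with outliers
df["alcohol"] = df["alcohol"].fillna(df["alcohol"].mean())
df["price"] = df["price"].fillna(df["price"].median())

# 3. Non-numerical columns: fill with the most frequent value
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])

print(df)
```

In the price column the mean of the known values would be dragged up by the single 500.0 entry, which is why the median is the safer fill value there.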

So far we have discussed a few EDA techniques. There are many others, mainly graphical ones, which let us examine each column in much more detail. Those topics will be covered in the Exploratory Data Analysis Part 2 article.
