What is Exploratory Data Analysis? | Part 1

What is EDA (Exploratory Data Analysis)?

In layman's terms, EDA means going through the dataset, identifying its features, seeing how they relate to each other, and working out how those features will help you predict the target values.

Now let us look at the textbook definition of EDA in data science:

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques to maximize insight into a data set.


Techniques used in EDA 

Let us step back for a moment and think about how we decide whether a mobile phone is good and whether its features will meet our needs.

We check the RAM, memory, camera, display size, and more. For each one, we dig into the specifications and judge whether they are sufficient for our usage.

EDA works the same way: there are many parameters and techniques we can apply to a dataset to see whether it is in good shape or needs further improvement.

Let us see those techniques one by one

Exploring EDA techniques with Wine Dataset

To explain these techniques, I am using the wine dataset from sklearn.
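As a minimal sketch of how the dataset could be loaded, the snippet below uses `load_wine` from `sklearn.datasets` with `as_frame=True` to get a pandas DataFrame. The variable names `wine` and `df` are my own choice, and the article may have prepared its data differently.

```python
import pandas as pd
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects instead of NumPy arrays;
# .frame holds the feature columns plus a "target" column.
wine = load_wine(as_frame=True)
df = wine.frame

# Peek at the first few rows to get a feel for the columns.
print(df.head())
```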


  • shape 

It is always good to check the number of rows and columns in the dataset. shape gives you those details.

From the output, we can see that we have 178 rows and 11 columns.
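A short sketch of the shape check, assuming the data is loaded through `load_wine(as_frame=True)`. Note that the full sklearn frame carries 13 feature columns plus a `target` column, so the column count you see may differ from the one quoted above if a subset of columns is used.

```python
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).frame

# shape is an attribute (not a method) and returns (rows, columns).
n_rows, n_cols = df.shape
print(f"rows: {n_rows}, columns: {n_cols}")
```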


  • info 

info shows the data type of every column and the number of non-null entries in each, which tells us whether any column contains null values.

From the result, we can infer that all the values in our dataset are numerical and that there is not a single null value.
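The check described here might look like the following sketch. `df.info()` prints the summary, and `isna().sum()` is added as an explicit per-column null count; the name `null_counts` is only illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).frame

# Prints column names, non-null counts, and dtypes.
df.info()

# Explicit count of missing values per column.
null_counts = df.isna().sum()
print(null_counts)
```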


  • describe 

describe gives summary statistics for each numerical column: count, mean, standard deviation, min, the quartiles (25%, 50%, 75%, where 50% is the median), and max.

Looking at the values for each column, we can see a large gap between the 75% quartile and the max value for some columns, which suggests that those variables may contain outliers.
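One way to make the "gap between 75% and max" observation concrete is sketched below. The 1.5 × IQR cutoff is a common rule of thumb that I am assuming here; the article itself does not name a specific rule, so treat the counts as a rough signal rather than a definitive outlier test.

```python
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).frame
features = df.drop(columns="target")  # the class label is not a measurement

# Summary statistics; transpose so each feature is a row.
summary = features.describe()
print(summary.T[["75%", "max"]])

# Assumed rule of thumb: values above Q3 + 1.5 * IQR are flagged.
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Number of flagged values per feature, largest first.
outliers_above = (features > upper_fence).sum()
print(outliers_above.sort_values(ascending=False))
```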


  • Handling NA or Missing Values

There are multiple ways to handle null or missing values in a dataset. Let us see how we can handle some common scenarios.

  1. Calculate the missing value ratio for every column and set a threshold; a commonly used threshold is around 60-70%. If a column's missing value ratio exceeds the threshold, we can simply drop that column from the dataset.
  2. For numerical columns, we can replace the NA values with the mean or the median.
  3. Use the mean when the column has no outliers, and the median when it has a lot of outliers, since the median is less affected by extreme values.
  4. For non-numerical columns, we can replace the NA values with the mode.
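The steps above can be sketched as follows. Since the wine dataset has no missing values, this uses a small made-up DataFrame; the column names, the values, and the 0.6 threshold are all hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps (not taken from the wine dataset).
df = pd.DataFrame({
    "alcohol": [13.2, np.nan, 12.8, 14.1, 13.5],
    "ash": [np.nan, np.nan, np.nan, np.nan, 2.4],   # 80% missing
    "color": ["red", "white", None, "red", "red"],
})

# Step 1: drop columns whose missing ratio exceeds the threshold.
THRESHOLD = 0.6  # assumed value, at the low end of the 60-70% range
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > THRESHOLD].index)

# Steps 2-4: impute what is left.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Median is used here; swap in .mean() for columns without outliers.
        df[col] = df[col].fillna(df[col].median())
    else:
        # mode() can return several values; take the first one.
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

Using the median by default is the more cautious choice when you have not yet checked a column for outliers.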

So far we have discussed a few EDA techniques. There are many others, most of which involve graphical representations that let us examine each column in much more detail. Those topics will be covered in the Exploratory Data Analysis Part 2 article.
