What is Exploratory Data Analysis? | Part 1
What is EDA (Exploratory Data Analysis)
In layman's terms, EDA means going through the dataset, identifying its features, understanding how they relate to each other, and working out how those features will help you predict the target values.
Now let us look at the textbook definition of EDA in data science:
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques to maximize insight into a data set.
Techniques used in EDA
Let us step back for a moment and think about how we decide whether a mobile phone is good and whether its features will meet our needs.
We check the RAM, memory, camera, display size, and many more. For each one, we dive deeper into the specifications and judge whether they are sufficient for our usage.
Similarly, EDA gives us many parameters and techniques to apply to a dataset to see whether it is in good shape or needs further improvement.
Let us see those techniques one by one.
Exploring EDA techniques with Wine Dataset
To explain these techniques, I am using the wine dataset from sklearn.
- shape
It is always good to check the number of rows and columns in our dataset; shape gives you those details.
From the output above, we can see that we have 178 rows and 13 feature columns (14 if the target column is included).
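As a minimal sketch of this step, the snippet below loads the wine data with `load_wine` from `sklearn.datasets` and reads its shape. The variable names `wine` and `df` are my own choices, and only the feature columns are loaded here (no target column). As shipped, the loader returns 13 feature columns, so a different column count would mean columns were added or dropped beforehand.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Load the feature matrix into a DataFrame; `df` is an illustrative name.
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# shape is an attribute (not a method) that returns (rows, columns).
n_rows, n_cols = df.shape
print(f"rows={n_rows}, columns={n_cols}")
```

If the target is wanted as well, it can be appended with `df["target"] = wine.target`, which adds one more column.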
- info
info shows the data type of every column, along with the non-null count for each column, which tells us whether any null values are present.
From the above result, we can infer that all values in our dataset are numerical and there is not a single null value.
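A short sketch of the same check, assuming the feature columns are loaded into a DataFrame named `df` (my own naming). `info()` only prints its report, so the snippet also computes the null counts and a numeric-type check as values that can be inspected directly.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# info() prints column names, non-null counts and dtypes; it returns None.
df.info()

# The same facts as values: nulls per column, and whether every column is numeric.
null_counts = df.isnull().sum()
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
print("total nulls:", int(null_counts.sum()))
print("all numeric:", all_numeric)
```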
- describe
describe gives summary statistics for each column: count, mean, standard deviation, min, the quartiles (25%, 50%, and 75%, where 50% is the median), and max.
Looking at the values for each column, we can see a large gap between the 75% quartile and the max value for some columns, which suggests that those variables may contain outliers.
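One way to make the "gap between 75% and max" observation concrete is the common 1.5 × IQR rule of thumb. This rule is not from the original walkthrough; it is swapped in here as one possible way to flag unusually high values, and the names `upper_fence` and `n_high` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# describe() returns a DataFrame indexed by statistic name.
summary = df.describe()
print(summary.loc[["75%", "max"]].T)

# Assumed rule of thumb: values above Q3 + 1.5 * IQR count as high outliers.
q1 = summary.loc["25%"]
q3 = summary.loc["75%"]
upper_fence = q3 + 1.5 * (q3 - q1)

# Compare each column against its own fence and count the hits per column.
n_high = (df > upper_fence).sum()
print(n_high[n_high > 0])
```

Columns that appear in the last printout are the ones where the max sits well beyond the bulk of the data; whether those points are errors or genuine extremes still needs a closer look.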
- Handling NA or Missing Values
There are multiple ways to handle NULL or missing values in a dataset. Let us see how we can handle some common scenarios.
- Calculate the missing value ratio for every column and set a threshold; a commonly used threshold is around 60-70%. If a column's missing value ratio exceeds the threshold, we can simply drop that column from the dataset.
- For numerical columns, we can replace the NA values with the mean or median.
- Use the mean when the column has no outliers and the median when it has many outliers, since the median is less affected by extreme values.
- For non-numerical columns, we can replace the NA values with the mode (the most frequent value).
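The steps above can be sketched as follows. The wine dataset has no missing values, so this uses a small toy DataFrame invented purely for illustration; the column names, the values, and the `THRESHOLD` of 0.6 are all assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up only to demonstrate the steps.
df = pd.DataFrame({
    "alcohol": [13.2, np.nan, 12.8, 14.1, 13.5],   # no extreme values
    "ash": [2.1, 2.4, np.nan, 2.3, 9.9],           # 9.9 acts as an outlier
    "region": ["north", "south", None, "south", "east"],
    "notes": [None, None, None, None, "ok"],       # mostly missing
})

THRESHOLD = 0.6  # assumed cut-off, adjust per project

# 1. Missing value ratio per column; drop the columns above the threshold.
missing_ratio = df.isnull().mean()
df = df.drop(columns=missing_ratio[missing_ratio > THRESHOLD].index)

# 2. Numerical columns: mean when there are no outliers, median otherwise.
df["alcohol"] = df["alcohol"].fillna(df["alcohol"].mean())
df["ash"] = df["ash"].fillna(df["ash"].median())

# 3. Non-numerical columns: fill with the mode (most frequent value).
df["region"] = df["region"].fillna(df["region"].mode()[0])

print(df)
```

Here `notes` is dropped because 80% of its values are missing, while the remaining gaps are filled column by column. In a real dataset, the choice between mean and median would follow from an outlier check rather than being hard-coded per column.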