Enemies of Data Scientists! Part 1

Hello peeps! Welcome to my new blog!

Ever wondered if data scientists have some enemies? 

Well, yes! They do have enemies (not people, though). In fact, when a dataset is handed to a data scientist or analyst, the first thing to do is look for these enemies.

So which enemies am I talking about, exactly? Keep reading.




Any dataset you are given may contain two major problems (which I am calling enemies):

1. Missing Values

2. Outliers

One cannot move on to analyzing the data without treating these enemies. But what exactly are they? How do we treat them? I answer these questions in this blog, starting with missing values in this part.


Missing Values

    The meaning lies in the name itself. When a variable has no recorded value for some observation, we call that a missing value. In a dataset, missing values generally show up as a blank cell, NA or null.

How to find missing values in Python?


Figure: Data with missing values

The figure above shows a very small dataset with two predictors, x1 and x2, and one response variable, y.
The x1 variable has two blank cells, which are one type of missing value, and the x2 variable contains the entry 'NA', which is another type of missing value.


Figure: Code to find missing values in Python

How to treat these missing values?

    Missing values have to be treated before we can move on to the next step. There are several ways of treating them.
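
Since the code in the figure may not be visible, here is a minimal sketch of one way to do it with pandas. The tiny dataset is made up to resemble the one described above (blanks in x1, 'NA' in x2), and it assumes the data is read with pandas defaults, where blank cells and the text 'NA' are both treated as missing.

```python
import io

import pandas as pd

# Hypothetical data resembling the figure: two blanks in x1, one 'NA' in x2
csv_text = """x1,x2,y
1.0,5,10
,6,12
3.0,NA,14
,8,16
5.0,9,18
"""

# With default settings, read_csv turns blank cells and 'NA' into NaN
df = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

The index shown for each row in the second printout is the location of a missing value, which is handy when removing rows later.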

1. By removing the missing values from the main data

        When the number of missing values is very small, we can simply remove the rows that contain them from the data.

Figure: Code to remove missing values

Using the above code, only the row that contains the missing value is deleted from the data.

'index_number' is the location (row index) of the missing data.
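
As a rough sketch of what such code can look like in pandas (the DataFrame here is made up, and 'index_number' is assumed to be a row label found beforehand):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in x1 at row index 1
df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [5.0, 6.0, 7.0, 8.0],
    "y": [10, 12, 14, 16],
})

# Option 1: drop one specific row by its index label
index_number = 1
dropped_by_index = df.drop(index=index_number)

# Option 2: drop every row that has at least one missing value
dropped_all = df.dropna()

print(dropped_by_index)
print(dropped_all)
```

Both calls return a new DataFrame and leave the original data untouched, so the result has to be assigned to a variable.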

2. By replacing the missing values

        Missing values can be replaced with a suitable value and then treated like any other data.

a. If the data is continuous

        There are two common ways of treating missing values in continuous data: replacing them with the mean or with the median of the variable. But the question is when to use the mean and when to use the median.

Replacing with mean

  • When the variation within the variable is very small.
  • When the variable follows a normal distribution.


Figure: Data with very small variation 

Replacing with median

  • When the variation within the variable is large.
  • When the data is skewed (either right-skewed or left-skewed).

Figure courtesy: https://www.emathzone.com/tutorials/basic-statistics/skewness.html
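
The mean-versus-median rule can be turned into a small helper. This is only a sketch: the function name choose_fill_value and the skewness cut-off of 1.0 are my own assumptions for illustration, not fixed rules.

```python
import pandas as pd

# Hypothetical examples: one roughly symmetric variable, one right-skewed
symmetric = pd.Series([10, 12, 14, 16, 18])
skewed = pd.Series([10, 11, 12, 13, 100])


def choose_fill_value(series, skew_threshold=1.0):
    """Return the median for strongly skewed data, otherwise the mean.

    skew_threshold is an assumed cut-off chosen for illustration.
    """
    if abs(series.skew()) > skew_threshold:
        return series.median()
    return series.mean()


print(choose_fill_value(symmetric))  # mean is used
print(choose_fill_value(skewed))     # median is used
```

In the skewed example the single large value pulls the mean far above the bulk of the data, which is why the median is the safer replacement there.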

b. If the data is categorical

    When the data is categorical, replace the missing values with the mode of the variable. That is, each missing value is replaced with the most frequent category.


Figure: Categorical variable with a missing value

In the above figure, there are 4 'male' entries and 1 'female' entry.
So the missing value will be replaced by 'male', as it is the most frequent category.

Code for filling missing data in Python


Figure: Code to fill missing values
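
In case the figure does not load, here is a minimal pandas sketch of all three replacements. The column names and values are made up; the gender column mirrors the example above with 4 'male' entries, 1 'female' entry and one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'age' has small variation, 'income' is skewed,
# 'gender' is categorical
df = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 24.0, 23.0, 26.0],
    "income": [30.0, 32.0, 31.0, np.nan, 33.0, 250.0],
    "gender": ["male", "male", "female", "male", None, "male"],
})

# Continuous, small variation: fill with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Continuous, skewed: fill with the median
df["income"] = df["income"].fillna(df["income"].median())

# Categorical: fill with the mode (mode() returns a Series, so take the first entry)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)
```

Note that mean(), median() and mode() all ignore missing values by default, so the replacement is computed from the observed data only.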


3. By deleting the column: if more than 50% of a column's values are missing, simply drop that column from the analysis


Figure: Code to delete a column
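
A possible pandas version of this step is sketched below. The data is made up, and the 0.5 cut-off simply encodes the 50% rule mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x1 is 75% missing, x2 is 25% missing
df = pd.DataFrame({
    "x1": [1.0, np.nan, np.nan, np.nan],
    "x2": [5.0, 6.0, np.nan, 8.0],
    "y": [10, 12, 14, 16],
})

# Fraction of missing values in each column
missing_fraction = df.isnull().mean()

# Columns where more than half of the values are missing
columns_to_drop = missing_fraction[missing_fraction > 0.5].index

# Drop those columns; the original DataFrame is left unchanged
df_reduced = df.drop(columns=columns_to_drop)

print(df_reduced.columns.tolist())
```

Here only x1 is removed, while x2 stays and can be handled with one of the replacement methods.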




This brings us to the end of this blog. In the next part, I will discuss how to treat outliers.
Hope this blog was helpful!
Remember: "Every learner was once a beginner"
Happy Learning!
- Tejaswini
