What are the different ways to handle missing values?

written 2.1 years ago by

binitamayekar ★ 6.4k

Missing Values

Missing values are values that are not stored or not present in the given database.
A particular element can be absent for various reasons, such as corrupt data, a failure to load the information, or incomplete extraction.
In the database, missing values appear as blank, null, or NaN entries.
Handling such missing values is one of the big challenges in preparing data.
To ease the process, it is important to analyze each column with missing values carefully so that the reasons behind the missing values can be found.
This helps in developing an appropriate strategy for handling them.
In general, various strategies are used to handle missing values.
Some of them are as follows:
- Deleting Rows or Columns with Missing Values
- Replacing Missing Values with Mean/Median/Mode
- Imputation Method for Categorical Columns
- Predicting the Missing Values
- Using Algorithms Which Support Missing Values
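
Before choosing a strategy, it helps to measure how much of each column is actually missing. The following is a minimal sketch in plain Python, assuming records are stored as dictionaries and that None, NaN, and blank strings all count as missing. The helper names and the sample records are hypothetical and only for illustration.

```python
import math

def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def missing_report(rows):
    """Return {column: fraction of rows in which the column is missing}."""
    return {
        col: sum(is_missing(row[col]) for row in rows) / len(rows)
        for col in rows[0]
    }

# Hypothetical sample records, made up for illustration.
rows = [
    {"age": 25, "city": "Pune", "income": 40000.0},
    {"age": None, "city": "", "income": 52000.0},
    {"age": 31, "city": "Mumbai", "income": float("nan")},
    {"age": 47, "city": "Delhi", "income": 61000.0},
]

report = missing_report(rows)
print(report)
```

A per-column report like this makes it easier to decide which of the strategies above fits each column.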

Deleting Rows or Columns with Missing Values

Missing values can be handled by deleting the rows or columns that contain null values.
If more than half of the rows in a column are null, the entire column can be dropped.
Rows that have a null value in one or more columns can also be dropped.
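
The deletion rule described above can be sketched as follows. This is a plain-Python illustration, assuming None marks a missing value. The function names, the 0.5 default threshold (taken from the "more than half" rule), and the sample records are assumptions for this example.

```python
def drop_sparse_columns(rows, max_missing=0.5):
    """Drop every column in which more than `max_missing` of the values are None."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / n <= max_missing
    ]
    return [{col: row[col] for col in keep} for row in rows]

def drop_incomplete_rows(rows):
    """Drop every row that still contains at least one None."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical sample records: "fax" is mostly missing, "age" is missing once.
rows = [
    {"age": 25, "income": 40000, "fax": None},
    {"age": None, "income": 52000, "fax": None},
    {"age": 31, "income": 45000, "fax": "022-555"},
    {"age": 47, "income": 61000, "fax": None},
]

# Columns are dropped first so that a sparse column does not wipe out every row.
cleaned = drop_incomplete_rows(drop_sparse_columns(rows))
print(cleaned)
```

Dropping sparse columns before incomplete rows is a design choice in this sketch. Done in the opposite order, the mostly empty "fax" column would remove three of the four rows.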

Advantages -

A model trained after complete removal of the data with missing values can be robust and more accurate.
Deleting a row or column that carries no specific information is harmless because it has no weightage in the model.

Disadvantages -

Useful information may be lost along with the missing values.
It is a poor choice when the percentage of missing values is high, say 30%, compared with the complete database, because too much data would be discarded.

Replacing Missing Values with Mean/Median/Mode

In columns that hold numeric continuous values, the missing values can be replaced with the mean, median, or mode of the remaining values in the column.
This method can prevent the loss of data compared to the previous method.
Replacing missing values with approximations such as the mean, median, or mode is a statistical approach to handling them.
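
As a rough illustration of this replacement, the sketch below uses Python's standard `statistics` module and assumes None marks a missing value. The function name `impute_numeric` and the sample ages are hypothetical.

```python
import statistics

def impute_numeric(values, strategy="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    summarize = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = summarize(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries and one outlier (100).
ages = [20, None, 30, 30, None, 100]

print(impute_numeric(ages, "mean"))    # fills with 45
print(impute_numeric(ages, "median"))  # fills with 30
print(impute_numeric(ages, "mode"))    # fills with 30
```

In this sample, the single outlier pulls the mean up to 45 while the median and mode stay at 30. This is one reason the median is often preferred for skewed columns.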

Advantages -

It prevents the large data loss that removing complete rows and columns may cause.
It is a very useful approach for small-sized databases.

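Disadvantages -

As described here, it applies only to columns with numeric continuous values.
It ignores the relationship between the column being filled and the other columns.

Imputation Method for Categorical Columns

When the missing values are in a categorical column, whether string or numerical, they can be replaced with the most frequent category.
If the number of missing values is very large, they can be replaced with a new category instead.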
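
A minimal sketch of both options for a categorical column follows, assuming None marks a missing value. The text does not say what counts as "very large", so the 0.5 threshold is an assumption, as are the "Unknown" label, the function name, and the sample data.

```python
from collections import Counter

def impute_categorical(values, max_missing=0.5, new_category="Unknown"):
    """Fill None with the most frequent category, or with a new category
    when more than `max_missing` of the values are missing."""
    observed = [v for v in values if v is not None]
    missing_share = 1 - len(observed) / len(values)
    if missing_share > max_missing:
        fill = new_category  # too many gaps: mark them explicitly
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

# Hypothetical columns: one with a single gap, one that is mostly missing.
colors = ["red", "blue", None, "red"]
payments = [None, "cash", None, None]

print(impute_categorical(colors))
print(impute_categorical(payments))
```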

Advantages -

It prevents the large data loss that removing complete rows and columns may cause.
It is a very useful approach for small-sized databases.
Adding a unique category prevents the loss of data.

Disadvantages -

It is used only for categorical attributes.
Adding new features to the model during encoding may result in poor performance.
It adds less variance.

Predicting the Missing Values

In this method, other features that have no nulls are used to predict the missing values.
A regression model can be used when the column with missing values is continuous, and a classification model when it is categorical.
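
For a continuous column, the idea can be sketched with a one-feature least-squares line fitted on the complete rows. This is only an illustration under simplifying assumptions (a single predictor, None as the missing marker). The function names and the sample records are hypothetical, and a real model would usually use several features.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_by_regression(rows, target, feature):
    """Fit on rows where `target` is present, then predict it where it is None."""
    complete = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in complete],
        [r[target] for r in complete],
    )
    return [
        {**r, target: slope * r[feature] + intercept} if r[target] is None else r
        for r in rows
    ]

# Hypothetical records: salary is missing for the third row.
rows = [
    {"experience": 1, "salary": 30.0},
    {"experience": 2, "salary": 35.0},
    {"experience": 3, "salary": None},
    {"experience": 4, "salary": 45.0},
]

filled = impute_by_regression(rows, target="salary", feature="experience")
print(filled[2])
```

In this sample the complete rows lie on the line salary = 5 * experience + 25, so the predicted value for the missing row is about 40.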

Advantages -

It often gives a better result than the previous methods.
It takes into consideration the covariance between the column with missing values and the other columns.
It can produce unbiased estimates of the model parameters.

Disadvantages -

The predicted values are only a proxy for the true values.
Bias also arises when an incomplete conditioning set is used for a categorical variable.

Using Algorithms Which Support Missing Values

Not all machine learning algorithms support missing values, but some are robust to missing values in the dataset.
The k-NN algorithm can ignore a column in the distance measure when a value is missing.
Naive Bayes can also support missing values when making a prediction.
Random Forest works well on non-linear and categorical data. It adapts to the data structure, taking into consideration high variance or bias, and produces better results on large datasets.
These algorithms can be used when the dataset contains null or missing values.
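
The k-NN behaviour mentioned above can be illustrated with a small sketch in which the distance skips any coordinate that is missing in either point. This is a simplified, assumed implementation rather than the behaviour of any particular library. It does not rescale for the number of skipped coordinates, and all names and data are hypothetical.

```python
import math
from collections import Counter

def partial_distance(a, b):
    """Euclidean distance over the coordinates present in both points."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # nothing to compare, so treat the points as far apart
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: partial_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (features, label), with None for missing values.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, None), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
    ((None, 1.2), "A"),
]

query = (1.2, None)  # the second feature is missing in the query as well
prediction = knn_predict(train, query, k=3)
print(prediction)
```

Here the query is compared on the first feature only, and the training point with no coordinate in common is pushed to the end of the ranking.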

Advantages -

No need to handle missing values for each column as machine learning algorithms will handle them efficiently.

Disadvantages -

It can be a very time-consuming process.
The distance function can be Euclidean, Manhattan, etc., and this choice does not always yield a robust result.
