Hello, and welcome to this tutorial. In the previous tutorial, we learned how to import the libraries and the dataset. Now we are going to start preparing the data so that our machine learning models run correctly. In most cases, you will have to deal with missing data. In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. This happens often, so you need to know how to take care of it.


Getting the Dataset

I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you’ll get similar results.

I have also included a data_preprocessing.R file, which contains the template we are using to prepare our data for machine learning. Both files are in a zip file. To download the dataset and the template, click here.

Create a folder and give it a name. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.

We’ve been creating this template from the beginning of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.

In RStudio, navigate to the folder where you stored the downloaded files and set that folder as the working directory.
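If you prefer to set the working directory from code rather than through the RStudio interface, a minimal sketch could look like the following. The folder name is hypothetical and is created inside R's temporary directory only so that the snippet runs anywhere; in practice you would point `project_dir` at the folder where you extracted the zip file.

```r
# Hypothetical folder; replace this path with the folder that holds
# your extracted dataset and data_preprocessing.R files.
project_dir <- file.path(tempdir(), "data_preprocessing")
dir.create(project_dir, showWarnings = FALSE)

# Make that folder the working directory for the current R session.
setwd(project_dir)

# Confirm where R is now looking for files, and list what is there.
print(getwd())
print(list.files())
```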


Select all the lines of code in the data_preprocessing.R file and press ‘Ctrl+Enter’ to execute.

Taking care of missing data in R


In our dataset, we have two missing entries: one in the Age column, in a row for Switzerland, and another in the Salary column, in a row for France. R displays these missing values as NA.
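You can also locate missing entries from code instead of scanning the table by eye. The sketch below uses a small made-up data frame, not the tutorial's actual file: the values are invented for illustration and only mirror the layout described above (a missing Age for Switzerland and a missing Salary for France).

```r
# Made-up stand-in for the tutorial dataset (values are illustrative only).
dataset <- data.frame(
  Country = c("France", "Switzerland", "Germany", "France", "Switzerland"),
  Age     = c(44, NA, 30, 38, 35),
  Salary  = c(72000, 48000, 54000, NA, 61000)
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(dataset))
print(missing_per_column)

# Show only the rows that contain at least one missing value.
rows_with_missing <- dataset[!complete.cases(dataset), ]
print(rows_with_missing)
```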


There are a number of ways to deal with missing entries. One, we could remove the rows that contain missing data. However, this can be a dangerous practice because those rows could contain crucial information, so it often does not make sense to throw away a whole observation.
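To get a feel for how much data the first option throws away, here is a small sketch using base R's `na.omit` on a made-up data frame (the values are invented for illustration, not taken from the tutorial's file).

```r
# Made-up stand-in for the tutorial dataset (values are illustrative only).
dataset <- data.frame(
  Country = c("France", "Switzerland", "Germany", "France", "Switzerland"),
  Age     = c(44, NA, 30, 38, 35),
  Salary  = c(72000, 48000, 54000, NA, 61000)
)

# Drop every row that has at least one missing value.
complete_rows <- na.omit(dataset)

# Compare the number of observations before and after.
print(nrow(dataset))
print(nrow(complete_rows))
```

In this toy example, two missing cells cost two whole rows, including the values in those rows that were perfectly fine.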

Two, and this is one of the most common ways to handle missing data, we could replace each missing value with the mean of its column. This is the method we are going to use in this tutorial.

Taking care of Missing Data in R using the Mean/Average

In R, we are going to handle the two columns with missing entries separately, replacing each missing value with the mean of its own column.

Age Column

We take the ‘Age’ column of the dataset, dataset$Age, and assign to it the result of the ifelse function, which takes three parameters. The first parameter is the condition, which checks whether each value in the column is missing. The statement starts like this:

dataset$Age = ifelse(is.na(dataset$Age),

The second parameter is the value to return where the condition is true. If the condition is true, the value is missing and we have to replace it with the mean of the column. To compute the mean, we use the mean function with na.rm = TRUE so that missing values are ignored, wrapped in ave, which returns the result as a vector of the same length as the column:

ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),

The third parameter is the value to return where the condition is not true. If the condition is not true, the value is not missing, so we simply keep the existing value from the ‘Age’ column:

dataset$Age
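To see what the second parameter produces on its own, the sketch below compares `mean` with `ave` on a made-up vector of ages (the numbers are invented for illustration). With no grouping variable, `ave` repeats the result of the function once per element, which lines up element by element with the column inside `ifelse`.

```r
# Made-up ages with one missing value (illustrative only).
age <- c(44, NA, 30, 38, 35)

# mean() returns a single number; na.rm = TRUE skips the NA.
column_mean <- mean(age, na.rm = TRUE)
print(column_mean)

# ave() returns that same mean repeated for every element of the input.
repeated_mean <- ave(age, FUN = function(x) mean(x, na.rm = TRUE))
print(repeated_mean)
```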

That’s done. The complete statement should look like this:

dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
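If you want to try the statement without the tutorial files, here is a self-contained sketch that runs it on a made-up data frame. The values are invented for illustration; only the column names follow the tutorial.

```r
# Made-up stand-in for the tutorial dataset (values are illustrative only).
dataset <- data.frame(
  Country = c("France", "Switzerland", "Germany", "France", "Switzerland"),
  Age     = c(44, NA, 30, 38, 35),
  Salary  = c(72000, 48000, 54000, NA, 61000)
)

# Replace each missing Age with the mean of the non-missing ages;
# values that are already present are kept unchanged.
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)

print(dataset$Age)
```

Only the second element changes, because `ifelse` evaluates the condition element by element rather than once for the whole column.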


Select all the lines of code we just added and press ‘Ctrl + Enter’. If you take a look at our dataset, you’ll see that the missing value has been replaced with the mean of the values in the ‘age’ column.


Salary Column

We are going to do the same for the ‘Salary’ column. Just replace dataset$Age with dataset$Salary everywhere it appears, and make sure the lines of code are properly aligned. The complete statement should look like this:

dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

Select the lines of code and press ‘Ctrl + Enter’. The missing value has now been replaced by the mean of the values in the ‘Salary’ column.
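Since the two statements differ only in the column name, one possible variation is to wrap the logic in a small helper and apply it to both columns. This is a sketch, not part of the tutorial's template: `impute_mean` is a hypothetical name, and the data frame is made up for illustration.

```r
# Made-up stand-in for the tutorial dataset (values are illustrative only).
dataset <- data.frame(
  Country = c("France", "Switzerland", "Germany", "France", "Switzerland"),
  Age     = c(44, NA, 30, 38, 35),
  Salary  = c(72000, 48000, 54000, NA, 61000)
)

# Hypothetical helper: replace missing values in a numeric vector
# with the mean of its non-missing values.
impute_mean <- function(x) {
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

# Apply the helper to each numeric column that may contain missing values.
numeric_cols <- c("Age", "Salary")
dataset[numeric_cols] <- lapply(dataset[numeric_cols], impute_mean)

print(dataset)

# Check that no missing values remain anywhere in the data frame.
print(anyNA(dataset))
```

Here the single mean is recycled by `ifelse`, so `ave` is not needed; the tutorial's version with `ave` gives the same result.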


Congratulations, now you know how to take care of missing data in R. From now on, missing data should not bother you anymore. I look forward to seeing you in the next tutorial, where we will talk about dealing with categorical data.
