Hello, and welcome to this tutorial. In the previous tutorial, we learned how to import the libraries and the dataset. Now, we’re finally going to start preparing the data so that our machine learning models run correctly. In most cases, you are going to have to deal with missing data. In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. It happens really often, so you need to know how to take care of missing data.
Getting the Dataset
I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you get similar results.
I have also included a data_preprocessing.R file, and this file contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click here.
Create a folder and give it a name. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.
We’ve been creating this template from the beginning of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.
In RStudio, navigate to the folder where you stored the downloaded files and set that folder as the working directory.
Select all the lines of code in the data_preprocessing.R file and press ‘Ctrl+Enter’ to execute.
In our dataset, we have two missing data entries: one in the age column, in the row for Switzerland, and another in the salary column, in the row for France.
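If you would rather confirm the missing entries in code than by eye, something like the following should work. This is only a sketch: the data frame below is a made-up stand-in for the tutorial's dataset (the values are hypothetical), and the variable names `missing_per_column` and `rows_with_missing` are mine.

```r
# Hypothetical stand-in for the tutorial's dataset; the values are made up.
dataset = data.frame(
  Country = c("France", "Switzerland", "Germany", "France"),
  Age     = c(44, NA, 30, 35),
  Salary  = c(72000, 48000, 54000, NA)
)

# Count the missing entries in each column.
missing_per_column = colSums(is.na(dataset))
print(missing_per_column)

# Show only the rows that contain at least one missing entry.
rows_with_missing = dataset[!complete.cases(dataset), ]
print(rows_with_missing)
```

With your own data, you can skip the `data.frame(...)` part and run the last two steps on the dataset you already loaded.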
There are a number of ways we can deal with the missing data entries. One, we could remove the rows with the missing data. However, this is a risky practice because the removed rows could contain crucial information in their other columns. Here, it would not make sense to throw away a whole observation because of one missing value.
Two, and this is the most common way to handle missing data, is to replace each missing value with the mean of its column. This is the method we are going to use in this tutorial.
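To see what the first option would cost, here is a small sketch of dropping the incomplete rows with `na.omit`. The data frame is again a hypothetical stand-in with made-up values, not the tutorial's actual file.

```r
# Hypothetical stand-in for the tutorial's dataset; the values are made up.
dataset = data.frame(
  Country = c("France", "Switzerland", "Germany", "France"),
  Age     = c(44, NA, 30, 35),
  Salary  = c(72000, 48000, 54000, NA)
)

# na.omit() drops every row that has at least one missing value.
complete_rows = na.omit(dataset)

# Compare the number of observations before and after.
rows_before = nrow(dataset)
rows_after  = nrow(complete_rows)
cat("Rows before:", rows_before, "- rows after:", rows_after, "\n")
```

In this toy example, two missing values cost us half of the observations, even though each dropped row still had valid values in its other columns.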
Taking care of Missing Data in R using the Mean/Average
In R, we are going to replace each of the two missing entries with the mean of its own column, handling the two columns separately.
Age Column
We take the ‘Age’ column of the dataset, dataset$Age, and overwrite it with the result of R’s ifelse function, which takes three arguments. The first argument is the condition, which checks whether each value in the column is missing. The statement starts with the assignment and the condition:
dataset$Age = ifelse(is.na(dataset$Age),
The second argument is the value to return where the condition is true. A true condition means the value is missing, so we replace it with the average/mean of the column. To compute the average while ignoring the missing value, we call the mean function with na.rm = TRUE, wrapped in ave so that the result has the same length as the column:
ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
The third argument is the value to return where the condition is not true. If the condition is not true, the value is not missing, so we simply return the existing value from the ‘Age’ column:
dataset$Age
That’s done. Our complete statement should look like this:
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
Select all the lines of code we just added and press ‘Ctrl + Enter’. If you take a look at our dataset, you’ll see that the missing value has been replaced with the mean of the values in the ‘age’ column.
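If the three arguments still feel abstract, it may help to compute each one on its own. The sketch below uses a small made-up vector called `ages` (hypothetical values, not the tutorial's data) and stores each argument in a variable before passing it to `ifelse`.

```r
# Hypothetical ages with one missing value.
ages = c(44, NA, 30, 35)

# Argument 1: the condition, TRUE wherever a value is missing.
is_missing = is.na(ages)

# Argument 2: with no grouping variable, ave() returns a vector of the
# same length as its input, with every element set to the result of FUN.
column_mean = ave(ages, FUN = function(x) mean(x, na.rm = TRUE))

# Argument 3 is the original vector. ifelse() then picks, element by
# element, from column_mean where is_missing is TRUE and from ages otherwise.
filled = ifelse(is_missing, column_mean, ages)
print(filled)
```

Only the second element changes: it becomes the mean of 44, 30 and 35, while the other three values are returned untouched.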
Salary Column
We are going to do the same for the ‘Salary’ column. Just replace every dataset$Age in the statement above with dataset$Salary, and make sure the lines of code are properly aligned. The complete statement is the same three lines with dataset$Salary in place of dataset$Age.
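Applying that substitution to the ‘Age’ statement gives the code below. The data frame at the top is a hypothetical stand-in with made-up values so that the snippet runs on its own; with the tutorial's dataset already loaded, only the last three lines are needed.

```r
# Hypothetical stand-in for the tutorial's dataset; the values are made up.
dataset = data.frame(
  Country = c("France", "Switzerland", "Germany", "France"),
  Age     = c(44, NA, 30, 35),
  Salary  = c(72000, 48000, 54000, NA)
)

# Same statement as for the Age column, with Salary substituted throughout.
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
```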
Select the lines of code and press ‘Ctrl + Enter’. Our missing value has now been replaced by the mean of the values in the ‘Salary’ column.
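If your own dataset has more than two numeric columns, repeating the statement by hand gets tedious. One possible variation, which is not part of the template, is to loop over the columns and fill every numeric one. Note that this sketch calls `mean` directly instead of wrapping it in `ave`; `ifelse` recycles the single mean value, so the result is the same. The data frame is a hypothetical stand-in with made-up values.

```r
# Hypothetical stand-in for the tutorial's dataset; the values are made up.
dataset = data.frame(
  Country = c("France", "Switzerland", "Germany", "France"),
  Age     = c(44, NA, 30, 35),
  Salary  = c(72000, 48000, 54000, NA)
)

# Fill the missing values of every numeric column with that column's mean.
for (column_name in names(dataset)) {
  column = dataset[[column_name]]
  if (is.numeric(column)) {
    dataset[[column_name]] = ifelse(is.na(column),
                                    mean(column, na.rm = TRUE),
                                    column)
  }
}

# Check that no missing values remain anywhere in the dataset.
print(anyNA(dataset))
```

The non-numeric ‘Country’ column is skipped, since a mean makes no sense for text values.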
Congratulations, now you know how to take care of missing data using R in Data Science. From now on, missing data in R should not bother you anymore. I look forward to seeing you in the next tutorial, where we will talk about dealing with Categorical Data.