Hello and welcome to this tutorial. We’re halfway through our data preprocessing phase. We’ve learned how to install R and RStudio, import the dataset, and take care of missing data using the R programming language. Now I’m going to show you how to encode categorical data in R. To learn how to encode categorical data in Python, go Here.
Getting the Dataset
I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you’ll get similar results.
I have also included a data_preprocessing.R file, and this file contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click Here.
Create a folder and give it a name. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.
We’ve been creating this template from the beginning of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.
In RStudio, navigate to the folder where you stored the downloaded files and set that folder as the working directory.
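If you prefer to set the working directory from code rather than through the RStudio menus, a minimal sketch could look like the one below. The folder used here is a made-up temporary folder so the snippet runs anywhere; in practice you would put the path of the folder you created.

```r
# Hypothetical folder for illustration; replace it with your own folder path.
project_dir <- file.path(tempdir(), "data_preprocessing_demo")
dir.create(project_dir, showWarnings = FALSE)

# setwd() switches the working directory and returns the previous one
old_dir <- setwd(project_dir)

# Confirm where R is now looking for files
print(getwd())

# Switch back so the rest of the session is unaffected
setwd(old_dir)
```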
Select all the lines of code in the data_preprocessing.R file and press ‘Ctrl+Enter’ to execute.
Understanding our Dataset
If you take a look at our dataset, you’ll see that we have two categorical variables. We have the Country variable, with the categories Netherlands, Switzerland, and France, and we have the Purchased variable, with the categories Yes and No.
They’re categorical variables because their values are categories rather than numbers. Since machine learning models are based on mathematical equations, keeping the text in the categorical variables would cause us problems. We want to have numbers only in our equations. That is why we need to encode the text as numbers so that our machine learning models can work with it.
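If you want to check the categories from the console, a quick sketch like this may help. The small data frame is a stand-in I made up for illustration: only the column names and the category names come from this tutorial, the rows are invented.

```r
# Hypothetical stand-in for the tutorial's dataset; the rows are invented.
dataset <- data.frame(
  Country   = c("France", "Netherlands", "Switzerland", "Netherlands"),
  Purchased = c("No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# List the distinct categories in each column
country_categories   <- unique(dataset$Country)
purchased_categories <- unique(dataset$Purchased)
print(country_categories)
print(purchased_categories)

# Count how many rows fall into each country
print(table(dataset$Country))
```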
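To see the problem for yourself, here is a small sketch of what happens when plain text meets arithmetic. It is only an illustration on a single made-up value, not part of the template.

```r
# Arithmetic on text raises an error; tryCatch() captures the message
# so the script keeps running.
result <- tryCatch("France" * 2, error = function(e) conditionMessage(e))
print(result)

# Converting text that is not a number gives NA (with a warning)
converted <- suppressWarnings(as.numeric("France"))
print(converted)
```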
Encoding categorical data / variables in R
We are going to use the factor function. The factor function transforms your categorical variables into numeric categories but still sees them as factors. On top of that, the factor function allows you to choose the labels/names of those factors. Let’s take a look at our dataset, then get straight to encoding our categories.
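Before applying it to the dataset, it can help to try the factor function on a small standalone vector. This is a sketch on made-up values, using the same levels and labels we are about to use for the Country column.

```r
# Made-up values for illustration
countries <- c("France", "Netherlands", "Switzerland", "Netherlands")

encoded <- factor(countries,
                  levels = c("Netherlands", "Switzerland", "France"),
                  labels = c(1, 2, 3))

# The values are now shown as their labels
print(encoded)

# The result is still a factor, and its labels are stored as text
print(is.factor(encoded))
print(levels(encoded))
```

Note that the labels are stored as the text "1", "2", "3", which is what "still sees them as factors" means in practice.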
We will transform the Country column into a column of factors, and specify what those factors are.
We just take the Country column – dataset$Country – then we use the factor function – factor() – and in the factor function we are going to specify 3 things:
- First, the column we want to transform: dataset$Country,
- Second, we’re going to specify the levels, which are the names of the categories in the Country column: levels = c('Netherlands', 'Switzerland', 'France'),
- Third, we specify the labels, meaning which number we assign to each of Netherlands, Switzerland, and France (you can use any numbers you want): labels = c(1, 2, 3)
That’s it. The whole function should look something like this:
# Encode Country: Netherlands -> 1, Switzerland -> 2, France -> 3
dataset$Country = factor(dataset$Country,
                         levels = c('Netherlands', 'Switzerland', 'France'),
                         labels = c(1, 2, 3))
Now, if you take a look at our dataset, the names – Netherlands, Switzerland, and France – have been encoded with the numbers – 1, 2, 3 respectively.
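If you want to confirm the result from the console, and see what happens when a category name is misspelled, a sketch along these lines may be useful. The data frame and the misspelled value are invented for illustration.

```r
# Hypothetical stand-in for the Country column; the rows are invented.
dataset <- data.frame(
  Country = c("France", "Netherlands", "Switzerland", "Netherlands"),
  stringsAsFactors = FALSE
)

dataset$Country <- factor(dataset$Country,
                          levels = c("Netherlands", "Switzerland", "France"),
                          labels = c(1, 2, 3))

# Check that the column is now a factor with the labels we chose
str(dataset)
print(levels(dataset$Country))

# A value that is not listed in `levels` silently becomes NA,
# so a typo in a category name is worth checking for.
misspelled <- factor(c("France", "Netherland"),
                     levels = c("Netherlands", "Switzerland", "France"),
                     labels = c(1, 2, 3))
print(misspelled)
```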
Purchased Column
We are going to do the same for the Purchased column. Just copy everything from the above function. Replace ‘dataset$Country’ with ‘dataset$Purchased’ and the levels with levels = c('No', 'Yes'). Also, replace the labels with labels = c(0, 1). The whole function should look like this:
# Encode Purchased: No -> 0, Yes -> 1
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
That’s it. Select the code and press ‘Ctrl + Enter’. Likewise, if you look at the dataset, the names Yes and No have been replaced with 1 and 0 respectively. Let’s take a look at our dataset now.
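One detail that is easy to trip over with 0/1 labels: a factor keeps internal integer codes that always start at 1, so they do not match the labels 0 and 1. The sketch below, on made-up values, shows the difference and one way to get the labels back as real numbers if you ever need them.

```r
# Made-up values for illustration
purchased <- c("No", "Yes", "No", "Yes")

encoded <- factor(purchased,
                  levels = c("No", "Yes"),
                  labels = c(0, 1))

# Cross-check the original text against the encoded labels
print(table(original = purchased, encoded = encoded))

# Internal codes are positions in the levels, so they start at 1
internal_codes <- as.integer(encoded)
print(internal_codes)

# Going through as.character() recovers the labels as numbers
label_values <- as.numeric(as.character(encoded))
print(label_values)
```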
You’ve seen how you can encode categorical data in R. Next, we are going to separate the Training dataset from the Testing dataset. See you then.