We’re almost done with Data Preprocessing, and we’re about to begin making Machine Learning Models. We just need a few more steps to get our dataset fully prepared. In this tutorial, we are going to learn how to split any dataset into a Training Set and a Test Set, and why it is necessary.
Getting the Dataset
I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you’ll get similar results.
I have also included a data_preprocessing.R file, and this file contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click here.
Create a folder and give it a name. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.
We’ve been creating this template from the beginning of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.
Installing R and RStudio || Importing the Dataset || Taking Care of Missing Data || Encoding Categorical Data
Splitting the Dataset into the Training Set and the Test Set – Intuition
One very important thing we need to do is split the dataset into a Training Set and a Test Set. Why is this important?
If you take a look at our dataset, we have ten observations. What we should do with any Machine Learning model is split our data into two parts: a Training set and a Test set.
If you take a look at the name itself, Machine Learning is all about teaching a machine how to do something. Your algorithm is going to learn from your data, to make predictions and perform other machine learning objectives.
The machine learning model is going to learn to do something from your dataset by picking up on the correlations that might be in that dataset. Now, imagine the model learns the correlations in one dataset too closely, memorizing its quirks instead of the general pattern (this is known as overfitting). How would it behave with other datasets with slightly different correlations?
We’re going to build our Machine Learning model on one dataset, but then we have to test it on a new set, which is going to be slightly different from the dataset on which we built the Machine Learning Model.
(Okay, No more ‘too lengthy’ sentences)
That’s why we need to make two different sets of data: a Training set on which we build and train the Machine Learning model, and a Test set on which we test the performance of the Machine Learning model.
Ideally, the model’s performance on the Test set shouldn’t be that different from its performance on the Training set. This would mean that the Machine Learning (ML) model has understood the correlations well and can adapt to new sets and situations.
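To make this concrete, here is a minimal sketch in base R. It does not use the tutorial's dataset: the toy data, the 30/10 split and the names toy, train, test and rmse are all made up for illustration. It trains a simple model on one part of the data and then measures the error on both parts.

```r
# Hypothetical toy data (not the tutorial's dataset):
# y depends linearly on x, plus some random noise.
set.seed(123)
x <- runif(40, 0, 10)
y <- 2 * x + rnorm(40, sd = 3)
toy <- data.frame(x = x, y = y)

# Train on the first 30 rows, hold out the last 10 rows for testing.
train <- toy[1:30, ]
test  <- toy[31:40, ]

# Fit a simple linear model on the training rows only.
model <- lm(y ~ x, data = train)

# Root mean squared error of the model's predictions on a data frame.
rmse <- function(df) sqrt(mean((df$y - predict(model, newdata = df))^2))

train_rmse <- rmse(train)
test_rmse  <- rmse(test)
print(c(train = train_rmse, test = test_rmse))
```

If the two numbers are close, the model is behaving on unseen data roughly the way it behaves on the data it learned from, which is what we are hoping for.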
That’s the whole idea behind Splitting the Dataset into the Training Set and the Test Set. In this article, we’re going to learn how to do it in R. If you want to learn how to do this in Python, go Here.
Splitting the Dataset into the Training Set and the Test Set
Open RStudio and set the Working Directory. To set the Working Directory in RStudio, just go to Session on the navigation bar, select ‘Set Working Directory’, and then ‘Choose Directory’.
Navigate to the folder you created earlier and click open.
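If you prefer to do this step from a script instead of the menu, base R has setwd() and getwd() for it. The sketch below is only an illustration: the folder name ml_tutorial is hypothetical, and it is created inside a temporary directory so the example runs anywhere. In practice you would pass the path of the folder you created earlier.

```r
# Hypothetical folder, created in a temporary location for this example.
# Replace project_dir with the path to your own folder.
project_dir <- file.path(tempdir(), "ml_tutorial")
dir.create(project_dir, showWarnings = FALSE)

# setwd() changes the working directory and returns the previous one.
old_dir <- setwd(project_dir)

# getwd() shows where R is now looking for files.
current_dir <- getwd()
print(current_dir)

# Go back to where we started.
setwd(old_dir)
```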
You should see in the console section that the working directory has been set. In the files section, open the data_preprocessing.R file, select all the lines of code in that file and press Ctrl + Enter on the keyboard to execute the code.
First, we have to install a package. We’re going to use a package that makes a good split of our dataset. This package is called ‘caTools’. To install it, just type: install.packages('caTools')
Select that line of code and execute (Ctrl + Enter).
After it’s done installing, just delete the line or leave it as a comment as you won’t need to install it again. The package will now appear listed in the packages section of RStudio.
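Another option, if you would rather keep the line in your script, is to install the package only when it is missing. This is just a sketch: requireNamespace() is a base R function that checks whether a package is available, and the repos URL below is one CRAN mirror picked for the example, so feel free to use your own.

```r
# Install caTools only if it is not already available on this machine.
# The mirror URL is an assumption for this example; any CRAN mirror works.
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools", repos = "https://cloud.r-project.org")
}

# TRUE if the package can now be loaded, FALSE otherwise.
caTools_available <- requireNamespace("caTools", quietly = TRUE)
print(caTools_available)
```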
We’ve installed the ‘caTools’ package, but we still need to load it before we can use it. You can load it by checking the box next to it in the packages section. Alternatively, if you feel like flexing your script-skills a lil bit, just type: library(caTools)
Execute that line of code and finally, we’re good to go.
In Python, we set the ‘Random State’ to zero so that we could get the same results for our example. Well, here it’s going to be the same. We’re going to set a seed to get the same results. In RStudio type: set.seed(123)
You can set a seed of any number you want. To keep things simple, we used 123. Use 123 if you want to get the same results as mine on this specific example.
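Here is a small base R sketch of what the seed does. The variable names are made up for the example. Setting the same seed before drawing random numbers gives the same numbers every time, while a different seed gives a different sequence.

```r
# Same seed, same random numbers.
set.seed(123)
first_draw <- runif(3)

set.seed(123)
second_draw <- runif(3)

print(identical(first_draw, second_draw))

# A different seed starts a different random sequence.
set.seed(456)
third_draw <- runif(3)
print(identical(first_draw, third_draw))
```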
Unlike in Python, where we did it in one line, here we’re going to create a variable that we’re going to call ‘split’. It will hold the result of the sample.split function, which decides which observations go into the Training set and which go into the Test set. In RStudio, start by typing: split = sample.split() (we’ll fill in the arguments next).
Tip: To see the arguments that go into a function, place the cursor on the function name and press F1. For example, type sample.split and then press F1.
We’re going to have a few arguments. The first one is Y. Unlike in Python, here we just put the dependent variable vector ‘y’. Our dependent variable is the Purchased column. So: split = sample.split(dataset$Purchased)
The second parameter is going to be the split ratio. This is just the proportion of the observations that you want to put into your Training set. We want this to be 80%, so the whole line of code should look like this: split = sample.split(dataset$Purchased, SplitRatio = 0.8)
This will return TRUE or FALSE for each of your observations. It will be TRUE if an observation goes to the Training set, and FALSE if it goes to the Test set.
Run the code. Now go to the console, write ‘split’ and press ‘Enter’. You’ll see that you have ten values, some TRUE and others FALSE.
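If you want to try these steps without the downloaded files, here is a sketch on a made-up stand-in for the dataset. It assumes the caTools package is already installed. Only the Purchased column name comes from this tutorial; the Age column and all the values are invented for illustration.

```r
library(caTools)

# Hypothetical stand-in for the tutorial's dataset: ten observations,
# with Purchased as the dependent variable (values invented).
dataset <- data.frame(
  Age = c(44, 27, 30, 38, 40, 35, 39, 48, 50, 37),
  Purchased = factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
)

set.seed(123)
split <- sample.split(dataset$Purchased, SplitRatio = 0.8)

print(split)         # one TRUE/FALSE value per observation
print(table(split))  # how many observations go to each set
```

With ten observations and a split ratio of 0.8, eight values should come back TRUE and two FALSE.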
All we need to do now is create the Training set and the Test set separately.
To create the Training set type: training_set = subset(dataset, split == TRUE)
To create the Test set type: test_set = subset(dataset, split == FALSE)
Run the code. Now, if you go to the Environment pane, you’ll see that you’ve created both sets of data. Just click on either to open.
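You can also check the two sets from the console instead of clicking on them. The sketch below uses base R only. The data frame and the TRUE/FALSE vector are made up for illustration (the vector is written by hand here, just to stand in for what sample.split might return), so your own rows will differ.

```r
# Hypothetical stand-in for the dataset (values invented).
dataset <- data.frame(
  Age = c(44, 27, 30, 38, 40, 35, 39, 48, 50, 37),
  Purchased = factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
)

# Hand-written example of a split vector: TRUE = Training set, FALSE = Test set.
split <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

training_set <- subset(dataset, split == TRUE)
test_set     <- subset(dataset, split == FALSE)

# Every observation should land in exactly one of the two sets.
print(nrow(training_set))
print(nrow(test_set))
print(intersect(rownames(training_set), rownames(test_set)))
```

The row counts of the two sets should add up to the number of rows in the original dataset, and no row should appear in both.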
Congratulations, you’re almost there. In the next tutorial, we’re going to be performing Feature Scaling. I’ll tell you why it is important to scale our data and show you how to do it. Subscribe to be notified whenever we post the latest tutorial on Data Science.