Hello and welcome to this R tutorial. In the previous section we learnt how to create two nonlinear regression models: the polynomial regression model and the Support Vector Regression (SVR) model. In this article, I’m going to show you how to create another nonlinear regression model in R: the Decision Tree Regression model.
To learn how to create Decision Tree Regression models in Python, go here.
We are going to use the same dataset we used while creating the polynomial regression model and the SVR model. Let’s see how Decision Tree Regression will do, and compare it to the previous two nonlinear regression models that we created.
Getting the Dataset
I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you’ll get similar results.
I have also included a decision_tree_regression.R file, and this file contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click here.
Extract the contents of the zip file you downloaded. This is the folder that we are going to use as the working directory.
We created this template in the first part of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.
How to set the working directory in RStudio
Open RStudio and set the Working Directory. To set the Working Directory in RStudio, just go to Session on the navigation bar, select ‘Set Working Directory’, and then ‘Choose Directory’.
Navigate to the folder you created above (the one with the downloaded files), and click Open. You should see in the console that the working directory has been set.
In the Files pane, open the decision_tree_regression.R file, select all the lines of code in that file and press Ctrl + Enter to execute the code.
You should now see the dataset in the ‘Environment’ pane. Just click on it and it’ll appear in the main window.
Our Business Problem
You’re in the Human Resources team in a big company, and you’re about to hire a new employee. You’ve found someone who seems to be a very good fit for the job. You are about to make an offer to this person and it’s time to negotiate his/her salary. The interviewee tells you that he/she has 19 years of experience, was receiving a salary of 160,000 in his/her previous job, and is asking for nothing less than 160,000. One member of your HR team decides to call the interviewee’s previous company and ask whether the information the interviewee has provided is true. Unfortunately, the only information the team member gets is the Position_Salaries.csv file that you downloaded earlier.
The HR team member also finds out that our interviewee has been a regional manager in the previous company for two years, and that it takes an average of four years to move from regional manager to partner. This means that our interviewee was halfway to becoming partner. He/she was halfway between level six and level seven, so we can say level 6.5. The HR team member says that he can build a bluffing detector using regression to detect whether the interviewee is bluffing or not. Let’s build a Decision Tree Regression model as our detector, to predict whether it’s the truth or a bluff.
We are going to be building on the code in the decision_tree_regression.R file that we downloaded earlier in this tutorial.
In that code, we imported the dataset, and selected the only two columns that we need.
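The import step described above might look like the following minimal sketch. The template file itself is not reproduced in this article, so the column layout (Position, Level, Salary), the column name Level, and the stand-in salary values are all assumptions; the stand-in values were picked so that ten observations average 249500, the figure reported later in this article.

```r
# Assumed layout of Position_Salaries.csv: Position, Level, Salary
csv_file = 'Position_Salaries.csv'

if (file.exists(csv_file)) {
  dataset = read.csv(csv_file)
  dataset = dataset[2:3]  # keep only the level and salary columns
} else {
  # Stand-in data (an assumption), used when the file is not available
  dataset = data.frame(
    Level = 1:10,
    Salary = c(45000, 50000, 60000, 80000, 110000,
               150000, 200000, 300000, 500000, 1000000)
  )
}

print(dataset)
```

The fallback branch is only there so the sketch runs without the download; with the real file in the working directory, only the two read lines matter.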
Just a heads up: ‘Levels’ is the independent variable, while ‘Salary’ is our dependent variable. We are going to use the relationship between the two to train our nonlinear machine learning model to predict salaries, for example the salary for an employee at level 6.5.
The next step would be to split the dataset into the training and test sets. However, this time we won’t do that. We are dealing with a very small dataset of only ten observations, so that we can best understand how machine learning models work. The next step would be feature scaling but we won’t need to do it either. That’s why that whole part has been commented out.
And now let’s build our Decision Tree Regression model in R. Remember, to learn how to create Decision Tree Regression models in Python, go here.
Decision Tree Regression in R for Data Science
Just as with the previous models, we’re going to import a package and then use a function from this package to build our regressor. The package for the Decision Tree Regression model is the ‘rpart’ package.
If you don’t have the ‘rpart’ package installed, just type the following line of code in RStudio and execute it: install.packages('rpart')
After ‘rpart’ has been installed, comment that line of code out by adding a hash in front of it, or by highlighting the line and pressing ‘Ctrl + Shift + C’.
We also need to load the package. To do this, add the following line of code: library(rpart)
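As an alternative to commenting the install line in and out by hand, the two steps can be combined so that the package is installed only when it is missing. This is a small sketch of that idea, not part of the original template.

```r
# Install rpart only if it is not already available, then load it
if (!requireNamespace('rpart', quietly = TRUE)) {
  install.packages('rpart')
}
library(rpart)
```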
As usual, we’ll start by creating a regressor. We’ll call the rpart function from the rpart package and assign its result to this regressor. So, just type: regressor = rpart()
Tip: to see the arguments that go into a function, place the cursor between the function name and the opening parenthesis and press F1.
The first argument is the formula. And this formula is going to be: formula = Salary ~ .
The dot is used to represent all the independent variables.
The second argument is the data. This is the data that we are going to build our model on. It’s going to be our dataset so, data = dataset.
Weights is an optional argument. You can add weights to give some observations more influence than others. This is a bit more advanced and we’re not going to cover it in this tutorial.
You also have some other arguments that are optional but that can help you make your model even more robust.
They include some regularization techniques that prevent overfitting and so on. Right now, we just want to build a simple Decision Tree Regression model. Since we have a small dataset, we’ll only need formula and data.
With those few lines of code, our regressor is ready to be built. We can now run the code.
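Put together, the formula and data arguments described above give a regressor along these lines. The data frame here is a stand-in for the imported Position_Salaries.csv; its column name Level and its salary values are assumptions made so that the sketch runs on its own.

```r
library(rpart)

# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Fit the tree: Salary explained by every other column (here only Level)
regressor = rpart(formula = Salary ~ ., data = dataset)

# Printing the fitted object lists the nodes of the tree
print(regressor)
```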
Predicting the result
So now, let’s see what our created model predicts as the salary for the 6.5 level.
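A prediction for a single level can be requested with predict, passing a one-row data frame whose column name matches the independent variable. The sketch below assumes that column is called Level and uses stand-in salary values in place of the real file.

```r
library(rpart)

# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

regressor = rpart(formula = Salary ~ ., data = dataset)

# Predict the salary for level 6.5
y_pred = predict(regressor, data.frame(Level = 6.5))
print(y_pred)
```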
Wow!!! Our model predicts the salary for the 6.5 level as 249500. This is way higher than what our interviewee was asking for. What could be the problem?
Let’s visualize the graph and see where the problem could be. Just select the plotting code and execute it.
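The plotting code from the template is not shown in this article, so here is one possible sketch using base R graphics rather than whatever the template uses. The fine grid step of 0.01 is my own choice; a tree predicts in steps, so a dense grid draws the curve more faithfully than the ten original points would. The data frame is again an assumed stand-in.

```r
library(rpart)

# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

regressor = rpart(formula = Salary ~ ., data = dataset)

# Dense grid of levels so that any steps in the prediction are visible
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
y_grid = predict(regressor, newdata = data.frame(Level = x_grid))

# Red points: observed salaries; blue line: model predictions
plot(dataset$Level, dataset$Salary, col = 'red', pch = 19,
     main = 'Decision Tree Regression', xlab = 'Level', ylab = 'Salary')
lines(x_grid, y_grid, col = 'blue')
```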
From the graph we can see that we have a horizontal line, like we got in the SVR in Python tutorial. In the case of SVR in Python, this straight horizontal line was as a result of not applying feature scaling to our dataset. Could this be the same problem?
With Decision Tree Models, we don’t need to apply feature scaling. This is because this model is built based on conditions on the independent variable, and not on Euclidean distances. Feature scaling is applied on machine learning models with Euclidean distances.
Therefore, in our current situation, feature scaling is not the problem. Even if you apply feature scaling to our model, you’ll still encounter this problem. So, what could be the issue?
The horizontal line is one Decision Tree Model. However, it’s not the best version of Decision Tree model that we want.
The problem we are facing is actually related to the number of splits. The Decision Tree Regression model is made by making some splits based on different conditions. The more conditions you have on your independent variables, the more splits you have.
According to the graph, our model clearly has no splits, as all the predictions are equal to 249500, roughly $250,000. Our model just took all the salaries across all the levels and averaged them.
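The claim that the tree has no splits can be checked directly rather than read off the graph. In an rpart object, the frame component holds one row per node, so a tree without splits has exactly one row. The data below is an assumed stand-in for the real file.

```r
library(rpart)

# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

regressor = rpart(formula = Salary ~ ., data = dataset)

# One row per node: a single row means the tree is just the root
n_nodes = nrow(regressor$frame)

# With no splits, every prediction is the overall average salary
mean_salary = mean(dataset$Salary)

print(n_nodes)
print(mean_salary)
```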
As we saw earlier, the rpart function has a number of arguments that we can use to make our model more robust.
The argument that we’re most interested in here is the control argument. This is a very good way of dealing with the problem that we have encountered. Improving model performance is something that machine learning scientists do very often in their job. In later sections of this course, we’ll look at the best methods for selecting parameter values when we cover cross-validation. However, right now we’re just going to make a simple performance improvement with the control argument.
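So, let’s go back to our model, add the control argument and employ a little trick. To our regressor, add the following argument: control = rpart.control(minsplit = 1)
We called the rpart.control function, which is part of the rpart package, and set its minsplit argument. minsplit is the minimum number of observations a node must contain before a split is attempted. Its default is 20, and since our dataset has only ten observations, the default never allows a split. Setting minsplit = 1 removes that restriction.
Select the code and execute it again.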
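With the control argument added, the full call might look like the sketch below. The data frame is an assumed stand-in for Position_Salaries.csv, including the column name Level.

```r
library(rpart)

# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# minsplit = 1 allows a node to be split however few observations it holds
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))

# Terminal nodes are marked '<leaf>' in the frame component
n_leaves = sum(regressor$frame$var == '<leaf>')

print(regressor)
print(n_leaves)
```

Printing the regressor lists each split condition, which is a quick way to confirm that the tree is no longer a single node before plotting it.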
Now let’s first visualize the results to see if our model is now correct, before getting to the final verdict.
And now it makes much more sense. Based on entropy and information gain, the model splits the whole range of the independent variable into different intervals (for a regression tree such as this one, rpart picks the splits that most reduce the spread of the salaries within each interval).
So here we can clearly see the intervals. The first interval is from 1 to 6.5. The second interval is from 6.5 to 8.5. The third interval is from 8.5 to 9.5. And finally, the last interval is from 9.5 to 10.
The decision tree regression model is considering the average of the dependent variable values in each of the intervals.
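The averaging can be reproduced by hand, without rpart, by grouping the levels into the four intervals listed above and taking the mean salary in each. The salary values and the column name Level are assumed stand-ins for the real file, and the boundary points are assigned to the upper interval, which matches the prediction reported for level 6.5.

```r
# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Intervals [1, 6.5), [6.5, 8.5), [8.5, 9.5), [9.5, 10]
intervals = cut(dataset$Level,
                breaks = c(-Inf, 6.5, 8.5, 9.5, Inf),
                right = FALSE)

# Average salary inside each interval
interval_means = tapply(dataset$Salary, intervals, mean)
print(interval_means)
```

Under these assumed values, levels 7 and 8 fall in the second interval, so its average is the mean of their two salaries.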
Whether we read it off the graph or compute it with the predict function (our y_pred value), the predicted salary for level 6.5 is $250,000.
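For completeness, here is a sketch of the final prediction with the tuned regressor. As before, the data frame is an assumed stand-in for the real file and Level is an assumed column name.

```r
library(rpart)

# Stand-in for the imported dataset (assumed values and column names)
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))

# Predict the salary for level 6.5 with the tree that now has splits
y_pred = predict(regressor, data.frame(Level = 6.5))
print(y_pred)
```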
That’s all for now. It’s a bit hard to get the concept of Decision Tree Regression using only 1-D. However, I hope this article gave you an idea of how to create and enhance Decision Tree Models in R. I’ll see you in the next one.