Hello and welcome to this tutorial. We have learnt how to create simple and multiple linear regression models. Now, let’s learn how to create polynomial regression models in R and where we would apply them to solve real-life problems.
According to Wikipedia, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y. In this tutorial, we are going to build such a nonlinear model. It is very useful for datasets where the relationship between the independent variable and the dependent variable is not linear.
In this tutorial, I’m going to show you how to create polynomial regression models in R. To learn how to do it in Python, go here.
Getting the Dataset
I have prepared the dataset that we are going to be using in this tutorial.
I have also included a polynomial_regression.R file, and this file contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click here.
Create a folder and give it a name like ‘Polynomial Regression’. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.
We created this template in the first part of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.
Installing R and RStudio || Importing the Dataset || Taking Care of Missing Data || Encoding Categorical Data
Open RStudio and set the working directory. To do this, go to ‘Session’ on the navigation bar, select ‘Set Working Directory’, and then ‘Choose Directory’.
Navigate to the folder you created above (the one with the downloaded files), and click ‘Open’. You should see in the console that the working directory has been set.
In the files section, open the polynomial_regression.R file, select all the lines of code in that file, and press Ctrl + Enter to execute the code.
You should now see the dataset in the ‘Environment’ pane. Click on it and it will open in the main window.
Our Business Problem
You’re on the Human Resources team of a big company, and you’re about to hire a new employee. You’ve found someone who seems to be a very good fit for the job. You are about to make an offer to this person, and it’s time to negotiate his/her salary. The interviewee tells you that he/she has 19 years of experience, was receiving a salary of one hundred and sixty thousand in his/her previous job, and is asking for nothing less than that.
One of the members of the Human Resources team decides to call the interviewee’s previous company and ask whether the information the interviewee has provided is true. Unfortunately, the only information the team member gets is the Position_Salaries.csv file that you downloaded earlier.
The HR team member also finds out that our interviewee has been a regional manager in the previous company for two years, and that it takes an average of four years to move from regional manager to partner. This means that our interviewee was halfway to becoming partner, that is, halfway between level six and level seven. We can call that level 6.5.
The HR team member says that he can use regression to build a ‘bluffing detector’ that checks whether the interviewee is bluffing or not.
Let’s build a polynomial regression model to serve as that detector and tell us whether the salary claim is the truth or a bluff.
We are going to be building on the code in the polynomial_regression.R file that we downloaded earlier in this tutorial.
In that code, we imported the dataset, and selected the only two columns that we need.
Just a heads up, ‘Level’ is the independent variable, while ‘Salary’ is our dependent variable. We are going to use the relationship between the two to train our nonlinear machine learning model to predict salaries, for example the salary for an employee at level six and a half.
The next step would normally be to split the dataset into training and test sets. However, this time we won’t do that. We are dealing with a very small dataset of only ten observations, so we need all of them to train the model. The step after that would be feature scaling, but we won’t need it either. That’s why that whole part has been commented out.
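For reference, the data preparation described above might look roughly like the sketch below. To keep it runnable without the download, it builds a stand-in data frame instead of calling read.csv(). The Position labels, the salary values, and the assumption that Level and Salary are the second and third columns are illustrative guesses, so check them against your own copy of Position_Salaries.csv.

```r
# Stand-in for: dataset = read.csv('Position_Salaries.csv')
# Position labels and salary values are assumed for illustration only.
dataset = data.frame(
  Position = paste0('Position', 1:10),  # hypothetical labels
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Keep only the two columns we need (assumed to be columns 2 and 3).
dataset = dataset[2:3]

# No train/test split and no feature scaling for this ten-row dataset.
print(dataset)
```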
Linear Regression Vs Polynomial Regression in R
Next, we are going to fit our dataset to a polynomial regression model. However, to best understand how a polynomial regression model is more powerful in our situation, we are going to compare it to a baseline model, a linear regression model.
We are going to build two models, the linear regression model and the polynomial regression model, and compare their plots and their predictions. The comparison should convince you that a polynomial regression model is more appropriate for this kind of problem, mainly because the problem is nonlinear.
We begin by creating our regressor, lin_reg, and assigning it the result of the lm() function. We pass lm() two arguments. The first is the formula, formula = Salary ~ . (Salary modeled on every other column, which here is just Level). The second is the data, data = dataset.
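A minimal sketch of that step is shown below. The data frame is a stand-in with assumed salary values so that the snippet runs on its own. In your script, dataset comes from the template instead.

```r
# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Fit Salary as a straight-line function of Level.
lin_reg = lm(formula = Salary ~ ., data = dataset)

# Inspect coefficients, residuals and R-squared.
print(summary(lin_reg))
```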
Select the code and press Ctrl + Enter. We have built our model. Type summary(lin_reg) in the console to view it.
Create a second regressor and call it poly_reg. Assign it the result of lm(), exactly as we did for linear regression, passing the formula and the data. To turn this from a linear regression into a polynomial regression model, we need to add some polynomial features. These features are additional independent variables: the values in the Level column raised to different powers.
The new independent variables make up the matrix of features, and fitting a multiple linear regression on them is what gives us a polynomial regression model. We are going to add three columns to our dataset: the Level values squared, cubed, and to the power of four. To do this, we add lines such as dataset$Level2 = dataset$Level^2 above the poly_reg regressor, and likewise for the third and fourth powers.
The whole code should look like this.
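Here is one way the polynomial step could be written. Level2 follows the line quoted above. The names Level3 and Level4 are assumptions chosen to match it, and the data frame is again a stand-in with assumed salary values.

```r
# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Polynomial features: Level squared, cubed and to the fourth power.
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3  # column name assumed
dataset$Level4 = dataset$Level^4  # column name assumed

# Same lm() call as before; the extra columns make it a degree-4 polynomial fit.
poly_reg = lm(formula = Salary ~ ., data = dataset)
print(summary(poly_reg))
```

Because the formula is Salary ~ ., every column other than Salary is used as a predictor, so the new columns are picked up automatically.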
Select the code and execute it. To take a look at the regressor, type summary(poly_reg) in the console.
Visualizing our Models
We are going to use the ‘ggplot2’ package. If you have not installed this package, check out our ‘Simple Linear Regression’ tutorial to see how we did it. Alternatively, you can run install.packages('ggplot2'). We also need to load the library with library(ggplot2) so that we can use it.
To visualize our data, copy the code below into your polynomial regression file. I explained the code in a previous tutorial, so I’m not going to do it again.
Visualizing the Linear Regression model
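The plot for the linear model could be sketched as follows. This is not necessarily the exact code from the earlier tutorial. The colours, the title, and the helper data frame plot_lin are illustrative choices, and the salary values are assumed.

```r
library(ggplot2)

# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
lin_reg = lm(formula = Salary ~ ., data = dataset)

# Put observed salaries and fitted values side by side for plotting.
plot_lin = data.frame(
  Level = dataset$Level,
  Salary = dataset$Salary,
  Predicted = predict(lin_reg, newdata = dataset)
)

p_lin = ggplot(plot_lin, aes(x = Level)) +
  geom_point(aes(y = Salary), colour = 'red') +     # observed salaries
  geom_line(aes(y = Predicted), colour = 'blue') +  # fitted straight line
  ggtitle('Truth or Bluff (Linear Regression)') +
  xlab('Level') +
  ylab('Salary')
print(p_lin)
```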
Select the code and execute.
If you look at our graph, you will see that there is no linear relationship between the level and the salary. Most of our observation points are below the line, while others are way above it. For most of our observations, the predicted results are way off. Let’s take the CEO, for example. If we used our linear regression model, we see that the predicted result for the CEO is about 690k. You can imagine how furious a candidate for CEO would be if we told him/her that he/she was bluffing by asking for a salary of 1 million.
On the other hand, if we used our linear regression model, we would overpay someone at level six and a half, because the predicted result is way higher than the actual observations around that level. That’s why we need a polynomial regression model for this situation. We need a model with a curved line that will help us make more accurate predictions.
Visualizing the Polynomial Regression model
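A matching sketch for the polynomial model is shown below. It reuses the same plotting pattern as the linear plot. The column names Level3 and Level4, the helper data frame plot_poly, the title, and the salary values are assumptions.

```r
library(ggplot2)

# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ ., data = dataset)

# Observed salaries next to the polynomial model's fitted values.
plot_poly = data.frame(
  Level = dataset$Level,
  Salary = dataset$Salary,
  Predicted = predict(poly_reg, newdata = dataset)
)

p_poly = ggplot(plot_poly, aes(x = Level)) +
  geom_point(aes(y = Salary), colour = 'red') +     # observed salaries
  geom_line(aes(y = Predicted), colour = 'blue') +  # fitted curve
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')
print(p_poly)
```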
As you can see, we don’t have a straight line anymore. We have a curve that fits our observations much more closely.
We can make the curve smoother by predicting salaries for more than ten levels. We do it by building a vector of imaginary levels from one to ten in increments of 0.1, that is, 1, 1.1, 1.2, 1.3… all the way to 10. Instead of ten levels, we’ll have 91. We do that by using the following code.
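One possible version of the smoother plot is sketched below. The names x_grid and grid are illustrative, and the salary values are assumed. Note that the grid needs the same polynomial columns as the training data, otherwise predict() cannot evaluate the model on it.

```r
library(ggplot2)

# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ ., data = dataset)

# Imaginary levels from 1 to 10 in steps of 0.1.
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)

# The grid must carry the same feature columns the model was trained on.
grid = data.frame(Level = x_grid, Level2 = x_grid^2,
                  Level3 = x_grid^3, Level4 = x_grid^4)
grid$Predicted = predict(poly_reg, newdata = grid)

p_smooth = ggplot() +
  geom_point(data = dataset, aes(x = Level, y = Salary), colour = 'red') +
  geom_line(data = grid, aes(x = Level, y = Predicted), colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')
print(p_smooth)
```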
Select the code and execute.
Predicting the results
Let’s use our models to predict the results for our potential employee in the six and a half level and see if he/she was bluffing.
Let’s start with the linear regression model. To predict the result, use the following line of code.
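A sketch of the linear prediction is shown below, again on a stand-in data frame with assumed salary values. The name y_pred_lin is an illustrative choice. predict() expects a data frame whose column name matches the predictor used in training.

```r
# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
lin_reg = lm(formula = Salary ~ ., data = dataset)

# Predict the salary for level 6.5 with the straight-line model.
y_pred_lin = predict(lin_reg, data.frame(Level = 6.5))
print(y_pred_lin)
```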
Select the code and execute.
If we take a look at the predicted result in the RStudio console, we see that our model predicted a salary of about 330k, which is way above the 160k our potential employee said he/she used to receive.
Let’s now use the polynomial regression model to predict the result. To do this, use the following lines of code.
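The polynomial prediction could look like the sketch below. The new observation has to supply every feature the model was trained on, so 6.5 is also squared, cubed and raised to the fourth power. Column names other than Level and Level2 are assumptions, as are the salary values and the name y_pred_poly.

```r
# Stand-in for the downloaded CSV; the salary values are assumed for illustration.
dataset = data.frame(
  Level = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ ., data = dataset)

# Level 6.5 with the same polynomial features as the training data.
new_level = data.frame(Level = 6.5,
                       Level2 = 6.5^2,
                       Level3 = 6.5^3,
                       Level4 = 6.5^4)
y_pred_poly = predict(poly_reg, new_level)
print(y_pred_poly)
```

If the assumed values match your CSV, the printed figure should be far below the linear model's estimate and much closer to the candidate's claim.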
Select the code and execute.
We can see that the polynomial regression model made a more accurate prediction of about 158k, which is close to the 160k that our potential employee gave us. The salary claim looks like the truth rather than a bluff, and our model works.
Congratulations, you have created your first nonlinear regression model. We’ll be creating a lot more of these in the future. See you in the next one.