Hello, and welcome to this tutorial. We’ve finished the Data Preprocessing part, and now it’s time to start building Machine Learning models. We’re going to start with the Simple Linear Regression model, and I will show you how to do it in R. To learn how to do Simple Linear Regression in Python, go here.
Before we begin, we need to understand our data and the problem we are trying to solve.
Getting the Dataset
I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see whether you get similar results.
I have also included a simple_linear_regression.R file, and this file contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click here.
Create a folder and give it a name like ‘Simple Linear Regression’. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.
We created this template in the first part of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already. The relevant earlier parts are: Installing R and RStudio, Importing the Dataset, Taking Care of Missing Data, and Encoding Categorical Data.
Open RStudio and set the Working Directory. To set the Working Directory in RStudio, just go to Session on the navigation bar, select ‘Set Working Directory’, and then ‘Choose Directory’.
Navigate to the folder you created above (the one with the downloaded files) and click Open. You should see in the console section that the working directory has been set.
In the files section, open the simple_linear_regression.R file, select all the lines of code in that file and press Ctrl + Enter on the keyboard to execute the code.
Notice that inside our template code, the line install.packages('caTools') is commented out. This line of code installs a package that we need to split our dataset into a training set and a test set. To see why this is important, read ‘Splitting the dataset into the Training Set and the Test Set’.
If you already have caTools installed, leave this line as a comment. Otherwise, uncomment it by removing the ‘#’ before it, then select the line of code and execute it. After the package has been installed, it’ll appear in the Packages tab at the bottom right of RStudio’s interface. Once it has been installed, comment that line out again.
You should now see the dataset in the ‘Environment’ pane. Just click on it and it’ll appear in our main window.
Dataset and Business Problem description
If we take a look at our dataset, it’s basically 30 observations, taken from 30 random employees in a company. Each employee was asked how many years of experience they have (their overall experience in the workforce, not just in that company) and what Salary they receive. The company has hired you as a Data Scientist to find out whether there is any correlation between the Years of Experience and the Salary and, if there is, what type of correlation it is. The company’s HR is aware that experience matters. However, they don’t want to keep assigning salaries randomly.
Your job as the Data Scientist is to create a model, which will show the best-fitting line for the relationship between the Years of Experience and the Salary. You will show the company how they are currently setting salaries and also give them a more accurate model/set-of-rules on how to set salaries for new employees in the future.
We’re going to use the data in the Salary_Data.csv file to build a Simple Linear Regression Model.
Before making any machine learning model, you have to know which variables are the independent variables and which one is the dependent variable. In our case, the independent variable is the number of years of experience, while the dependent variable is the salary. We’re trying to predict the dependent variable based on the information in the independent variable.
Just a quick look at our template code. First, we imported the data into RStudio. Then we split our dataset using a 2/3 split ratio, which puts 20 observations in our training set and 10 observations in our test set.
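For readers who want to see what such a split does, here is a minimal sketch of a 2/3 split. It is not the template’s code: it uses base R’s sample() instead of caTools so that it runs without extra packages, and the dataset is a synthetic stand-in (made-up values) rather than the contents of Salary_Data.csv. Only the column names YearsExperience and Salary are taken from this tutorial.

```r
# Synthetic stand-in for Salary_Data.csv: 30 made-up rows, NOT the tutorial's data.
set.seed(123)
years = round(seq(1.1, 10.5, length.out = 30), 1)
dataset = data.frame(YearsExperience = years,
                     Salary = 26000 + 9400 * years + rnorm(30, sd = 5000))

# Pick 20 of the 30 row indices at random for the training set (a 2/3 split).
train_rows = sample(nrow(dataset), size = 20)
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

# Check the sizes of the two sets.
nrow(training_set)
nrow(test_set)
```

With the real file, the synthetic data frame would be replaced by reading the CSV, for example with read.csv('Salary_Data.csv').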
We’re going to use our training set to train our simple linear regression model. Our model will learn correlations between the Years of experience and the Salary using the training set. Then, we’re going to test the model’s power of prediction on the test set.
The next part would be Feature Scaling, but the Simple Linear Regression package we are going to use here in R takes care of this for us, so we won’t need to apply feature scaling manually. The data preprocessing phase is done. We are ready to start building the Linear Regression Model.
Building a Simple Linear Regression Model In R
As I stated above, we are going to begin with our training set.
Fitting Simple Linear Regression to the Training Set
We’re going to use what is called the ‘lm()’ function.
Just type lm and then press F1 to get info about the ‘lm()’ function and the arguments that go into it.
- One of them is the formula, which is going to be “the dependent variable, expressed as a linear combination of the independent variable”: formula = Salary ~ YearsExperience
- The second one is the data. This is the data on which we want to train our Simple Linear Regression Model. In our case, this is the training set that we created earlier.
There are some other arguments that would go into the lm() function but they are optional and we don’t need them in this case.
Let’s create a new variable that is going to be the simple linear regressor and call it regressor: regressor = lm(). The lm() function is going to take our two arguments.
- First, the formula, with the dependent variable expressed as a linear combination of the independent variable: formula = Salary ~ YearsExperience
- Second, the data. In this case, we want the training set: data = training_set
The whole line of code should look like this: regressor = lm(formula = Salary ~ YearsExperience, data = training_set)
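To try this call outside the tutorial’s files, the sketch below fits the same formula to a small, made-up training_set (illustrative values only, not the tutorial’s data) and prints the two fitted coefficients.

```r
# Made-up stand-in for the tutorial's training_set (illustrative values only).
training_set = data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3),
  Salary = c(39000, 43500, 57000, 56000, 66000, 81000, 98000, 113000, 105000, 122000)
)

# Fit Salary as a linear function of YearsExperience on the training set.
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# The intercept and the slope of the best-fitting line.
coef(regressor)
```

The slope is the estimated change in Salary for each additional year of experience.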
Select the line of code and execute. Our regressor is ready.
If you want to get any information about our regressor, the best way to do it is to go to the console section, type summary(regressor) and press Enter. You’ll see some really good info about our Simple Linear Regression model.
First, it shows you our formula: the Salary expressed as a linear function of the Years of Experience. It also tells you that the model was built on the training set.
Then we have info about the residuals. We won’t discuss that for now.
The most important section is the Coefficients section. Not only does it tell us the value of the coefficients in the Simple Linear Regression model, but also the statistical significance of the coefficients. We have three stars, which means the YearsExperience independent variable is highly statistically significant. A coefficient can have anything from no stars, which means there is no statistical significance, to three stars, which means there is high statistical significance. That’s our first hint of what is going to happen: there will be a strong linear relationship between the Salary and the Years of Experience.
The last column is the P-Value. This is another indication of the statistical significance. The lower the P-Value is, the more significant the independent variable is going to be, i.e. the stronger the evidence that the independent variable has an effect on the dependent variable.
Normally, a good threshold for the P-Value is 5%. When we’re below 5%, the independent variable is considered statistically significant, and when we’re above 5%, it is not considered significant at that level.
That’s how you get the information.
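The same numbers can also be read from the summary in code rather than by eye. The sketch below uses a made-up training_set (illustrative values, not the tutorial’s data). Applying coef() to the summary returns the Coefficients table as a matrix, from which the slope and its P-Value can be picked out by row and column name.

```r
# Made-up stand-in for the tutorial's training_set (illustrative values only).
training_set = data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3),
  Salary = c(39000, 43500, 57000, 56000, 66000, 81000, 98000, 113000, 105000, 122000)
)
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# The Coefficients table from summary(): one row per coefficient.
coef_table = coef(summary(regressor))

# Estimated salary change per extra year, and the P-Value of that estimate.
slope = coef_table['YearsExperience', 'Estimate']
p_value = coef_table['YearsExperience', 'Pr(>|t|)']

# TRUE when the slope is significant at the 5% threshold.
p_value < 0.05
```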
We’re done fitting our Simple Linear Regression to our training set. It’s now time to predict the Test set results, to see how our Simple Linear Regression model behaves on a new set of data.
Predicting the Test set results
We’ve trained our model and now we want to see how well it would predict new observations. To do this, we are going to create our vector of predictions, y_pred.
We called it y_pred as it will contain the predicted values of the dependent variable, which is Salary (it’s on the Y-axis). We are going to use the predict() function.
We’re only going to have two arguments in our function. The first one is our regressor (the simple linear regression model we fitted earlier). So; y_pred = predict(regressor)
Our second argument will be newdata. That’s the name of the argument, and this is the data that contains the observations for which we want to predict the results, i.e. the Test set.
The whole line of code should be: y_pred = predict(regressor, newdata = test_set)
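As a sketch of how this prediction step can be tried in isolation, the block below fits the model on a made-up training_set and predicts a made-up test_set (both are illustrative values, not the tutorial’s data). It then places the predictions next to the actual salaries; the comparison data frame and its column names are hypothetical additions for readability.

```r
# Made-up stand-ins for the tutorial's training_set and test_set.
training_set = data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3),
  Salary = c(39000, 43500, 57000, 56000, 66000, 81000, 98000, 113000, 105000, 122000)
)
test_set = data.frame(
  YearsExperience = c(1.5, 4.5, 8.7),
  Salary = c(37500, 61000, 109000)
)

regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# One predicted salary per row of the test set.
y_pred = predict(regressor, newdata = test_set)

# Actual and predicted salaries side by side, with the difference.
comparison = data.frame(YearsExperience = test_set$YearsExperience,
                        Actual = test_set$Salary,
                        Predicted = y_pred,
                        Difference = test_set$Salary - y_pred)
comparison
```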
Highlight the line of code and execute. The Vector of Prediction has been created. Inside the console, type ‘y_pred’ and press ‘Enter’.
Our simple linear regression has predicted the salary for each of the Test set observations. The predicted salaries are not exactly the same as the ones we have in the test set. However, since we saw a strong linear dependency between the Years of Experience and the Salary, most of the results are pretty close to the real salaries.
For presentation purposes, you will need to display these results on a graph. Let’s do that.
Visualizing the training set results in Graphs
The first thing we need to do is install and import the ggplot2 package. It is a really good way of plotting something in R. To install it, write the code install.packages('ggplot2'). After the package has been installed, you can comment out that line of code, as we won’t need to install it again. We just import it using the line of code library(ggplot2).
We’re going to take a step-by-step approach to plotting our graph. First, we’re going to plot all the observation points in the training set, then we’re going to plot the regression line, then we add the title and finally the labels for the x and y axes.
The different components we’re going to plot are going to be separated by a plus (+) sign. Our whole plotting code should look like this:
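A sketch of what this plotting code could look like is given below. The colours, the title and the axis labels are illustrative choices, and the small training_set is made up so that the block runs by itself (it still needs the ggplot2 package to be installed).

```r
library(ggplot2)

# Made-up stand-in for the tutorial's training_set (illustrative values only).
training_set = data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3),
  Salary = c(39000, 43500, 57000, 56000, 66000, 81000, 98000, 113000, 105000, 122000)
)
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

training_plot = ggplot() +
  # 1. The observation points of the training set.
  geom_point(data = training_set,
             aes(x = YearsExperience, y = Salary), colour = 'red') +
  # 2. The regression line: the model's predictions for the training set.
  geom_line(data = training_set,
            aes(x = YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  # 3. The title, then the labels for the x and y axes.
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')

print(training_plot)
```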
We can now see our graph.
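Let’s do the same for the test set results. Just copy the code above and edit the first line, the one that plots the observation points, to change it from training_set to test_set. The regression line stays the same, because it comes from the model we trained on the training set. The block of code should look like this: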
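A sketch of the test set version follows, under the same assumptions as before: made-up training_set and test_set values, and illustrative colours and labels. Only the points layer uses test_set; the line is still drawn from the model’s predictions on the training set.

```r
library(ggplot2)

# Made-up stand-ins for the tutorial's training_set and test_set.
training_set = data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3),
  Salary = c(39000, 43500, 57000, 56000, 66000, 81000, 98000, 113000, 105000, 122000)
)
test_set = data.frame(
  YearsExperience = c(1.5, 4.5, 8.7),
  Salary = c(37500, 61000, 109000)
)
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

test_plot = ggplot() +
  # The observation points now come from the test set.
  geom_point(data = test_set,
             aes(x = YearsExperience, y = Salary), colour = 'red') +
  # The regression line is unchanged: it was fitted on the training set.
  geom_line(data = training_set,
            aes(x = YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')

print(test_plot)
```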
Now we can see the Test set results on a graph.
We have seen the correlation. Generally, the more years of experience, the higher the salary. We’ve seen that in some cases employees received less or more than the best-fitting line suggests. We’ve also given the company the best-fitting line and the model they should use to set salaries in the future. Mission accomplished.
Congratulations, now you know how to create a Simple Linear Regression Model in R. In the next tutorial, we are going to learn how to do Multiple Linear Regression in R. See you then.