Hello and welcome to this tutorial. We are going to learn how to implement a Multiple Linear Regression model in R. This is a bit more complex than Simple Linear Regression but it’s going to be so practical and fun.

Multiple Linear Regression is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. A multiple linear regression model attempts to model the relationship between two or more explanatory variables (independent variables) and a response variable (dependent variable) by fitting a linear equation to observed data. Every set of values of the independent variables x is associated with a value of the dependent variable y.
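
Written out, the linear equation being fitted takes the general form below. The symbols are conventional notation rather than names taken from the dataset: y is the dependent variable, x_1 to x_n are the independent variables, b_0 is the intercept, b_1 to b_n are the coefficients the model estimates, and epsilon is the error term.

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
```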

We’ll understand this better by using a very practical example.

Getting the Dataset

I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you’ll get similar results.

I have also included a simple_linear_regression.R file, which contains the template that we are using to prepare our data for Machine Learning. Both of these files are in a zip file. To download the dataset and the template, click here.

Create a folder and give it a name like ‘Simple Linear Regression’. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.

We created this template in the first part of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.

Previous Tutorials

Installing R and RStudio || Importing the Dataset || Taking Care of Missing Data || Encoding Categorical Data

Open RStudio and set the Working Directory. To set the Working Directory in RStudio, just go to Session on the navigation bar, select ‘Set Working Directory’, and then ‘Choose Directory’.

Navigate to the folder you created above (the one with the downloaded files), and click Open. You should see in the console that the working directory has been set.

In the files section, open the simple_linear_regression.R file, select all the lines of code in that file and press Ctrl + Enter on the keyboard to execute the code.

Notice that inside our template code, the line ‘install.packages('caTools')’ is commented out. This line of code installs a package that we need to split our dataset into a training set and a test set. To see why this is important, read ‘Splitting the dataset into the Training Set and the Test Set’.

If you already have ‘caTools’ installed, leave this line as a comment. Otherwise, if you don’t have ‘caTools’ installed, uncomment it by removing the ‘#’ before it. Select the line of code and execute. After the package has been installed, it’ll appear on the bottom right of RStudio’s interface. Once it has been installed, comment out that line.

Multiple Linear Regression in R

You should now see the dataset on the ‘Environment’ pane. Just click on it and it’ll appear on our main window.

Dataset + Business Problem description

Open the 50_Startups.csv file that we downloaded earlier in this tutorial.

Our dataset contains data about 50 startups. Each observation records the amount the startup spent on Research and Development, administration and marketing, the location in which the startup operates, and the profit the startup made. Our challenge is to check whether there is any correlation between the independent variables and the profit. More specifically, we want to build a model that helps a Venture Capitalist Fund predict the dependent variable (Profit) from the independent variables (R&D Spend, Administration, Marketing and Location). Beyond that, we want to help the investors see which independent variable has the strongest effect on the profit, and what kind of relationship links the profit to those independent variables.

Just a heads up before we dive into this section; there’s a caveat around building regression models. Linear regressions have assumptions.

Assumptions of Linear Regression

  1. Linearity
  2. Homoscedasticity
  3. Multivariate normality
  4. Independence of errors
  5. Lack of multicollinearity

We won’t focus on the assumptions in this section. However, before you build a linear regression model, always do your research and make sure that these assumptions are true. It’s only after you do make sure that the assumptions are correct, that you can go ahead and follow the steps that I’ll show you in this tutorial.

Unlike the Simple Linear Regression model, where we were dealing with one independent variable and one dependent variable, a multiple linear regression model has more than one independent variable. Not every one of them is necessarily a useful predictor, so we often have to remove some columns to end up with a better model. There are five methods of building a model.

  1. All-in
  2. Backward Elimination
  3. Forward Selection
  4. Bidirectional Elimination
  5. Score Comparison

We’re only going to focus on Backward Elimination in this tutorial because it is the fastest one, and you’re still going to learn how to build a model step-by-step. Without further ado, let’s begin.

Building a Multiple Linear Regression Model in R (Step-by-Step)

Data Preprocessing

In our template, we imported the dataset, encoded the categorical variables and split the data into the training set and the test set. We are going to build on the code in that template.
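
For reference, here is a rough sketch of what those preprocessing steps might look like. The template itself is not reproduced here, so treat everything below as an assumption: the column names (R.D.Spend, Administration, Marketing.Spend, State, Profit), the state labels, the 80/20 split, and especially the data, which is randomly generated stand-in data so that the snippet runs on its own. The split uses base R's sample() in place of caTools, purely to keep the sketch free of package dependencies.

```r
# Stand-in data (ASSUMPTION): 50 randomly generated rows, NOT the real 50_Startups.csv.
# With the real file you would use something like: dataset = read.csv('50_Startups.csv')
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = sample(c('New York', 'California', 'Florida'), n, replace = TRUE)
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

# Encode the categorical variable as a factor (labels 1, 2, 3 are an assumed encoding)
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))

# Split into a training set (80%) and a test set (20%).
# Base R sample() is swapped in here for caTools' sample.split().
train_rows   = sample(seq_len(n), size = round(0.8 * n))
training_set = dataset[train_rows, ]
test_set     = dataset[-train_rows, ]
```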

Fitting the Multiple Linear Regression Model to our training set.

First, we create our Multiple Linear Regression model and call it ‘regressor’. To build it, we use the lm function, ‘lm()’, which takes two arguments here: the formula and the training set. The formula is going to be ‘formula = Profit ~ .’. The ‘.’ is used to represent all the independent variables. The second argument is going to be the training set: ‘data = training_set’.
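
Put together, the fitting step might look like the sketch below. The first few lines only build randomly generated stand-in data (an assumption, not the real 50_Startups values) so that the snippet runs on its own; in your session, training_set already comes from the template. The column names are assumed as well.

```r
# Stand-in data (ASSUMPTION): made-up numbers, not the real dataset.
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(sample(1:3, n, replace = TRUE))  # assumed encoding
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)
training_set = dataset[1:40, ]
test_set     = dataset[41:50, ]

# Fit Profit against all other columns ('.') using the training set
regressor = lm(formula = Profit ~ ., data = training_set)

# Coefficient table with p-values and significance stars
summary(regressor)
```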

Press ‘Ctrl + Enter’. To see our regressor, go to the console and type, ‘summary(regressor)’.

If you take a look at the summary, you’ll see that some independent variables are more statistically significant than others. We can see this by looking at the p-value column, ‘Pr(>|t|)’, and the significance codes (the stars) next to it. That’s why we need backward elimination: to keep only the most significant independent variables and end up with a simpler, more reliable model.

All we have to do now is predict the test results. We just need one line for this: ‘y_pred = predict(regressor, newdata = test_set)’.

Predicted results. Compare them to the actual observations

Press ‘Ctrl + Enter’. In the console section in RStudio, type y_pred and press Enter. You’ll see that our model’s predictions are not too far from the real observations. It shows that this is not a bad model.
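
Rather than comparing the numbers by eye, the predictions can be placed next to the actual profits. This is a minimal sketch under the same assumptions as before: the data is randomly generated stand-in data, the column names are assumed, and the ‘comparison’ data frame is a name introduced here only for illustration.

```r
# Stand-in data (ASSUMPTION): made-up numbers, not the real dataset.
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(sample(1:3, n, replace = TRUE))  # assumed encoding
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)
training_set = dataset[1:40, ]
test_set     = dataset[41:50, ]

regressor = lm(formula = Profit ~ ., data = training_set)

# Predict profits for the observations the model has not seen
y_pred = predict(regressor, newdata = test_set)

# Put the actual and predicted profits side by side
comparison = data.frame(actual     = test_set$Profit,
                        predicted  = y_pred,
                        difference = test_set$Profit - y_pred)
print(comparison)
```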

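The most important thing to understand here is that the lower the p-value (shown as ‘Pr(>|t|)’ in the summary), the more statistically significant an independent variable is. In other words, a low p-value means stronger evidence that the variable really is related to the dependent variable. It does not, by itself, tell us how large that effect is.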

When we look at our regressor, we can see that only one variable has a high significance level: the R&D Spend (shown by the three stars). This means that we could actually turn this into a simple linear regression and express the profit as a linear function of the R&D spend only. However, it is better to remove the less significant variables one by one and see whether we arrive at a better model.
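
If you would rather pull the p-values out of the summary than read them off the console, something like the sketch below should work. As before, the data is randomly generated stand-in data and the column names are assumptions, so the numbers will not match the real 50_Startups output.

```r
# Stand-in data (ASSUMPTION): made-up numbers, not the real dataset.
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(sample(1:3, n, replace = TRUE))  # assumed encoding
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)
training_set = dataset[1:40, ]

regressor = lm(formula = Profit ~ ., data = training_set)

# The coefficient table from summary() is a matrix; 'Pr(>|t|)' holds the p-values
coef_table = coef(summary(regressor))
p_values   = coef_table[, 'Pr(>|t|)']

# Smallest p-values (most significant terms) first
print(sort(p_values))
```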

Backward Elimination in Multiple Linear Regression

We’re going to use the same regressor as before, but we’re going to change a few things. First, we need to write out each independent variable in our formula, because backward elimination involves removing them one by one. The second thing we’re going to change is the data, from the training set to the whole dataset.

We’re also going to add ‘summary(regressor)’, so that we can always see the summary of our regressor.
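
The starting point for backward elimination might therefore look like this. It is a sketch only: the variable names in the formula are assumed column names, and the data is randomly generated stand-in data rather than the real 50_Startups file.

```r
# Stand-in data (ASSUMPTION): made-up numbers, not the real dataset.
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(sample(1:3, n, replace = TRUE))  # assumed encoding
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

# Every independent variable written out, fitted on the whole dataset
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = dataset)
summary(regressor)
```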

A commonly used threshold is 5 percent. This means that if the p-value is lower than 5 percent (0.05), the independent variable is considered statistically significant. The further the p-value is above 5 percent, the less statistically significant the variable is. We’ll remove the variables whose p-values are higher than 5 percent one by one, starting with the highest and refitting the model after each removal.

That is going to be your homework. You can check out the solution in the picture below if you get stuck.
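
If you do get stuck, here is one possible way to automate the elimination. It is not necessarily the approach the tutorial intends. Instead of reading the p-values off ‘summary(regressor)’ by hand, this sketch uses R's drop1() with an F test, which gives one p-value per variable (for a numeric variable this matches the p-value in the summary, and for a factor such as State it tests all of its levels together). The names ‘sl’, ‘remaining’, ‘tests’ and ‘worst’ are introduced here for illustration, and the data is again randomly generated stand-in data with assumed column names.

```r
# Stand-in data (ASSUMPTION): made-up numbers, not the real dataset.
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(sample(1:3, n, replace = TRUE))  # assumed encoding
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)

sl = 0.05  # significance level (the 5 percent threshold)
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = dataset)

repeat {
  remaining = attr(terms(regressor), 'term.labels')
  if (length(remaining) == 0) break            # nothing left to remove

  # One p-value per variable (the first row of the table is '<none>', so skip it)
  tests    = drop1(regressor, test = 'F')
  p_values = setNames(tests[['Pr(>F)']][-1], rownames(tests)[-1])

  if (max(p_values) <= sl) break               # everything left is significant

  # Drop the variable with the highest p-value and refit
  worst     = names(which.max(p_values))
  regressor = update(regressor, as.formula(paste('. ~ . -', worst)))
}

summary(regressor)
```

Doing the same thing by hand, one variable at a time, is still worth the practice, because it forces you to look at the summary after every step.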

Visualizing the Results

If you want to visualize the prediction model that we created above, just copy the code below.
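
As a minimal sketch of one way to do this, assuming base R graphics (not necessarily the plotting code this tutorial originally used), you can plot the predicted profit against the actual profit for the test set. With several independent variables there is no single x-axis to plot against, so comparing predictions with actual values is a simple alternative. The data is randomly generated stand-in data and the column names are assumptions.

```r
# Stand-in data (ASSUMPTION): made-up numbers, not the real dataset.
set.seed(123)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  State           = factor(sample(1:3, n, replace = TRUE))  # assumed encoding
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend +
  0.03 * dataset$Marketing.Spend + rnorm(n, sd = 9000)
training_set = dataset[1:40, ]
test_set     = dataset[41:50, ]

regressor = lm(formula = Profit ~ ., data = training_set)
y_pred    = predict(regressor, newdata = test_set)

# Actual vs predicted profit; points close to the dashed line are good predictions
plot(test_set$Profit, y_pred,
     main = 'Profit: actual vs predicted (test set)',
     xlab = 'Actual profit', ylab = 'Predicted profit',
     pch = 19, col = 'blue')
abline(a = 0, b = 1, lty = 2, col = 'red')  # perfect-prediction reference line
```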

To get an in-depth tutorial on how to visualize your models, go here.

That’s it for now. I hope you found this article to be useful. Also check out more hands-on tutorials on Lituptech. I’ll see you in the next one.

To buy the whole Machine Learning A-Z course on Udemy, Click Here.

