Hello and welcome to this tutorial. We are going to learn how to implement a Multiple Linear Regression model in R. This is a bit more complex than **Simple Linear Regression**, but it’s going to be practical and fun.

Multiple Linear Regression is a data science technique that uses several explanatory variables to predict the outcome of a response variable. A multiple linear regression model attempts to model the relationship between two or more explanatory variables (independent variables) and a response variable (dependent variable) by fitting a linear equation to observed data. Every observation of the independent variables *x* is associated with a value of the dependent variable *y*.

We’ll understand this better by using a very practical example.

###### Getting the Dataset

I have prepared the dataset that we are going to be using in this tutorial. However, feel free to use any dataset that you may have, and see if you’ll get similar results.

Create a folder and give it a name like “*Multiple Linear Regression*”. Move the downloaded zip file into the folder you created and extract the contents of the zip file into that folder.

We created this template in the first part of this Machine Learning course. Therefore, if you have been following this course from the beginning, you will have done this already.

Previous Tutorials

Installing R and RStudio || Importing the Dataset || Taking Care of Missing Data || Encoding Categorical Data

Open RStudio and set the Working Directory. To set the Working Directory in RStudio, just go to Session on the navigation bar, select “Set Working Directory”, and then “Choose Directory”.

Navigate to the folder you created above (the one with the downloaded files), and click open. You should see in the console section, that the working directory has been set.

In the Files pane, open the simple_linear_regression.R file, select all the lines of code in that file and press Ctrl + Enter on the keyboard to execute the code.

Notice that inside our template code, the line “install.packages('caTools')” is commented out. This line of code installs a library that we need to split our dataset into a training set and a test set. To see why this is important, read “Splitting the dataset into the Training Set and the Test Set”.

If you already have “caTools” installed, leave this line as a comment. Otherwise, uncomment it by removing the “#” before it, then select the line of code and execute it. After the package has been installed, it’ll appear in the bottom-right pane of RStudio’s interface. Once it has been installed, comment out that line again.

You should now see the dataset in the “Environment” pane. Just click on it and it’ll appear in our main window.

**Dataset + Business Problem description**

Open the 50_Startups.csv file that we downloaded earlier in this tutorial.

Our **dataset** contains data about 50 startups: how much each startup spent (on Research and Development, Administration and Marketing), the state in which it operates, and the profit it made. Our challenge is to check whether there’s any correlation between the independent variables and the profit. We also want to build a model that helps a Venture Capital fund understand how knowing the independent variables (R&D Spend, Administration, Marketing and Location) would help them predict the dependent variable (Profit). Beyond that, we want to help the investors see which independent variable has the greatest effect on the profit, and what governs the relationship between the profit and those independent variables.

Just a heads-up before we dive into this section: there’s a caveat around building regression models. Linear regressions have assumptions.

**Assumptions of Linear Regression**

We won’t focus on the assumptions in this section. However, before you build a linear regression model, always do your research and make sure that these assumptions hold. Only after you have verified the assumptions should you go ahead and follow the steps that I’ll show you in this tutorial.

Unlike the Simple Linear Regression model, where we were dealing with one independent variable and one dependent variable, a multiple linear regression model has more than one independent variable. For this reason, we may have to remove some variables to make our model more accurate. There are five methods of building a model:

- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison

We’re only going to focus on Backward Elimination in this tutorial because it is the fastest one, and you’re still going to learn how to build a model step-by-step. Without further ado, let’s begin.

**Building a Multiple Linear Regression Model in R (Step-by-Step)**

**Data Preprocessing**

In our template, we imported the dataset, encoded the categorical variables and split the data into the training set and the test set. We are going to be building on the code in that template.
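The preprocessing steps described above can be sketched as follows. This is a minimal stand-in, not the actual template: a tiny synthetic data frame shaped like 50_Startups.csv replaces the real import so the snippet runs on its own, and the split uses base R’s sample() instead of caTools::sample.split to avoid the external dependency.

```r
# Synthetic stand-in for 50_Startups.csv (the real file has 50 rows)
set.seed(123)
dataset <- data.frame(
  R.D.Spend       = runif(20, 0, 170000),
  Administration  = runif(20, 50000, 200000),
  Marketing.Spend = runif(20, 0, 470000),
  State           = sample(c('New York', 'California', 'Florida'), 20, replace = TRUE),
  Profit          = runif(20, 14000, 200000)
)

# Encode the categorical variable as a factor, as in the template
dataset$State <- factor(dataset$State,
                        levels = c('New York', 'California', 'Florida'),
                        labels = c(1, 2, 3))

# 80/20 split; the template does this with caTools::sample.split instead
split_idx <- sample(seq_len(nrow(dataset)), size = 0.8 * nrow(dataset))
training_set <- dataset[split_idx, ]
test_set     <- dataset[-split_idx, ]
```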

**Fitting the Multiple Linear Regression Model to our training set.**

First, we create the Multiple Linear Regressor and call it “regressor”. Next, we introduce the lm function, “lm()”, which takes two arguments: the formula and the data. The formula is going to be **formula = Profit ~ .** , where the “**.**” represents all the independent variables. The second argument is the training set: **data = training_set**.
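As a sketch, the fitting step looks like this. A synthetic training set stands in for the real one so the snippet is self-contained; the column names are assumed to match the imported CSV.

```r
# Synthetic training set shaped like the 50_Startups data
set.seed(1)
training_set <- data.frame(
  R.D.Spend       = runif(40, 0, 170000),
  Administration  = runif(40, 50000, 200000),
  Marketing.Spend = runif(40, 0, 470000),
  State           = factor(rep(c('1', '2', '3'), length.out = 40)),
  Profit          = runif(40, 14000, 200000)
)

# 'Profit ~ .' regresses Profit on every other column in the data
regressor <- lm(formula = Profit ~ ., data = training_set)
summary(regressor)   # coefficient table with p-values and significance stars
```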

Press “*Ctrl + Enter*”. To see our regressor, go to the console and type “summary(regressor)”.

If you take a look at our regressor, we see that some independent variables have a stronger effect than others on the dependent variable. We’re able to see this by looking at the “p-value” column and the significance codes next to it. That’s why we need to do backward elimination: to keep only the most significant independent variables and get a more accurate model.

All we have to do now is predict the test results. We just need one line for this: **y_pred = predict(regressor, newdata = test_set)**.
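A hedged sketch of the prediction step, again on synthetic data so it runs stand-alone:

```r
# Synthetic data where Profit mostly follows R.D.Spend
set.seed(2)
dataset <- data.frame(R.D.Spend       = runif(50, 0, 170000),
                      Marketing.Spend = runif(50, 0, 470000))
dataset$Profit <- 49000 + 0.85 * dataset$R.D.Spend + rnorm(50, sd = 9000)

training_set <- dataset[1:40, ]
test_set     <- dataset[41:50, ]

regressor <- lm(formula = Profit ~ ., data = training_set)
y_pred <- predict(regressor, newdata = test_set)

# Put predictions next to the real observations for comparison
cbind(predicted = y_pred, actual = test_set$Profit)
```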

Predicted results. Compare them to the actual observations

Press “*Ctrl + Enter*”. In the console section in RStudio, type **y_pred** and press Enter. You’ll see that our model’s predictions are not too far from the real observations, which shows that this is not a bad model.

The most important thing to understand here is that the lower the p-value, the more statistically significant an independent variable is, and the greater its impact on the dependent variable.

When we look at our regressor, we can see that only one variable has a high significance level: R&D Spend (shown by the three stars). This means that we could actually turn this into a simple linear regression and express the profit as a linear function of the R&D Spend only. However, it is better to remove the less significant variables one by one to get an even more accurate model.
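To illustrate the point, here is a sketch on synthetic data in which Profit is driven mainly by R&D Spend (the coefficients below are made up, not taken from the real dataset):

```r
set.seed(3)
dataset <- data.frame(R.D.Spend      = runif(50, 0, 170000),
                      Administration = runif(50, 50000, 200000))
# Synthetic relationship: only R.D.Spend actually drives Profit
dataset$Profit <- 49000 + 0.85 * dataset$R.D.Spend + rnorm(50, sd = 9000)

# The full multiple regression vs. the simple one suggested above
full_model   <- lm(Profit ~ ., data = dataset)
simple_model <- lm(Profit ~ R.D.Spend, data = dataset)
summary(simple_model)$coefficients  # R.D.Spend comes out highly significant
```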

**Backward Elimination in Multiple Linear Regression**

We’re going to use the same regressor we used before, but we’re going to change a few things. First, we need to write out each independent variable in our formula, because backward elimination involves removing them one by one. The second thing we’re going to change is the data, from training_set to dataset.

We’re also going to add **summary(regressor)**, so that we can always see the summary of our regressor.

Generally, the best threshold to use is the 5 percent threshold: if the p-value is **lower than 5 percent (0.05)**, the independent variable is **highly statistically significant**. Conversely, the further the p-value is above 5 percent, the less statistically significant the variable is. We’ll remove the variables whose p-values are higher than 5 percent one by one.
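The removal loop can be sketched like this. It is a simplified, hedged version: the synthetic data has no categorical variable (dummy levels such as State2/State3 would need extra handling), and by construction only R.D.Spend truly drives Profit.

```r
# Synthetic data: only R.D.Spend has a real effect on Profit
set.seed(4)
dataset <- data.frame(R.D.Spend       = runif(50, 0, 170000),
                      Administration  = runif(50, 50000, 200000),
                      Marketing.Spend = runif(50, 0, 470000))
dataset$Profit <- 49000 + 0.85 * dataset$R.D.Spend + rnorm(50, sd = 9000)

# Backward elimination at the 5% significance level
predictors <- setdiff(names(dataset), "Profit")
repeat {
  regressor <- lm(reformulate(predictors, response = "Profit"), data = dataset)
  coefs <- summary(regressor)$coefficients[-1, , drop = FALSE]   # drop intercept row
  pvals <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
  if (max(pvals) <= 0.05 || length(predictors) == 1) break       # all significant
  predictors <- setdiff(predictors, names(which.max(pvals)))     # drop the worst
}
predictors   # the variables that survive backward elimination
```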

That is going to be your homework. You can check out the solution in the picture below if you get stuck.

**Visualizing the Results**

If you want to visualize the prediction model that we created above, just copy the code below.
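A minimal way to visualize the fit, sketched in base R on synthetic data (ggplot2 is avoided here to keep the snippet dependency-free): plot predicted against actual profit. A perfect model would put every point on the dashed 45-degree line.

```r
set.seed(5)
dataset <- data.frame(R.D.Spend = runif(50, 0, 170000))
dataset$Profit <- 49000 + 0.85 * dataset$R.D.Spend + rnorm(50, sd = 9000)

regressor <- lm(Profit ~ R.D.Spend, data = dataset)
y_pred <- predict(regressor, newdata = dataset)

png("predicted_vs_actual.png")        # write the plot to a file
plot(dataset$Profit, y_pred,
     xlab = "Actual Profit", ylab = "Predicted Profit",
     main = "Predicted vs. Actual")
abline(a = 0, b = 1, lty = 2)         # 45-degree reference line
dev.off()
```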

To get an in-depth tutorial on how to visualize your models, go here.

That’s it for now. I hope you found this article useful. Also check out more hands-on **tutorials on Lituptech**. I’ll see you in the next one.

__To buy the whole Machine Learning A-Z course on Udemy, Click Here.__